What Is Chaos Engineering and Why Should I Care?

Written by Michael Herrera | Aug 16, 2018 9:05:54 AM

Are you familiar with the term “chaos engineering”? If this is the first time you’ve heard it, it probably won’t be the last.

Chaos engineering (CE) is a new approach to resiliency testing that might end up having a big impact on how we business continuity professionals carry out our work of ensuring the recoverability of our organizations’ business processes and IT environments.

In today’s post, I’ll give you a quick introduction to the movement and methodology of chaos engineering.

Future posts will look at the potential impacts of CE on business continuity and IT/Disaster Recovery (IT/DR).

BUILDING RESILIENCY

The discipline of chaos engineering can be summed up in six words: break stuff and see what happens.

Chaos engineering aims to increase the resiliency of complex computing and software systems. It can also potentially be used to strengthen other types of systems.

It emerged from the recognition that our growing dependence on our computing and network environments—together with their increasing complexity and the increasingly high costs associated with interruptions to those systems—called for greater system resiliency and hence a more rigorous approach to system testing and design.

THROWING WRENCHES

The main idea of chaos engineering is that by throwing various types of wrenches into the production environment and seeing how the system responds, you can learn accurately where your vulnerabilities are. You can then shore them up, increasing the resiliency of the system.

The main danger, obviously, is that in throwing wrenches into your production environment you will harm your production environment, causing unpredictable and potentially serious problems where it counts.

This is why it is said that chaos engineering is easy to understand but hard to do.

BIRTH OF THE CHAOS MONKEY

Chaos engineering originated at Netflix in 2011 with the creation of a software tool called a Chaos Monkey. Chaos Monkeys were designed to be released into the company’s systems where they would behave in a manner similar to that of a wild, armed monkey turned loose in a data center or cloud environment.

The Monkey would cause random damage, and the system would then attempt to contain, mitigate, and work around that damage.
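
For readers who like to see an idea in code, here is a minimal Python sketch of that loop: something is knocked out at random, then the system recovers. It is a toy simulation, not Netflix's actual tool. The names `release_monkey` and `heal`, the server names, and the pool size are all made up for illustration.

```python
import random

def release_monkey(instances, rng):
    """Terminate one randomly chosen instance and return its name."""
    victim = rng.choice(sorted(instances))
    instances.remove(victim)
    return victim

def heal(instances, desired_count, next_id):
    """Stand-in for the system's recovery: launch replacements
    until the pool is back at its desired size."""
    while len(instances) < desired_count:
        instances.add(f"server-{next_id}")
        next_id += 1
    return next_id

rng = random.Random(2011)                  # fixed seed so the run is repeatable
pool = {f"server-{i}" for i in range(4)}   # hypothetical four-server pool

victim = release_monkey(pool, rng)
survivors = set(pool)                      # what is left right after the damage
heal(pool, desired_count=4, next_id=4)

print(f"terminated {victim}; {len(survivors)} survived; "
      f"pool recovered to {len(pool)} instances")
```

In a real environment the "heal" step is whatever automatic recovery the system already has. The point of the exercise is to find out whether it actually works.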

The purpose of turning these virtual wrecking balls loose in their systems was to identify weaknesses and strengthen resiliency. The ultimate goal was to minimize the impact of the inevitable software and hardware failures on the end-user video streaming and viewing experience.

Chaos Monkeys were so effective in helping the company probe and strengthen system resiliency that over time it developed a whole suite of similar tools, dubbed the Simian Army. The suite includes the Chaos Gorilla, Doctor Monkey, Security Monkey, and other tools.

In recent years, the concept of chaos engineering has spread from Netflix to other tech companies like Google and Amazon. It now seems poised to gain a foothold in non-technology firms.

HOW IT WORKS

The chaos engineering community is built on a handful of core concepts set forth on the Principles of Chaos website, which was initiated by Netflix.

As the site says, chaos engineering experiments are intended to “uncover systemic weaknesses” and follow four steps:

Start by defining “steady state” as some measurable output of a system that indicates normal behavior.
Hypothesize that this steady state will continue in both the control group and the experimental group.
Introduce variables that reflect real-world events like servers that crash, hard drives that malfunction, network connections that are severed, etc.
Try to disprove the hypothesis by looking for a difference in steady state between the control group and the experimental group.

If the steady state is hard to disrupt, then great. That’s grounds for having confidence in the system. When your experiments uncover weaknesses, put fixing them on your to-do list, so you can correct the problem before it flares up in the larger system.
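
The four steps above can be sketched as a small Python simulation. This is a toy model under stated assumptions, not a real testing tool: the ten-server pool, the "fraction of requests served" metric, the one percent tolerance, and the function names `success_rate` and `run_experiment` are all invented for illustration.

```python
import random

def success_rate(servers, retries, rng, requests=10_000):
    """Step 1: the steady-state metric is the fraction of requests served.

    Toy model: each request goes to a random server; if that server is
    down, the request is retried on a different random server, up to
    `retries` times.
    """
    names = list(servers)
    served = 0
    for _ in range(requests):
        tried = []
        for _attempt in range(retries + 1):
            candidates = [n for n in names if n not in tried]
            target = rng.choice(candidates)
            tried.append(target)
            if servers[target]:        # True means the server is up
                served += 1
                break
    return served / requests

def run_experiment(retries, tolerance=0.01, seed=42):
    rng = random.Random(seed)
    control = {f"server-{i}": True for i in range(10)}
    experimental = dict(control)
    experimental["server-3"] = False   # Step 3: inject a crashed server

    baseline = success_rate(control, retries, rng)
    observed = success_rate(experimental, retries, rng)

    # Step 2 is the hypothesis: the experimental group stays within
    # `tolerance` of the control group. Step 4: try to disprove it.
    held = (baseline - observed) <= tolerance
    return baseline, observed, held

for retries in (1, 0):
    baseline, observed, held = run_experiment(retries)
    verdict = "held" if held else "disproved"
    print(f"retries={retries}: control={baseline:.3f} "
          f"experimental={observed:.3f} hypothesis {verdict}")
```

With one retry, the simulated system absorbs the crashed server and the hypothesis holds. With no retries, roughly one request in ten is lost and the hypothesis is disproved, which is exactly the kind of weakness an experiment is meant to surface.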

MINIMIZING THE “BLAST RADIUS”

The Principles of Chaos website also sets forth a number of “Advanced Principles” for doing chaos engineering. These include:

Build a Hypothesis around Steady State Behavior. By focusing on systemic behavior patterns during experiments, Chaos verifies that the system does work, rather than trying to validate how it works.
Vary Real-world Events. Prioritize events either by potential impact or estimated frequency. Consider events that correspond to hardware failures like servers dying, software failures like malformed responses, and non-failure events like a spike in traffic or a scaling event. Any event capable of disrupting steady state is a potential variable in a Chaos experiment.
Run Experiments in Production. To guarantee the authenticity of the methods used when you exercise the system and their relevance to the currently deployed system, Chaos strongly prefers to experiment directly on production traffic.
Automate Experiments to Run Continuously. Running experiments manually is labor-intensive and ultimately unsustainable. Automate experiments and run them continuously. Chaos Engineering builds automation into the system to drive both orchestration and analysis.
Minimize Blast Radius. Experimenting in production has the potential to cause unnecessary customer pain. While there must be an allowance for some short-term negative impact, it is the responsibility and obligation of the Chaos Engineer to ensure the fallout from experiments is minimized and contained.

According to the Principles of Chaos website, there is a strong correlation between how rigorously the above principles are followed and the confidence that can be placed in the system.
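
To make the "blast radius" idea concrete, here is a minimal Python sketch of one way an experiment might be contained: expose only a small sample of users to the fault, and stop as soon as errors reach a preset limit. Everything here is a hypothetical illustration. The function name `run_contained_experiment`, the one percent sample, and the error limit are assumptions, not something prescribed by the Principles of Chaos site.

```python
import random

def run_contained_experiment(users, percent, max_errors, fault_error_rate, rng):
    """Expose only a sample of users to the fault and abort early
    if the number of errors reaches `max_errors`."""
    sample_size = max(1, len(users) * percent // 100)
    exposed = rng.sample(users, sample_size)

    errors = 0
    affected = 0
    for _user in exposed:
        affected += 1
        # Simulated outcome: the injected fault causes an error for this
        # user with probability `fault_error_rate`.
        if rng.random() < fault_error_rate:
            errors += 1
        if errors >= max_errors:
            # Abort switch: stop before more users are touched.
            return {"aborted": True, "affected": affected, "errors": errors}
    return {"aborted": False, "affected": affected, "errors": errors}

rng = random.Random(7)
users = [f"user-{i}" for i in range(1000)]

# A harmless fault: the experiment runs to completion on 1% of users.
print(run_contained_experiment(users, percent=1, max_errors=3,
                               fault_error_rate=0.0, rng=rng))

# A severe fault: the experiment is cut off after three errors.
print(run_contained_experiment(users, percent=1, max_errors=3,
                               fault_error_rate=1.0, rng=rng))
```

In both cases, at most 10 of the 1,000 simulated users are ever exposed, and in the severe case only 3 are. Those two limits, the sample size and the abort threshold, are the kind of controls that keep the fallout contained.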

CHAOS ENGINEERING AND BUSINESS CONTINUITY

There are obvious parallels between the work of these chaos engineers and the work we do as business continuity and IT/DR professionals. It is likely our field can benefit from the approaches they have pioneered.

In future posts, I’ll look at how the principles and practice of chaos engineering are likely to impact and improve the practice of BC and IT/DR in non-tech organizations.