Are you familiar with the term “chaos engineering”? If this is the first time you’ve heard it, it probably won’t be the last.
Chaos engineering (CE) is a new approach to resiliency testing that might end up having a big impact on how we business continuity professionals carry out our work of ensuring the recoverability of our organizations’ business processes and IT environments.
In today’s post, I’ll give you a quick introduction to the movement and methodology of chaos engineering.
Future posts will look at the potential impacts of CE on business continuity and IT/Disaster Recovery (IT/DR).
The discipline of chaos engineering can be summed up in six words: break stuff and see what happens.
Chaos engineering aims to increase the resiliency of complex computing and software systems. It can potentially be used to strengthen other types of systems as well.
It emerged from the recognition that our growing dependence on our computing and network environments—together with their increasing complexity and the increasingly high costs associated with interruptions to those systems—called for greater system resiliency and hence a more rigorous approach to system testing and design.
The main idea of chaos engineering is that by throwing various types of wrenches into the production environment and seeing how the system responds, you can learn accurately where your vulnerabilities are. You can then shore them up, removing those vulnerabilities and increasing the resiliency of the system.
The main danger, obviously, is that in throwing wrenches into your production environment you will harm your production environment, causing unpredictable and potentially serious problems where it counts.
This is why it is said that chaos engineering is easy to understand but hard to do.
Chaos engineering originated at Netflix in 2011 with the creation of a software tool called Chaos Monkey. Chaos Monkeys were designed to be released into the company’s systems, where they would behave like a wild, armed monkey turned loose in a data center or cloud environment.
The Monkey would cause random damage, and the system would then attempt to contain, mitigate, and work around that damage.
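To make the idea concrete, here is a minimal Python sketch of the pattern: a toy "monkey" knocks out one randomly chosen instance, and the system keeps serving requests from the survivors. This is a simulation for illustration only, not Netflix's actual tool. The `Cluster` class, the `chaos_monkey` function, and the instance names are all hypothetical.

```python
import random


class Cluster:
    """Toy stand-in for a pool of redundant service instances (hypothetical)."""

    def __init__(self, names):
        # Every instance starts out healthy.
        self.healthy = {name: True for name in names}

    def terminate(self, name):
        """Simulate the failure of a single instance."""
        self.healthy[name] = False

    def handle_request(self):
        """Route the request to any healthy instance; fail if none remain."""
        alive = [name for name, ok in self.healthy.items() if ok]
        if not alive:
            raise RuntimeError("no healthy instances left")
        return alive[0]


def chaos_monkey(cluster, rng):
    """Terminate one randomly chosen healthy instance and return its name."""
    alive = [name for name, ok in cluster.healthy.items() if ok]
    victim = rng.choice(alive)
    cluster.terminate(victim)
    return victim


cluster = Cluster(["web-1", "web-2", "web-3"])
rng = random.Random(42)  # fixed seed so the run is repeatable
victim = chaos_monkey(cluster, rng)
served_by = cluster.handle_request()  # the system works around the damage
print(f"terminated {victim}; request served by {served_by}")
```

In a real environment the "damage" would be an actual terminated server or severed connection, and the interesting question is whether the request still gets served.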
The purpose of turning these virtual wrecking balls loose in the company’s systems was to identify weaknesses and strengthen resiliency. The ultimate goal was to minimize the impact of the inevitable software and hardware failures on the end-user video streaming and viewing experience.
Chaos Monkeys were so effective in helping the company probe and strengthen system resiliency that over time it developed a whole suite of similar tools, dubbed the Simian Army. The suite includes the Chaos Gorilla, Doctor Monkey, Security Monkey, and other tools.
In recent years, the concept of chaos engineering has spread from Netflix to other tech companies like Google and Amazon. It now seems poised to gain a foothold in non-technology firms.
The chaos engineering community is built on a handful of core concepts set forth on the Principles of Chaos website, which was initiated by Netflix.
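As the site says, chaos engineering experiments are intended to “uncover systemic weaknesses” and follow four steps: (1) define the system’s “steady state” as a measurable output that indicates normal behavior; (2) hypothesize that this steady state will continue in both a control group and an experimental group; (3) introduce variables that reflect real-world events, such as servers that crash, hard drives that malfunction, or network connections that are severed; and (4) try to disprove the hypothesis by looking for a difference in steady state between the control group and the experimental group.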
If the steady state is hard to disrupt, then great. That’s grounds for having confidence in the system. When your experiments uncover weaknesses, put fixing them on your to-do list, so you can correct the problem before it flares up in the larger system.
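A chaos experiment of this kind can be sketched in a few lines of Python. The sketch below is a toy simulation under stated assumptions, not a real chaos engineering tool: the steady state is the fraction of requests that succeed, the injected failure is one crashed instance out of four, and the names `success_rate`, `run_experiment`, `retries`, and `tolerance` are hypothetical. It measures a baseline in a control group, injects the failure into an experimental group, and then checks whether the steady state held.

```python
import random


def success_rate(instances_up, requests, retries, rng):
    """Fraction of requests that succeed when each attempt hits a random instance."""
    succeeded = 0
    for _ in range(requests):
        # A request succeeds if any of its attempts lands on a live instance.
        if any(rng.choice(instances_up) for _ in range(retries + 1)):
            succeeded += 1
    return succeeded / requests


def run_experiment(retries, tolerance=0.01, seed=0):
    """Return (baseline, observed, hypothesis_held) for one toy experiment."""
    rng = random.Random(seed)  # fixed seed so the run is repeatable
    control = [True, True, True, True]        # all four instances up
    experimental = [True, True, True, False]  # injected failure: one instance down

    # Measure the steady state in the control group.
    baseline = success_rate(control, 10_000, retries, rng)
    # Measure the same metric in the group with the injected failure.
    observed = success_rate(experimental, 10_000, retries, rng)

    # Hypothesis: the steady state survives the failure, within the tolerance.
    hypothesis_held = abs(baseline - observed) <= tolerance
    return baseline, observed, hypothesis_held


for retries in (0, 5):
    baseline, observed, held = run_experiment(retries)
    verdict = "steady state held" if held else "weakness found"
    print(f"retries={retries}: baseline={baseline:.3f} observed={observed:.3f} -> {verdict}")
```

With no retries, roughly a quarter of the requests in the experimental group fail, so the experiment exposes a weakness. With several retries, the system absorbs the lost instance and the steady state holds, which is the kind of result that builds confidence.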
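The Principles of Chaos website also sets forth a number of “Advanced Principles” for doing chaos engineering. These include building a hypothesis around steady-state behavior, varying real-world events, running experiments in production, automating experiments to run continuously, and minimizing the blast radius.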
According to the Principles of Chaos website, there is a strong correlation between how rigorously the above principles are followed and the confidence that can be placed in the system.
There are obvious parallels between the work of these chaos engineers and the work we do as business continuity and IT/DR professionals. It is likely our field can benefit from the approaches they have pioneered.
In future posts, I’ll look at how the principles and practice of chaos engineering are likely to impact and improve the practice of BC and IT/DR in non-tech organizations.
Another key resource on chaos engineering is the ebook Chaos Engineering: Building Confidence in System Behavior through Experiments, which was written by a team of Netflix engineers and is available for free through O’Reilly Media.