Principles Of Chaos


Is your infrastructure ready to withstand all the chaos the internet can throw at you?

I stumbled across a cool little website called principlesofchaos.org. It is dedicated to “Chaos Testing” which, in case you didn’t know, is a technique for testing the resilience of a system by deliberately breaking parts of it. In my case, that system is AWS infrastructure. It’s what Cloud War Games is all about.

You should read it, but it breaks down as follows:

Define “Steady State”:

This is what the system should look like when functioning normally.
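One way to make “normal” concrete is to write it down as a couple of measurable thresholds. Here is a minimal sketch of that idea. The names `SteadyState` and `is_steady` and the numbers are made up for illustration; pull real values from your own dashboards.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SteadyState:
    """Hypothetical thresholds that describe 'normal' for a service."""
    max_error_rate: float      # fraction of failed requests, e.g. 0.01 = 1%
    max_p99_latency_ms: float  # ceiling for 99th percentile latency


def is_steady(state: SteadyState, error_rate: float, p99_latency_ms: float) -> bool:
    """Return True when the observed metrics are inside the steady-state bounds."""
    return (error_rate <= state.max_error_rate
            and p99_latency_ms <= state.max_p99_latency_ms)


# Illustrative thresholds only (assumed values, not a recommendation).
normal = SteadyState(max_error_rate=0.01, max_p99_latency_ms=500.0)
```

Calling `is_steady(normal, error_rate=0.002, p99_latency_ms=310.0)` before and during an experiment gives you a simple yes/no answer to “are we still healthy?”.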

Define your “Control Group” vs the “Experiment Group”:

I wouldn’t recommend doing Stage vs Prod, as that is kind of like comparing apples to grapes. You need two similarly provisioned environments with similar load. A good ole A/B prod swap should do the trick.

Introduce Real World Server Fires:

This is the fun part. Delete some ECS tasks or otherwise make them unhealthy. Knock over an entire Availability Zone to see how the system reacts.
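As a rough sketch of the ECS part, the function below stops a few randomly chosen running tasks in a service. It assumes you pass in a client that behaves like `boto3.client("ecs")` and uses its `list_tasks` and `stop_task` calls. The function name, cluster and service names are placeholders, and you should point this at the experiment group only.

```python
import random


def stop_random_tasks(ecs_client, cluster, service, count=1, rng=random):
    """Stop up to `count` randomly chosen running tasks of an ECS service.

    `ecs_client` is assumed to behave like boto3.client("ecs").
    Returns the list of task ARNs that were stopped.
    """
    # Only the first page of results is read here; a service with many
    # tasks would need pagination.
    response = ecs_client.list_tasks(
        cluster=cluster, serviceName=service, desiredStatus="RUNNING"
    )
    task_arns = response.get("taskArns", [])

    # Never ask for more victims than there are running tasks.
    victims = rng.sample(task_arns, min(count, len(task_arns)))
    for arn in victims:
        ecs_client.stop_task(cluster=cluster, task=arn, reason="chaos experiment")
    return victims
```

Taking the client as a parameter keeps the function easy to dry-run against a fake client before you let it loose on real tasks.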

Measure The Differences:

See how the performance differs between the two environments. Was there downtime? Was there a spike in latency?
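To put numbers on that comparison, here is a small sketch that summarizes each group and reports the difference. The helper names and the choice of p95 latency and error rate as the metrics are my own assumptions; use whatever your steady state is defined in terms of.

```python
from statistics import quantiles


def summarize(latencies_ms, errors, requests):
    """Return (p95 latency in ms, error rate) for one group.

    Needs at least two latency samples and a non-zero request count.
    """
    # quantiles(n=20) yields 19 cut points; the last one is the 95th percentile.
    p95 = quantiles(latencies_ms, n=20)[-1]
    return p95, errors / requests


def compare(control, experiment):
    """Return experiment-minus-control deltas for p95 latency and error rate."""
    c_p95, c_err = summarize(**control)
    e_p95, e_err = summarize(**experiment)
    return {
        "p95_delta_ms": e_p95 - c_p95,
        "error_rate_delta": e_err - c_err,
    }
```

Each group is a dict with `latencies_ms`, `errors` and `requests` keys. Positive deltas mean the experiment group got worse than the control group while the fire was burning.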

Question:

How are you chaos testing your infrastructure?