Principles Of Chaos
Is your infrastructure ready to withstand all the chaos the internet can throw at you?
I stumbled across a cool little website called principlesofchaos.org. It is dedicated to “Chaos Testing” which, in case you didn’t know, is a technique for testing the resilience of a system by deliberately breaking parts of it. In my case, that system is AWS infrastructure. It’s what Cloud War Games is all about.
You should read it, but it breaks down as follows:
Define “Steady State”:
This is what the system looks like when functioning normally, expressed as something you can actually measure, such as error rate or latency.
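To make "steady state" concrete, here is a minimal Python sketch that writes it down as a few measurable thresholds. The metric names and the numbers (1% errors, 500 ms p99 latency, 3 healthy tasks) are assumptions for illustration, not values from principlesofchaos.org; pick whatever reflects normal for your own system.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SteadyState:
    """Hypothetical thresholds describing 'normal' for one service."""
    max_error_rate: float       # fraction of failed requests, 0.01 means 1%
    max_p99_latency_ms: float   # ceiling for 99th percentile latency
    min_healthy_tasks: int      # fewest running tasks still considered healthy


def is_steady(state: SteadyState, error_rate: float,
              p99_latency_ms: float, healthy_tasks: int) -> bool:
    """Return True only when every observed metric is within its threshold."""
    return (
        error_rate <= state.max_error_rate
        and p99_latency_ms <= state.max_p99_latency_ms
        and healthy_tasks >= state.min_healthy_tasks
    )


# Example thresholds (assumed values, tune for your own workload).
baseline = SteadyState(max_error_rate=0.01,
                       max_p99_latency_ms=500.0,
                       min_healthy_tasks=3)
```

Calling `is_steady(baseline, error_rate, p99_latency_ms, healthy_tasks)` before and during an experiment gives you a simple yes/no answer to "is the system still behaving normally?".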
Define your “Control Group” vs the “Experiment Group”:
I wouldn’t recommend doing Stage vs Prod, as that is kind of like comparing apples to grapes. You need two similarly provisioned environments with similar load. A good ole A/B prod swap should do the trick.
Introduce Real World Server Fires:
This is the fun part. Delete some ECS tasks or otherwise make them unhealthy. Knock over an entire Availability Zone to see how the system reacts.
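As a rough sketch of the "delete some ECS tasks" idea, the Python function below stops a random handful of running tasks in one service. It takes the ECS client as a parameter and assumes it behaves like a boto3 ECS client, using its `list_tasks` and `stop_task` calls. The function name, the `count` parameter, and the stop reason are made up for this example, and it only looks at the first page of tasks, so treat it as a starting point and not a finished tool.

```python
import random


def stop_random_tasks(ecs_client, cluster, service, count=1, rng=None):
    """Stop up to `count` randomly chosen running tasks in an ECS service.

    `ecs_client` is assumed to be a boto3 ECS client (or anything with
    compatible list_tasks / stop_task methods). Returns the stopped ARNs.
    """
    if rng is None:
        rng = random.Random()

    # Only the first page of results is considered in this sketch.
    response = ecs_client.list_tasks(
        cluster=cluster,
        serviceName=service,
        desiredStatus="RUNNING",
    )
    task_arns = response.get("taskArns", [])

    # Never try to stop more tasks than are actually running.
    victims = rng.sample(task_arns, min(count, len(task_arns)))
    for arn in victims:
        ecs_client.stop_task(cluster=cluster, task=arn,
                             reason="chaos experiment")
    return victims
```

With boto3 installed and credentials configured, a call might look like `stop_random_tasks(boto3.client("ecs"), "my-cluster", "my-service", count=2)`, where the cluster and service names are placeholders. Point it at the experiment group only, and let the service scheduler show you how quickly it replaces what you knocked over.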
Measure The Differences:
See how the performance differs between the two infrastructures. Was there downtime? Was there a spike in latency?
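One simple way to put numbers on those questions is to summarize each group the same way and subtract. The Python sketch below assumes you have already collected request latencies, error counts, and request counts for both groups from whatever monitoring you use; the function names and the choice of error rate and p99 latency as the metrics are illustrative assumptions.

```python
from statistics import quantiles


def summarize(latencies_ms, errors, requests):
    """Reduce raw observations for one group to two comparable metrics."""
    return {
        "error_rate": errors / requests,
        # quantiles(n=100) returns 99 cut points; index 98 is the 99th percentile.
        "p99_latency_ms": quantiles(latencies_ms, n=100)[98],
    }


def compare(control, experiment):
    """Return experiment minus control for every metric.

    Positive numbers mean the experiment group did worse.
    """
    return {key: experiment[key] - control[key] for key in control}
```

Feed the control group and the experiment group through `summarize`, then pass both results to `compare`. A difference near zero suggests the system absorbed the failure; a large positive difference is where you start digging.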
Question:
How are you chaos testing your infrastructure?