AWS us-east-1 - The internet’s Achilles heel


Unless you have been living under a rock, you probably noticed that a good chunk of the internet went out due to a massive AWS DNS issue in us-east-1 on Monday.

They say it only “disrupted” one service, DynamoDB, but it “impacted” a whopping 141 services.

This included IAM, which is required to log in to the AWS console, making it rather difficult to even log in to AWS to see where “it” was hitting the fan. So a lot of people were flying blind.

The initial issue was actually resolved fairly quickly, but the problem rippled outward, causing secondary issues in the form of a massive buildup of queued events for SQS, AWS Batch and more.

This issue shone a light on a few things:

First, despite the internet being a giant decentralized network, an amazing amount of it is served up from AWS. This means a catastrophic failure at AWS can break an amazing number of services that people depend on to go about their daily lives.

Second, seemingly small issues can snowball into much larger problems due to the complex, intertwined nature of these systems.
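To get a feel for why a queue backlog keeps hurting long after the root cause is fixed, here is a rough back-of-the-envelope sketch. The function name and every number in it are made up for illustration; they are not figures from the actual outage.

```python
def drain_time_seconds(backlog: int, consume_rate: float, arrival_rate: float) -> float:
    """Estimate how long it takes to clear a queue backlog.

    backlog      -- messages already waiting in the queue
    consume_rate -- messages per second your workers can process
    arrival_rate -- messages per second still arriving while you drain
    """
    if consume_rate <= arrival_rate:
        # Workers cannot keep up with new traffic, so the backlog never clears.
        return float("inf")
    return backlog / (consume_rate - arrival_rate)


# Hypothetical numbers: 1,000,000 queued messages, workers handle 500/s,
# normal traffic keeps arriving at 400/s.
seconds = drain_time_seconds(1_000_000, consume_rate=500, arrival_rate=400)
print(f"{seconds / 3600:.1f} hours to drain")
```

With only a little spare capacity over normal traffic, a backlog built up in a short window can take hours to clear, which is the snowball effect in a nutshell.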

How can you avoid disruption in the future?

You could go multi-region. That is not just multiple availability zones, but full-on multi-region. This means your application layer, data layer, queues and everything else are hosted across multiple regions: some of it in Ohio, some in Virginia, some in California, heck, even in Canada or across the planet.

My clients that hosted in Ohio didn’t even notice there was an outage until I reached out to them.
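As a very minimal sketch of the multi-region idea, the snippet below picks the first healthy region from a preference list. The region codes are the real AWS codes for Virginia, Ohio, Northern California and Canada, but the ordering, the pick_region helper and the health check are all assumptions for illustration. In real life this decision usually lives in DNS or a load balancer, and your data has to already be replicated to the region you fail over to.

```python
from typing import Callable, Iterable, Optional

# Assumed preference order: Virginia, Ohio, N. California, Canada (Central).
REGION_PREFERENCE = ["us-east-1", "us-east-2", "us-west-1", "ca-central-1"]


def pick_region(regions: Iterable[str], is_healthy: Callable[[str], bool]) -> Optional[str]:
    """Return the first region whose health check passes, or None if all fail."""
    for region in regions:
        try:
            if is_healthy(region):
                return region
        except Exception:
            # A health check that blows up counts as an unhealthy region.
            continue
    return None


# Simulate Monday: us-east-1 is down, everything else is fine.
def fake_health_check(region: str) -> bool:
    return region != "us-east-1"


print(pick_region(REGION_PREFERENCE, fake_health_check))  # us-east-2
```

The health check is passed in as a function so you can swap the fake one for whatever you actually trust, such as an HTTP ping against your own endpoint in each region.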

If you really wanted to hedge your bets, you could go fully multi-cloud, but that comes with a whole other layer of costs in the form of self-hosting a lot of services that each cloud provider offers as managed services.

For example, RDS is great at saving you an insane amount of engineering hours patching DBs and keeping them up to date.

As of yet, I don’t know of a cross-cloud managed service for this. This means your engineers will have to go back to the old days of applying painful patches, but it's worse because they will have to do it on systems running on multiple cloud providers.

I am not telling you whether to go multi-cloud or not. I am mainly saying make sure you calculate the ROI on that investment before pulling the trigger.
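If you want a starting point for that math, here is a deliberately simple sketch. The formula is plain ROI (what you gain minus what you spend, divided by what you spend), and every number below is a placeholder, not a benchmark. Plug in your own outage history, revenue figures and engineering costs.

```python
def multi_cloud_roi(outage_hours_avoided_per_year: float,
                    loss_per_outage_hour: float,
                    extra_annual_cost: float) -> float:
    """Return ROI as a fraction: (avoided losses - extra cost) / extra cost."""
    avoided_loss = outage_hours_avoided_per_year * loss_per_outage_hour
    return (avoided_loss - extra_annual_cost) / extra_annual_cost


# Placeholder numbers: 4 hours of outage avoided per year, $50,000 lost per
# outage hour, $500,000 per year in extra engineering and hosting costs.
roi = multi_cloud_roi(4, 50_000, 500_000)
print(f"ROI: {roi:.0%}")
```

A negative result means the extra cost outweighs the outages it would save you from, at least on paper. It ignores softer costs like reputation damage, so treat it as a first pass rather than the final answer.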

If you want to dive in deeper, yesterday’s podcast was on this subject.