Lessons from the 10/20/2025 AWS Outage


One of the big secondary issues we saw during last week's outage was a massive build-up in various queues, partly because you couldn't provision the compute to work through them and partly for another reason entirely.

Before Dominic realized half the internet was on fire, he was trying to get his video editing software to render a video for him. Rendering a video is a fairly computationally intensive operation.

Dominic, being an extremely persistent individual, continued to click and click and click, queuing up countless render jobs. Given that the video rendering service was still struggling on Wednesday, a full two days after the outage had ended, I am guessing they did not have a great process for dealing with a massive backlog of jobs in the queue. This is not that uncommon.

What can we learn from this?

First, don't let people queue up the same job over and over. Check whether a matching job is already queued and reject duplicates instead of adding new ones.
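Here is a minimal sketch of that idea. It uses a toy in-memory queue with hypothetical names (a real system would back the dedup check with something shared like Redis or a database), but the principle is the same: key each job by who asked and what they asked for, and refuse to enqueue a second copy while the first is still pending.

```python
import time
from collections import deque


class RenderQueue:
    """Toy in-memory queue that rejects duplicate render jobs."""

    def __init__(self):
        self._jobs = deque()
        self._pending_keys = set()  # keys of jobs queued but not yet finished

    def enqueue(self, user_id: str, video_id: str) -> bool:
        key = (user_id, video_id)
        if key in self._pending_keys:
            return False  # job already queued; ignore the extra click
        self._pending_keys.add(key)
        self._jobs.append({"key": key, "queued_at": time.time()})
        return True

    def complete(self, user_id: str, video_id: str) -> None:
        # Free the key so the user can render this video again later.
        self._pending_keys.discard((user_id, video_id))


queue = RenderQueue()
print(queue.enqueue("dominic", "vacation.mp4"))  # True  -> queued
print(queue.enqueue("dominic", "vacation.mp4"))  # False -> duplicate ignored
```

Ten clicks from the same impatient user now produce one render job instead of ten.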

Second, have a solid process for clearing out your queues, and make sure you know what can be cleared and what can't.

A transaction for a sale absolutely needs to go through or the product won’t get shipped.

A render request for a video from two days earlier can likely be cleared. The user will click render the next time they log in.
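One way to make that call ahead of time is to tag each job kind with a drop policy, so that after an outage you can purge the backlog mechanically instead of debating it job by job. The sketch below assumes a hypothetical policy table and job shape; the specifics will differ in your stack, but the shape of the decision is the point.

```python
import time
from dataclasses import dataclass, field


@dataclass
class Job:
    kind: str        # e.g. "sale_transaction" or "video_render"
    payload: dict
    queued_at: float = field(default_factory=time.time)


# Hypothetical policy: which job kinds may be dropped during backlog
# recovery, and how stale they must be before we drop them (seconds).
PURGE_POLICY = {
    "sale_transaction": None,      # never drop: money and shipping depend on it
    "video_render": 6 * 60 * 60,   # droppable once older than six hours
}


def purge_backlog(jobs: list[Job], now: float | None = None) -> list[Job]:
    """Return the jobs worth keeping after an outage.

    Kinds with no TTL in the policy (or unknown kinds) are always kept;
    everything else is dropped once it exceeds its TTL, on the theory
    that the user will simply retry when they come back.
    """
    now = now or time.time()
    kept = []
    for job in jobs:
        ttl = PURGE_POLICY.get(job.kind)
        if ttl is None or (now - job.queued_at) <= ttl:
            kept.append(job)
    return kept
```

The useful part isn't the code, it's deciding the policy before the outage, when nobody is panicking.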

Design your architecture accordingly. If you need help with this feel free to reach out to me.