A Solution To AWS ECS's Software Version Consistency Problems


Did AWS's Software Version Consistency update cause your ECS tasks to blow up?

If so, I may have a solution for you.
In case you missed it, here is my post about AWS ECS's Software Version Consistency Problems.

Problem

We had some long-running background workers in ECS, and whenever a deployment happened, ECS killed those tasks before they could finish their work.

The ideal solution would be to have the tasks save their state so they could pick up where they left off, or to have those background jobs running on AWS Batch.
For reasons I won’t dive into, neither of these was an option for this project.

If this is not your problem, then this might not be the solution for you.

Solution

Working with AWS Premium Support, we were able to fix our issue using Amazon ECS task scale-in protection.

Basically, you can tell ECS that specific ECS tasks (NOT entire ECS services) are "Protected." This means that during an ECS deployment and/or ECS autoscaling event, ECS will not stop these tasks until their protection expires or is turned off.
By default, protection lasts 2 hours, but you can set it to expire after as long as 48 hours.
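
To make that concrete, here is a minimal sketch of turning protection on from a script. It assumes a boto3 ECS client (created with boto3.client("ecs")) is passed in; the helper name protect_tasks and the cluster and task names in the usage comment are made up for illustration, and this is not necessarily how we wired it up in our project.

```python
def protect_tasks(ecs_client, cluster, task_arns, minutes=120):
    """Enable scale-in protection on the given tasks for `minutes` minutes.

    `ecs_client` is assumed to be a boto3 ECS client. The helper name and
    the 120-minute default are illustrative choices, not part of the AWS API.
    """
    # Protection can last at most 48 hours (2880 minutes).
    if not 1 <= minutes <= 2880:
        raise ValueError("minutes must be between 1 and 2880 (48 hours)")

    response = ecs_client.update_task_protection(
        cluster=cluster,
        tasks=task_arns,
        protectionEnabled=True,
        expiresInMinutes=minutes,
    )

    # Per-task problems come back in "failures", so surface them here.
    if response.get("failures"):
        raise RuntimeError(f"could not protect some tasks: {response['failures']}")
    return response["protectedTasks"]


# Usage (hypothetical cluster and task ARN):
#   import boto3
#   protect_tasks(boto3.client("ecs"), "my-cluster", ["<task-arn>"], minutes=180)
```

Passing protectionEnabled=False to the same API call turns protection back off, which is worth doing as soon as the job finishes so deployments are not held up longer than needed.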

It worked as described, but here are some additional things I observed while experimenting with this:

  1. ECS tasks are only protected from ECS deployments and ECS autoscaling.
  2. Even when a task is "protected," I could still stop it manually via the console or the command line.
  3. If the ECS task's main CMD process is terminated, the task itself will be terminated. It doesn’t keep running like a zombie task, which is a good thing.

It’s actually pretty simple. Deciding which ECS tasks should have protection turned on, and applying it to every task, is the difficult part.
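
One way to take the guesswork out of "which tasks" is to let each worker protect itself only while it is actually processing a job. The sketch below assumes the worker runs inside an ECS task where the container agent exposes the ECS_AGENT_URI environment variable and the task role is allowed to call ecs:UpdateTaskProtection. The function names and the run_long_job placeholder are hypothetical, so treat this as a starting point rather than a drop-in implementation.

```python
import json
import os
import urllib.request
from contextlib import contextmanager


def build_protection_request(agent_uri, enabled, expires_in_minutes=None):
    """Build the PUT request for the agent's task-protection endpoint."""
    body = {"ProtectionEnabled": enabled}
    # An expiry only makes sense when protection is being turned on.
    if enabled and expires_in_minutes is not None:
        body["ExpiresInMinutes"] = expires_in_minutes
    return urllib.request.Request(
        url=f"{agent_uri}/task-protection/v1/state",
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="PUT",
    )


def set_protection(enabled, expires_in_minutes=None, send=urllib.request.urlopen):
    """Send the request; `send` is injectable so it can be faked in tests."""
    agent_uri = os.environ["ECS_AGENT_URI"]
    request = build_protection_request(agent_uri, enabled, expires_in_minutes)
    with send(request, timeout=5) as response:
        return json.loads(response.read())


@contextmanager
def scale_in_protected(expires_in_minutes=120, send=urllib.request.urlopen):
    """Protect this task for the duration of the `with` block."""
    set_protection(True, expires_in_minutes, send=send)
    try:
        yield
    finally:
        # Always release protection, even if the job raised an exception.
        set_protection(False, send=send)


# Usage inside a worker (run_long_job is a placeholder for your own code):
#   with scale_in_protected(expires_in_minutes=240):
#       run_long_job()
```

With this pattern, idle workers stay unprotected and can be replaced right away during a deployment, while busy workers hold on until their job is done or the expiry passes.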

Either way, best of luck, and if you need some help with this, I’m happy to assist.

~Cheers!