Red Herrings Before Black Friday


Last week was an interesting one for me. One of my larger clients started seeing elevated latency right before the biggest sales weekend of the year.

Day 1: On Monday I reached out to see if they still wanted to meet for our normal office hours on Wednesday, the day before Thanksgiving. I check because clients often take the day before a big holiday off, or at least cut out early. They responded that we 100% absolutely needed to meet on Wednesday because of the latency issue. I replied, why not meet now? I would rather not wait until the eleventh hour to debug latency issues.

We got on a call and found that part of our problem was an NLB that had stopped taking traffic in one of our four availability zones. This meant that 25% of the servers we were paying for were not doing anything. I found the exact CloudTrail event that caused the problem. It looked innocent enough: just an update to a target group, but somehow it caused the NLB to stop sending traffic to that zone.

Day 2: The solution turned out to basically be “turn it off and back on again”. In reality, all we needed to do was force a deployment, which updated another target group, and we were back in business.

Day 3: The fix from Day 2 helped with latency across the platform, but latency was still not fully within the bounds where we normally operate. After digging in again, I found an abnormal number of requests going to a small domain that my client operates. To put this in perspective, think of how Pepsi owns a bunch of other brands like Mt Dew, Taco Bell, and KFC. The equivalent of Mr.Pib.com was getting a ton of traffic when it normally gets only about 0.01% of the traffic, and for some reason it was slowing down the rest of the domains that make up 99.99% of normal traffic.
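For anyone who wants to run the same kind of check, here is a minimal sketch of comparing each domain's current share of requests against its normal share. It is not the tooling we actually used. The function names, the example hostnames, and the 10x threshold are all assumptions for illustration, and it assumes you have already pulled a list of request hostnames out of your access logs.

```python
from collections import Counter

def host_shares(hosts):
    """Return each host's fraction of the total request count."""
    counts = Counter(hosts)
    total = sum(counts.values())
    if total == 0:
        return {}
    return {host: n / total for host, n in counts.items()}

def unusual_hosts(hosts, baseline, factor=10.0):
    """Flag hosts whose current share exceeds `factor` times their baseline share.

    `baseline` maps host -> normal share of traffic (hypothetical values).
    A host missing from `baseline` has an expected share of 0, so any
    traffic to it is flagged.
    """
    flagged = []
    for host, share in host_shares(hosts).items():
        expected = baseline.get(host, 0.0)
        if share > expected * factor:
            flagged.append((host, share, expected))
    # Largest current share first.
    return sorted(flagged, key=lambda row: row[1], reverse=True)
```

Fed a sample where the small domain accounts for 10% of requests against a baseline of 0.01%, this flags only the small domain. The big domains stay under the threshold because their share went down, not up.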

At a glance it might have looked like an attack, until one of my client’s top team members pointed out that the WAF was allowing it as “search engine” traffic. It was Google crawling our smallest domain for Black Friday updates. We put in a temporary fix for that and BOOM, we were back in business.

The latency graph was a thing of beauty. Imagine a seismograph during an earthquake, swinging violently back and forth, then majestically flattening once the earthquake has passed. So Black Friday/Cyber Monday was saved and all.