Let's Talk: AWS October Outage and AI Observability
The AWS us-east-1 outage taught me something important: with all their AI
investments, humans still had to save the day. Maybe that’s why Andy Jassy
fired 30,000 people.
Bottom Line Up Front (BLUF)
Observability is technique, not hype. It’s not about buying the latest shiny tool that promises everything. It’s about building the foundations that let problem solvers do their jobs. Invest in technique, not empty promises.
Playing Root Cause AI BINGO
The AWS DynamoDB outage in us-east-1 this October was painful for
teams and companies worldwide. First up, if you haven’t thanked an SRE or
DevOps team member for their hard work—do it. Do it now and come back.
I’ll wait. These folks worked tirelessly to bring our businesses back to life.
When the dust settled, and after we all made the obligatory joke that it’s always DNS (it wasn’t), I had Claude build me a fresh AI BINGO card to read the root cause analysis with. You gotta do something to make these things readable. Surely, a FAANG company would highlight how AI was crucial in figuring out the cause and saving the day.
Yeah, I didn’t win. In fact, the only real reference to anything AI-related is in the list of services affected. Amazon Connect (think cloud-based customer support operations) was on that list, and it leans heavily on AI in its interactions with customers. So, by taking out foundational services, at least we learned how to take out AI?
Lesson 1: DynamoDB is AI’s Achilles Heel.
When I Had a Head-Desk Moment
This was when I knew I’d seen this story before: Race conditions.
“For resiliency, the DNS Enactor operates redundantly and fully independently in three different Availability Zones (AZs).”
AWS is the largest cloud provider on the planet. The scale they operate at is beyond anything else we have; these folks basically invent how to go to the Moon and back every day. What I’m saying is: although the class of error encountered is a traditional computer science problem, the scale of operations makes these problems incredibly difficult to reason about, localize, and correct. There are no other examples to crib from. So surely AI automated the complete problem-solving apparatus here, and they just left that detail out of the document? Some DevOps or SRE agentic approach? No. It was humans and “manual operator intervention” that solved the puzzle and acted on what they found.
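If “race condition” is an abstract term for you, here’s a minimal, hypothetical sketch of the general class of bug the RCA describes: a check-then-act window where a slow worker applies a stale plan over a newer one. The planner/enactor vocabulary mirrors the RCA, but the code is a toy, not AWS’s system.

```python
import threading
import time

# Toy model of "a planner produces versioned plans, several independent
# enactors apply them." Names and behavior are illustrative only; this is
# not AWS's actual DNS Planner/Enactor implementation.

dns_record = {"plan_version": 0}   # shared state the enactors write to

def enactor(name: str, plan_version: int, delay: float) -> None:
    # Check: is my plan newer than what's currently applied?
    is_newer = plan_version > dns_record["plan_version"]
    time.sleep(delay)              # a slow enactor: the check goes stale here
    if is_newer:
        # Act: by now a newer plan may already be live, but we overwrite it
        # anyway, because the check and the write are not atomic.
        dns_record["plan_version"] = plan_version
        print(f"{name} applied plan v{plan_version}")

# Enactor A is slow and holds an old plan; enactor B is fast with a newer one.
a = threading.Thread(target=enactor, args=("enactor-A", 1, 0.5))
b = threading.Thread(target=enactor, args=("enactor-B", 2, 0.0))
a.start(); b.start(); a.join(); b.join()

print("live plan:", dns_record["plan_version"])  # v1: the stale plan clobbered v2
```

The toy version loses the race every time by construction; in production, the timing has to line up just wrong, which is exactly why this class of bug is so hard to reason about and localize at scale.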
Lesson 2: AWS Still Uses Manual Intervention.
What Methods Were Used?
The human-centeredness of the Root Cause Analysis (RCA) really stood out.
- “This situation ultimately required manual operator intervention to correct.”
- “Engineering teams for impacted AWS services were immediately engaged and began to investigate.”
- “…our engineers had identified DynamoDB’s DNS state as the source of the outage.”
- “Since this situation had no established operational recovery procedure, engineers took care…”
- “Engineers worked to reduce the load…to accelerate recovery.”
The timings of the events in the RCA back this up.
- 11:48 PM PDT – DynamoDB Issue Occurred
- 12:38 AM PDT – DynamoDB Issue Identified
- 01:15 AM PDT – DynamoDB Temporary Mitigations
- 02:25 AM PDT – DynamoDB Issue Fully Repaired, EC2 Began Recovery
- 02:40 AM PDT – DynamoDB Solution Fully Deployed
- 04:14 AM PDT – EC2 Temporary Mitigations
- 05:28 AM PDT – EC2 Issue Repaired
- 06:21 AM PDT – EC2 Network Manager Issue Occurred
- 10:36 AM PDT – EC2 Network Manager Repaired
- 11:23 AM PDT – EC2 Throttles Relaxed
- 01:50 PM PDT – EC2 Solution Fully Deployed
- 02:20 PM PDT – All AWS Services Fully Recovered and Deployed
So about 15 hours of torture, if you ask an SRE. Specifically, human torture both inside and outside AWS. Obviously, the DynamoDB outage cascaded into many other services, but DynamoDB’s DNS state was the root cause. Even looking at DynamoDB alone, that’s nearly three hours, with who knows how many engineers, to repair. Wouldn’t AI have spotted the race condition faster? It debugs my 3-file code project in seconds! Where was the cutting-edge technology? How much time did it shave off diagnosing the incident?
Lesson 3: AI Was Likely Hallucinating When We Needed It Most.
We Measure This
In O11y-land, we measure incidents. We observe them.
Mean Time To Detect (MTTD)
This is the time between when an incident actually occurs and when monitoring notices a problem. We don’t know it from the RCA document. All we know officially is that at 11:48 PM PDT on October 19, DynamoDB’s error budget burn was above acceptable limits, which means the incident was already in progress.
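The RCA doesn’t say how that detection works internally, so treat this as a generic illustration of the kind of signal behind “error budget burn was above acceptable limits.” The SLO target, request counts, and the 14.4x fast-burn threshold are stand-in values from common burn-rate alerting practice, not AWS’s configuration.

```python
# Hypothetical SLO burn-rate check. All numbers are invented for illustration.

SLO_TARGET = 0.999                 # 99.9% availability objective
ERROR_BUDGET = 1 - SLO_TARGET      # 0.1% of requests are allowed to fail

def burn_rate(failed: int, total: int) -> float:
    """How fast we're spending error budget: 1.0 means exactly on budget."""
    observed_error_rate = failed / total
    return observed_error_rate / ERROR_BUDGET

# Pretend 5-minute window: 12,000 of 2,000,000 requests failed (0.6% errors).
rate = burn_rate(failed=12_000, total=2_000_000)
print(f"burn rate: {rate:.1f}x")   # 6.0x: burning budget six times too fast

if rate > 14.4:                    # a common fast-burn paging threshold
    print("page someone now")
elif rate > 1.0:
    print("budget is burning faster than the SLO allows")
```

The moment a check like this crosses its threshold is the “detect” timestamp that MTTD is measured from.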
Mean Time to Identify (MTTI)
This is the time between detection and identification of the issue at hand. That happened at 12:38 AM PDT: 50 minutes to identify the root cause as the race condition in the DNS Planner and DNS Enactor.
Mean Time to Repair (MTTR)
This metric tracks how long it took from detection to fully repair the issue. Notably, this doesn’t mean the issue is fixed from the customer’s point of view, only that the repairs are in place. That milestone came at 2:25 AM PDT, so the time to repair is 2h37m.
Mean Time to Recovery (MTTR)—Yes, There Are Like 3 Rs!
This is the time from detection to when customers see a full recovery of the service. In this case, DNS cache timeouts played the biggest role. That’s another 15 minutes on top of the time to repair, for a total of 2h52m.
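If you want to sanity-check my arithmetic, those durations fall straight out of the RCA timeline above. A quick sketch (one incident, so these are single-incident times rather than true means):

```python
from datetime import datetime

# Recomputing the timings from the RCA's DynamoDB timestamps (PDT, October 19-20).

fmt = "%m-%d %I:%M %p"
detected   = datetime.strptime("10-19 11:48 PM", fmt)  # error budget burn alarm
identified = datetime.strptime("10-20 12:38 AM", fmt)  # DNS state identified as the source
repaired   = datetime.strptime("10-20 02:25 AM", fmt)  # DynamoDB issue fully repaired
recovered  = datetime.strptime("10-20 02:40 AM", fmt)  # DynamoDB solution fully deployed

print("time to identify:", identified - detected)  # 0:50:00
print("time to repair:  ", repaired - detected)    # 2:37:00
print("time to recovery:", recovered - detected)   # 2:52:00
```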
By industry standards, those are actually really reasonable numbers for a major outage. We can do the same exercise for the cascading outages, or for the whole shebang. I’m sure the engineers at AWS have, and you can read about some of the mitigations they’ve already enacted to make sure this specific race condition doesn’t recur. However, the next bit of news arrives on October 27th: the bombshell of 30,000 layoffs.
Lesson 4: AWS Used to Have Awesome Engineers.
What Do We Learn From the AWS Outage?
This is the largest cloud provider, a FAANG company. They plan to spend more than $100 BILLION with a B on AI investments. I can only assume they’re eager to show off what AI can do with all this investment to better sell their products to us SREs. But the evidence isn’t here when we needed to see it most. Perhaps there’s something to be said for humans after all.
Amazon follows this outage with the announcement that 30,000 corporate employees are getting the ax. Because artificial intelligence tools. I hope that doesn’t include the engineers who figured out how to recover from these issues.
Before the ink dried on the root cause analysis, AWS had another outage
in us-east-1. October 28th, just nine days later. The outage lasted from 9:00
AM PDT to 10:43 PM PDT: nearly 14 hours of operational pain. It hit the same
services as the October 19-20 outage, EC2 and ECS. Is this a pattern?
As a software engineer, I have one message I want to communicate to the folks at AWS: THIS IS NOT A GOOD LOOK. Because of “AI,” you’ve chosen to abandon your employees while the reliability of your critical and foundational services plummets.
Lesson 5: Reliability is hard. Reliability is expensive. Reliability depends on observability. Observability depends on process and technique. And even at AWS scale, reliability depends on people.
Let’s Wrap Up
Claude, can you proofread my post for grammar, spelling, tone, and flow?