AWS Kinesis Outage
On Wednesday, November 25, 2020, AWS suffered an outage of their Kinesis service in the us-east-1 region. These RCA (Root Cause Analysis) write-ups are always illuminating, not just as a peek inside how AWS services work, but as a chance for the industry as a whole to peer-review best practices. I had a few key takeaways, drawn from my experience in Observability, that I want to point out.
At 5:15 AM PST, the first alarms began firing for errors on putting and getting Kinesis records. Teams engaged and began reviewing logs.
Cheers! This is a great example of a team doing excellent alerting from their telemetry. I've been asked how I would design monitoring and alerting for a team from the ground up, and it's always a tricky question. I start by observing the customer experience: direct observation of incoming requests, their timings, and their success rate, backed by simulated traffic to confirm the results are what a customer would actually see. Those signals form the primary alarms that can wake the on-call person.
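As a minimal sketch of that kind of customer-facing instrumentation (the service name, metric names, and handle_request wrapper are my own illustration, not anything from the Kinesis RCA), here is how a service might record request outcomes with the Python Prometheus client so the paging alarm can be derived from a success ratio:

    # Sketch: instrument the customer experience (request count, latency,
    # success rate) so the wake-someone-up alarm is driven by a success-ratio
    # threshold rather than by individual causes. Names are illustrative.
    import time
    from prometheus_client import Counter, Histogram, start_http_server

    REQUESTS = Counter(
        "myservice_requests_total",
        "Requests handled, labeled by outcome",
        ["outcome"],  # "success" or "error"
    )
    LATENCY = Histogram(
        "myservice_request_duration_seconds",
        "Request latency in seconds",
    )

    def handle_request(do_work):
        """Run one request handler, recording its timing and outcome."""
        start = time.monotonic()
        try:
            result = do_work()
            REQUESTS.labels(outcome="success").inc()
            return result
        except Exception:
            REQUESTS.labels(outcome="error").inc()
            raise
        finally:
            LATENCY.observe(time.monotonic() - start)

    if __name__ == "__main__":
        start_http_server(8000)  # expose /metrics for scraping

The paging alert then fires on the error ratio over a few minutes, and a synthetic canary exercising the same path confirms those numbers from the customer's side.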
I’ve worked with a couple of teams to help them dig out of a bad on-call rotation, which I define as one with 10 or more high-priority pages per week. Usually this is caused by over-alerting on specific causes and a loss of focus on who the customer is. Specific, cause-based alerts should, in most cases, be low-priority tickets that are reviewed and fixed the next business day.
I was also glad to see that AWS’s Kinesis metrics fired alerts, and that those alerts, followed by the log event data, drove the diagnosis. Metric systems are great at recording measurements and distributions and at firing alerts, but it’s the log data that tells you why an API call failed and leads you to the recorded exception. This is a super important pattern that is often misunderstood when creating an Observability plan. An alert by itself cannot and will not always link you to a specific cause, and that’s okay. It should, however, send you to that service’s dashboard, where its health and its logs can be easily found.
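Here is a sketch of that pairing, with hypothetical names (PUT_ERRORS, put_record) that are not from the RCA: the counter drives the alarm and the dashboard, while the log event carries the exception and the request context that explain why the call failed.

    # Sketch of the metrics-alert-then-logs pattern. The counter is what the
    # alert and dashboard see; the log event holds the cause.
    # Names are hypothetical, not taken from the Kinesis RCA.
    import logging
    from prometheus_client import Counter

    logger = logging.getLogger("myservice")
    PUT_ERRORS = Counter("myservice_put_errors_total", "Failed put calls")

    def put_record(backend, record_id, payload):
        try:
            backend.put(record_id, payload)
        except Exception:
            PUT_ERRORS.inc()  # the metric only says "a put failed"
            # The log line says why: exception, stack trace, and context.
            logger.exception("put failed", extra={"record_id": record_id})
            raise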
We are adding fine-grained alarming for thread consumption in the service.
Ah, disappointment. Not even AWS is immune to the demand that “this event must never happen again.” In my experience, that demand is what leads to over-alerting and bad on-call rotations, because this event probably won’t happen again in the same way. Lightning rarely strikes twice in the same place. What the demand does produce is a collection of ill-understood alerts around a plethora of causes, which results in fragility rather than a helpful on-call rotation.
Another note of disappointment comes from how many reactions on the Internet have boiled down to “I can’t believe that AWS doesn’t monitor ulimits and thread counts!” There is a case to be made that the standard instrumentation and dashboards used with every service should include metrics for specific limits, like the standard file descriptor metrics in the Prometheus client libraries:
process_max_fds
process_open_fds
These limits should always be exported with both a current value and a maximum value so that a ratio can easily be generated by the standard dashboards used with all applications. I bet an equivalent pair for thread counts just got added to the Kinesis front-end servers’ metrics!
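As a sketch of what that current/max pairing could look like for threads (the metric names and the /proc parsing are my own Linux-only illustration, not how Kinesis is instrumented):

    # Sketch: export a current/max pair for OS threads the same way the
    # standard process collector exports process_open_fds / process_max_fds,
    # so any dashboard can plot current / max as a saturation ratio.
    # Linux-only; metric names are illustrative.
    import resource
    from prometheus_client import Gauge

    THREADS_CURRENT = Gauge("process_threads_current", "OS threads in use")
    THREADS_MAX = Gauge("process_threads_max", "Soft RLIMIT_NPROC, if set")

    def os_thread_count():
        """Read the OS thread count for this process from /proc."""
        with open("/proc/self/status") as status:
            for line in status:
                if line.startswith("Threads:"):
                    return int(line.split()[1])
        return 0

    THREADS_CURRENT.set_function(os_thread_count)
    soft_limit, _hard = resource.getrlimit(resource.RLIMIT_NPROC)
    if soft_limit != resource.RLIM_INFINITY:
        THREADS_MAX.set(soft_limit)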
However, it’s impossible, by default or by standard practice, to instrument, export, and build dashboards around all the possible permutations here, at least not without adding a lot of cardinality to Prometheus’s standard metrics. We don’t know from the RCA (although there are some hints) how the thread consumption was being limited. Perhaps ulimits, but more likely a cgroup limit. (Do you monitor all your cgroup limits?) It could also have been something more subtle, like stack size limits or any number of tunables in the Linux kernel’s PID, VM, or POSIX thread management. I’m willing to bet this was not just a simple case of the maximum user processes being limited to 1024 with no one watching.
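For completeness, here is a hedged sketch of watching one such limit, the pids controller under cgroup v2 (the paths assume the v2 unified hierarchy mounted at /sys/fs/cgroup; again, the RCA does not say which mechanism actually capped the front-end threads):

    # Sketch: export the cgroup v2 pids limit for this process as a
    # current/max pair. Assumes the unified hierarchy at /sys/fs/cgroup;
    # the actual control on the Kinesis front end is not stated in the RCA.
    from pathlib import Path
    from prometheus_client import Gauge

    CGROUP_PIDS_CURRENT = Gauge("cgroup_pids_current", "Tasks in this cgroup")
    CGROUP_PIDS_MAX = Gauge("cgroup_pids_max", "pids.max for this cgroup, if set")

    def collect_cgroup_pids():
        # Under cgroup v2, /proc/self/cgroup contains a line like "0::/my/group".
        for line in Path("/proc/self/cgroup").read_text().splitlines():
            if line.startswith("0::"):
                group = Path("/sys/fs/cgroup") / line[3:].lstrip("/")
                CGROUP_PIDS_CURRENT.set(int((group / "pids.current").read_text()))
                raw_max = (group / "pids.max").read_text().strip()
                if raw_max != "max":  # "max" means no limit is set
                    CGROUP_PIDS_MAX.set(int(raw_max))
                return

Called from a periodic collection loop, this feeds the same current/max ratio into the standard dashboards.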
So, what should be done when an event like this happens to you? AWS has already checked off building a timeline of the event and running an RCA (or postmortem) to understand what happened and to highlight action items that prevent similar issues in the future. But that shouldn’t end in action items to “alert on this cause.” It did, however, highlight a critical design limitation of the application: each front-end server holds a thread for every other server in the fleet, so thread count grows as n². Thread consumption got onto a dashboard somewhere, and I bet it got included in a calculation of the service’s saturation too.