Let's Talk: AWS October Outage and AI Observability
The AWS us-east-1 outage taught me something important: with all their AI
investments, humans still had to save the day. Maybe that’s why Andy Jassy
fired 30,000 people.
I’d like to announce my new venture, Cardinality Cloud, LLC. I’ve always wanted to take what I’ve done for hyper-growth companies like Palo Alto Networks and Fitbit and make it more available to everyone. Readers of the site will not be surprised that I’m specializing in Site Reliability Engineering (SRE) and Observability consulting with a focus on cost-effective and Open Source setups. I am also launching products aimed at making that even easier. The first one is the Prometheus Alert and SLO Generator! Please take a look and let me know what you think.
It’s been a long time since I’ve done a presentation in front of real live people. I was honored to be selected as a speaker at Monitorama PDX 2023, which was held last week. I put together a talk on some of the challenges that I, and many others, face in the quest to improve Observability. I wanted to touch on some mathematical concepts that underlie the choices I make, and why the Four Golden Signals really matter. I close out this talk with an experiment to do the impossible: to aggregate quantiles together to form larger quantiles. Take a listen to find out what happened.
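To see why aggregating quantiles is considered impossible, here is a minimal Python sketch. It is not taken from the talk; the latency samples, the host names, and the `percentile` helper are all made up for illustration. It compares the average of two per-host 90th percentiles with the 90th percentile of the combined data.

```python
def percentile(values, pct):
    """Nearest-rank percentile: the smallest sample such that at least
    pct percent of all samples are less than or equal to it."""
    ordered = sorted(values)
    # Integer ceiling of pct * n / 100, avoiding floating point rounding.
    rank = max(1, -(-pct * len(ordered) // 100))
    return ordered[rank - 1]


# Hypothetical latency samples in milliseconds (assumed data).
host_a = [10] * 99 + [20]       # busy host: 100 requests, nearly all fast
host_b = [10] * 5 + [500] * 5   # quiet host: 10 requests, half of them slow

p90_a = percentile(host_a, 90)
p90_b = percentile(host_b, 90)

# Naive aggregation: average the two per-host quantiles.
averaged_p90 = (p90_a + p90_b) / 2

# Ground truth: the quantile of all raw samples together.
true_p90 = percentile(host_a + host_b, 90)

print(f"host A p90: {p90_a} ms, host B p90: {p90_b} ms")
print(f"average of p90s: {averaged_p90} ms, true combined p90: {true_p90} ms")
```

In this example the averaged figure lands far from the true combined value, because the average ignores how many requests each host served. That is the gap any scheme for combining quantiles has to close.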
When working with Software Engineering teams to improve their observability, I often find that working with a method helps tremendously. A method with a short and catchy acronym really drives the lessons home. Soon I’ve got management writing notes about my acronyms and including them in planning meetings. Methods have a winning madness. Now let’s use them for logging too.
Java isn’t my favorite language to work in. However, I realized that to roll out a successful Observability plan I needed good examples, and most of the teams I work with create Spring Boot applications. So off I went to create a simple Java Spring Boot application that demonstrated a structured logging approach, metrics following the 4 Golden Signals pattern, and integration with basic tracing. The goal is to show how to pivot from one tool’s data to another’s. The problem was working with Prometheus’s Exemplars.
Charity Majors, from Honeycomb.io, recently wrote “Observability: The 5-Year Retrospective.” In it Charity takes a look back, attempts to define what Observability actually is, and lays out a set of capabilities any Observability platform must have. Charity’s vision helps all of us understand what we strive for in creating better software and services. However, I propose that there is a wide gap between the visionaries (or vendors) and the real Observability requirements for teams on the ground. Observability is a practice and not a product. It is time to define observability from a practitioner’s point of view.
If you use SSH keys and haven’t migrated to the newer ED25519 Twisted Edwards curve key pairs, well, you should. It is presently the most recommended key type: faster and possibly more secure than RSA keys. Even though this type has been supported by OpenSSH for a number of years now, there are still some tricks to have up your sleeve.
The best thing about using the Prometheus Operator to manage Prometheus in Kubernetes is the CRDs. Alert rules can be managed directly by the application Helm Charts, FluxCD, or your Cloud Native pipeline of choice. For me, for now, that is Helm Charts. The only problem is that both Helm and the Prometheus Alert Rules use Golang Text Templates. The question is: how does one keep the Helm Chart templates from getting ugly and super complex?
It’s been quiet around LinuxCzar. Mostly because I’ve been thinking about keyboards again. This line of thought closely follows career changes. When opportunity knocks, as they say.
I first learned how to program in a computer lab full of IBM PS/2 Model 25 machines, each with its own SSK form factor Model M buckling spring keyboard. I’m amazed at my memory of a lab full of those SSK keyboards (a form factor better known today as the tenkeyless or TKL layout). Those go for quite a penny on eBay as of this writing. Usually about the same price as what an entire Model 25 goes for.
On Wednesday, November 25, 2020, AWS suffered an outage of their Kinesis service in the us-east-1 region. These RCA (Root Cause Analysis) write-ups are always illuminating. Not just as a peek inside how AWS services work, but as a chance for the industry as a whole to peer review best practices. I had a few key takeaways from my experience in Observability that I want to point out.
At 5:15 AM PST, the first alarms began firing for errors on putting and getting Kinesis records. Teams engaged and began reviewing logs.