Let's Talk: AWS October Outage and AI Observability

The AWS us-east-1 outage taught me something important: with all their AI investments, humans still had to save the day. Maybe that’s why Andy Jassy fired 30,000 people.

Read the Full Article

Cardinality Cloud and Prometheus SLOs

October 7, 2025 Operations prometheus cardinality_cloud

I’d like to announce my new venture, Cardinality Cloud, LLC. I’ve always wanted to take what I’ve done for hyper-growth companies like Palo Alto Networks and Fitbit and make it more available to everyone. Readers of the site will not be surprised that I’m specializing in Site Reliability Engineering (SRE) and Observability consulting with a focus on cost-effective and Open Source setups. I am also launching products aimed at making that even easier. The first one is the Prometheus Alert and SLO Generator! Please take a look and let me know what you think.

Read the Full Article

Monitorama PDX 2023: Finding Π in Observability

July 1, 2023 Operations logs prometheus talks

It’s been a long time since I’ve done a presentation in front of real live people. I was honored to be selected as a Speaker at Monitorama PDX 2023, which was held last week. I put together a talk on some of the challenges that I, and many others, face in the quest to improve Observability. I wanted to touch on some mathematical concepts that underlie the choices I make, and why the Four Golden Signals really matter. I close out this talk with an experiment to do the impossible – to aggregate quantiles together to form larger quantiles. Take a listen to find out what happened.
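For readers wondering why aggregating quantiles is called impossible, here is a minimal sketch. It is not the experiment from the talk; the latency numbers, the two instances, and the `quantile` helper are all made up for illustration. It compares the average of two per-instance p99 values against the p99 computed over the merged raw data.

```python
def quantile(values, percent):
    """Nearest-rank quantile: the smallest value with at least
    `percent` percent of the data at or below it."""
    ordered = sorted(values)
    # Integer ceiling of percent * n / 100, avoiding float rounding.
    rank = max(1, -(-percent * len(ordered) // 100))
    return ordered[rank - 1]

# Hypothetical request latencies in milliseconds.
fast = [10] * 990 + [50] * 10   # busy instance: 1000 requests
slow = [200] * 9 + [900]        # quiet instance: 10 requests

p99_fast = quantile(fast, 99)
p99_slow = quantile(slow, 99)

# Naive aggregation: average the two per-instance p99 values.
averaged = (p99_fast + p99_slow) / 2

# Ground truth: p99 over every request from both instances.
merged = quantile(fast + slow, 99)

print(f"p99 fast={p99_fast} slow={p99_slow}")
print(f"average of p99s={averaged}, p99 of merged data={merged}")
```

The averaged figure lands far from the merged one because a quantile discards how many requests sit behind it, so the quiet instance gets the same weight as the busy one.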

Read the Full Article

Logging and Eventing with the SEARCH Method

April 9, 2022 Operations sre logs

When working with Software Engineering teams to improve their observability, I often find that working with a method helps tremendously. A method with a short and catchy acronym really drives the lessons home. Soon I’ve got management writing notes about my acronyms and including them in planning meetings. Methods have a winning madness. Now let’s use them for logging too.

Read the Full Article

Prometheus Exemplars in Java Spring Boot

January 17, 2022 Operations prometheus java exemplars tracing

Java isn’t my favorite language to work in. However, I realized that to roll out a successful Observability plan I needed good examples, and most of the teams I work with create Spring Boot applications. So off I went to create a simple Java Spring Boot application that demonstrated a structured logging approach, metrics following the 4 Golden Signals pattern, and integration with basic tracing. The goal is to show how to pivot from one tool’s data to another. The problem was working with Prometheus’s Exemplars.

Read the Full Article

What Is Observability: A Practitioner's View

October 24, 2021 Operations sre prometheus

Charity Majors, from Honeycomb.io, recently wrote “Observability: The 5-Year Retrospective.” In it Charity takes a look back, attempts to define what Observability actually is, and lays out a set of capabilities any Observability platform must have. Charity’s vision helps all of us understand what we strive for in creating better software and services. However, I propose that there is a wide gap between the visionaries (or vendors) and the real Observability requirements for teams on the ground. Observability is a practice and not a product. It is time to define observability from a practitioner’s point of view.

Read the Full Article

Twisted Edwards Curve SSH Keys

August 11, 2021 Operations sysadmin ssh

If you use SSH keys and haven’t migrated to the newer ED25519 Twisted Edwards curve key pairs – well, you should. It is presently the most recommended key type: faster and possibly more secure than RSA key types. Even though this type has been supported by OpenSSH for a number of years now, there are still some tricks to have up your sleeve.

Read the Full Article

Helm Chart Prometheus Rules

August 5, 2021 Operations prometheus helm

The best thing about using the Prometheus Operator to manage Prometheus in Kubernetes is the CRDs. Alert rules can be managed directly by the application Helm Charts, FluxCD, or your Cloud Native pipeline of choice. For me, for now, that is Helm Charts. The only problem is that both Helm and the Prometheus Alert Rules use Golang Text Templates. The question is: how does one keep the Helm Chart templates from getting ugly and super complex?

Read the Full Article

Thinking About Keyboards, Part The Second

May 30, 2021 Operations sysadmin keyboards

It’s been quiet around LinuxCzar. Mostly because I’ve been thinking about keyboards again. This line of thought closely follows career changes. When opportunity knocks – as they say.

I first learned how to program in a computer lab full of IBM PS/2 Model 25 machines, each with its own SSK form factor Model M buckling spring keyboard. I’m amazed at my memory of a lab full of those SSK keyboards (which are better known as the tenkeyless or TKL layouts today). Those go for quite a penny on eBay as of this writing. Usually about the same price as what an entire Model 25 goes for.

Read the Full Article

AWS Kinesis Outage

December 7, 2020 Operations aws

On Wednesday, November 25, 2020, AWS suffered an outage of their Kinesis service in the us-east-1 region. These RCA (Root Cause Analysis) write-ups are always illuminating. Not just as a peek inside how AWS services work, but as a chance for the industry as a whole to peer review best practices. I had a few key takeaways from my experience in Observability that I want to point out.

At 5:15 AM PST, the first alarms began firing for errors on putting and getting Kinesis records. Teams engaged and began reviewing logs.

Read the Full Article