
Designing systems at scale often means building some sort of “cluster,” which immediately raises common and important questions about consistency, availability, and how the members reach consensus.

In general, consensus algorithms are easy to implement poorly and expensive in resources. The default answer is to build on top of a trusted distributed database like Riak. There are also libraries that you can integrate into your own code to implement various consistency algorithms. But again, this depends on the problem at hand.

Definition

Conflict-free Replicated Data Types (CRDTs) are data types and associated algorithms that achieve strong eventual consistency (AP systems) and monotonicity (no rollbacks to recover a known good state). Mathematically, the data cannot conflict, and eventual synchronization will bring about consensus in all replicas.

There are three mathematical properties the data must satisfy to avoid the complex and slow consensus algorithms found in distributed databases and other clusters. This means that if data can be expressed in some way with these properties, it becomes relatively easy to build a highly available (and fast) cluster. They are:

  • Associative $[a] \cup ([b] \cup [c]) = ([a] \cup [b]) \cup [c]$, so that grouping doesn’t matter
  • Commutative $[a] \cup [b] = [b] \cup [a]$, so that order of application doesn’t matter
  • Idempotent $[a] \cup [a] = [a]$, so that duplication does not matter

Taken from Distributed systems for fun and profit.

Examples and Variations

Readers know my fondness (or lack thereof) for timeseries and monitoring. This will be the basis for a couple algorithms highlighted below. A timeseries database (TSDB) receives a multitude of data points where each data point is in the form of $(k, t, v)$ with $k$ being the unique timeseries key or name, $t$ a timestamp, and $v$ the value of the measurement at time $t$.

When our imaginary TSDB receives a data point it uses some load balancing method to produce a set of servers within the cluster that hold replicas of the data for $k$. Assume that delivery is assured between servers in the cluster.

Each $k$ is represented in storage as a set of elements $e$, where $e = (t, v)$. The sets are ordered by $t$.

Grow Only Ordered Set

Grow Only sets have one real operation which is to add an element to the set. This is a simple set union operation. Above, it was shown that set unions satisfy the properties of a CRDT. A specific $e$ can be part of any larger transaction (associative), can be inserted in any order (commutative), and can have duplicate insertions (idempotent) and still produce the same ordered set.

Grow Only Set
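
To make the grow-only idea concrete, here is a minimal Python sketch (the class and method names are mine, not from any particular implementation). The add and merge operations are plain set insertion and union, so they inherit the three properties above.

class GrowOnlySet:
    """Grow-only ordered set of (t, v) data points for a single series k."""

    def __init__(self):
        self.elements = set()

    def add(self, t, v):
        # Idempotent: adding a duplicate (t, v) is a no-op.
        self.elements.add((t, v))

    def merge(self, other):
        # Set union is associative and commutative, so replicas can
        # exchange state in any order or grouping and still converge.
        self.elements |= other.elements

    def ordered(self):
        # Reads present the set ordered by timestamp.
        return sorted(self.elements)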

Last Write Wins

In a perfect world there is only one immutable $v$ for any $t$ in $k$. But from the operations perspective we know that at some point someone or something will produce a data point with a duplicate $t$ and a different $v$. How do we achieve consensus in our CRDT? Using the above methodology, each replica could end up with a different set for $k$ depending on the order in which the data points were received.

Last Write Wins

A Last Write Wins (LWW) CRDT applies a version number to each incoming bit of data, usually a timestamp taken when the data point was first received. The data point is sent to the replicas with this extra version number. So, in this case $e = (t, t_v, v)$ and the entire $e$ is stored as a member of the set representing $k$.

The union operation also compares $t_v$ values if an element at $t$ already exists: $v$ is updated only if the new $t_v$ is greater than the one currently in the set.
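
A rough sketch of that union rule, continuing the toy Python model above (names and layout are illustrative assumptions, not the design of any real TSDB):

class LWWSeries:
    """Stores one (t_v, v) per timestamp t for a single series k."""

    def __init__(self):
        self.points = {}  # t -> (t_v, v)

    def add(self, t, t_v, v):
        current = self.points.get(t)
        if current is None or t_v > current[0]:
            # Only a strictly newer version number replaces the stored value,
            # so replaying or reordering the same updates cannot change the result.
            self.points[t] = (t_v, v)

    def merge(self, other):
        for t, (t_v, v) in other.points.items():
            self.add(t, t_v, v)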

Final Thoughts

This implies that a CRDT-based TSDB is “easy” to construct. Indeed, it doesn’t require the full semantics of an ACID-compliant database cluster. But I wouldn’t go so far as to call it easy. I’m a bit annoyed that a third value is required in each element to make the system fully consistent, as I’m in the habit of representing a timestamp as a 64-bit integer. What’s 24 bytes per data point between friends?

Claiming that a TSDB designed this way is append-only also has issues with re-submitted values, as it’s still possible for the order of the submissions to be interchanged.


One of the killer app features of Prometheus is its native support of histograms. The move toward supporting and using histograms in the metrics and data-based monitoring communities has been, frankly, revolutionary. I know I don’t want to look back. If you are still relying on somehow aggregating means and percentiles (say, from StatsD) to visualize information like latencies about a service, you are making decisions based on lies.

I wanted to dig into Prometheus’ use of histograms. I found some good, some bad, and some ugly. Also, some potential best practices that will help you achieve better accuracy in quantile estimations.

Graphite Latencies

To run these experiments, I needed some test data. I pulled out of ELK the render times for my Graphite cluster. These are the latencies for 10,000 Graphite queries (time to build a graph or return JSON results) in seconds. Prometheus doesn’t have good support for heatmaps or otherwise visualizing the raw histogram so I’ve tossed the data into R. Let’s assume for illustration purposes that this represents 1 minute’s worth of requests.

Graphite Retrieval Times

This isn’t very visually appealing. The latency data (as you might imagine) clusters near zero. There is a very long tail, and most of it lies to the right of the graphed area, which was clipped to keep the histogram readable. Each bin in the histogram is 0.2 seconds wide.

Basic Statistic Analysis

This is our very basic statistical analysis. Here $\mu$ (mean or average) is 0.2902809, $\sigma$ (standard deviation) is 0.7330352, and $q(.95)$ is 1.452352. These are the actual values as calculated by R on the raw data set.

Prometheus Quantile Estimation

Calculating a quantile from a histogram is actually an estimation whose error depends on the granularity of the histogram’s bin widths. Because the data in a histogram is naturally ordered, you know exactly which bin contains an arbitrary quantile. Prometheus (and many other tools, as it’s about the only method we have) then estimates the value by doing linear approximation over the selected bin.

Known Location of 95th Percentile

Out of 10,000 samples the 9,501st falls into the 8th bucket. The 8th bucket has 368 samples and the 9,501st sample is the 93rd sample in this bucket.

Around line 107 of promql/quantile.go in the Prometheus source code we find the actual code that does the approximation:

return bucketStart + (bucketEnd-bucketStart)*float64(rank/count)

Doing this on paper:

$$(7 * 0.2) + 0.2 * \frac{93}{368} = 1.45054348$$

This gives an error of 0.12% from our known correct value. Pretty good.
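
For reference, the whole estimation can be sketched in a few lines of Python. This follows the same rank-then-interpolate idea; the function and its inputs (bucket upper bounds plus cumulative counts, which is how Prometheus models histograms) are my own simplification, not the Prometheus code.

def estimate_quantile(q, upper_bounds, cumulative_counts):
    """Linear approximation of quantile q from cumulative histogram buckets."""
    total = cumulative_counts[-1]
    rank = q * total
    for i, count in enumerate(cumulative_counts):
        below = cumulative_counts[i - 1] if i > 0 else 0
        if count >= rank and count > below:
            bucket_start = upper_bounds[i - 1] if i > 0 else 0.0
            bucket_end = upper_bounds[i]
            # Assume samples are evenly spread across the winning bucket.
            return bucket_start + (bucket_end - bucket_start) * (rank - below) / (count - below)
    return upper_bounds[-1]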

What Went Right?

This gives the impression that Prometheus’ quantile estimation isn’t horrible – and it isn’t. But there are a ton of caveats working in our favor when modeling this with R.

  1. R makes it easy to build a model for the data. Prometheus requires us to know the data model first. How do you decide the bin widths?
  2. The bin widths are small compared to the standard deviation.
  3. Someone typed in 144 bin boundaries and magically knew to stop at 29 seconds.

What Might Have Gone Horribly Wrong?

The Prometheus folks generally advise choosing relatively small bin widths according to the possible range and being sure to include your SLA as one of the boundaries. These are pretty vague guidelines.

Few and Variable Bin Widths

Here the bin boundaries are: ${.05, .1, .5, 1, 5, 10, 50, 100}$. Where $q(.95)$ is sample 711 of the 1187 in bin 5.

This has been my common experience coaching folks to get the best data into and out of Prometheus. It’s very common to set up just a few bin boundaries, choosing many where the bulk of the data points are expected and a few larger widths to encompass the tail. This doesn’t even look like it will be an accurate estimation, and it isn’t.

On paper:

$$1 + 4 * \frac{711}{1187} = 3.39595619$$

The estimate is 234% of the known value, an error of about 134%. ‘nough said.

Log-Linear Histograms

Other available histogram implementations use a “log-linear” approach to bin widths, notably HdrHistogram and Circonus’s implementation. The direct advantage is that bin widths do not need to be specified by the user; they are chosen automatically using a fixed number of significant digits and an exponent. Two significant digits achieve a pretty high level of accuracy and are what the Circonus implementation uses. The downside is that the number of bins we must keep up with is variable.

Example: If we observe the number 1,234 and add it to a histogram we would increment the total number of observations in the bin defined as $1.2 \times 10^{3}$. This implies that all observations counted in the same bin are within 1%, so this style of histogram will produce quantile estimations that have a worst case error of 1%.
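
A sketch of that bucketing rule in Python (my own illustration of the general log-linear idea, not HdrHistogram’s or Circonus’s actual code):

import math

def log_linear_bin(value, sig_digits=2):
    """Identify the bin for a positive value as (mantissa, exponent),
    e.g. 1234 -> (12, 3), meaning the bin starting at 1.2 x 10^3."""
    exponent = math.floor(math.log10(value))
    # Keep sig_digits leading digits; floating point edge cases right at
    # bin boundaries are ignored in this sketch.
    mantissa = math.floor(value / 10 ** (exponent - sig_digits + 1))
    return mantissa, exponent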

Log-Linear Histogram

This is a visualization of a Log-Linear histogram. It is a logarithmic representation of multiple linear histograms. Each bin with the same exponent has the same width (which is why they appear to become smaller in this logarithmic plot). The latency data is also much better visualized in a logarithmic style. For example, you can tell now that there are multiple modes.

Let’s work our quantile estimation algorithm again and see how it stacks up. $q(.95)$ is now in the 365th bucket as the 85th of 178 samples. R still calculates the raw $q$ value to be $10^{0.1620718} = 1.452352$. Linear approximation gives us:

$$0.146128036 + (0.176091259 - 0.146128036) * \frac{85}{178} = 0.1604363$$ $$10^{0.1604363} = 1.446893$$

The error is 0.4%. Not quite as good as the first “perfect” example, but this is achievable without knowing bin boundaries or much else about the data beforehand. Also, these histograms are perfectly aggregatable because they always use consistent bin widths. (You cannot aggregate histograms in Prometheus if the boundaries are different.)

Prometheus and Accurate Quantile Estimations

So, how do we use this information to make better use of histograms in Prometheus? The challenge here is that Prometheus represents histograms in a fixed number of bins – each of which is simply a counter type metric. We cannot arbitrarily add new bins.

What do we know about the data our application will observe into a Prometheus histogram metric? I bet we intuitively have a much better idea of the range or orders of magnitude of that data than of its distribution. So I suggest to my users that they use an algorithm to produce a list of histogram bin boundaries rather than hard coding some best-guess boundaries.

def logLinearBuckets(minExp, maxExp):
    """Bucket boundaries covering 10**minExp up to (but not including) 10**(maxExp+1)."""
    return [ d * 10**(e-1) for e in range(minExp, maxExp+1) for d in range(10, 100) ]

Plug in the exponents to represent the range of orders of magnitude for your histogram. The above example used a range of -4 to 2. Full Python source is available here with more docs and unit tests.
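
As a usage sketch, the boundaries can be fed straight into a client library histogram. This assumes the Python prometheus_client library; the metric name and exponent range below are placeholders.

from prometheus_client import Histogram

RENDER_SECONDS = Histogram(
    "graphite_render_seconds",         # hypothetical metric name
    "Graphite render latency in seconds",
    buckets=logLinearBuckets(-4, 2),   # 630 boundaries; the client appends the +Inf bucket
)

RENDER_SECONDS.observe(0.29)           # record one request's latency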

Closing Thoughts

The final Log-Linear histogram would be represented in Prometheus with 631 bins, which is a lot more than common usage. I’ve been asked how this impacts performance and, yes, there is a cost. But we are still well within the range of the Prometheus authors’ expectations, so the little bit of extra storage and query compute time is well worth knowing the error of quantile estimations up front.

The first “perfect” example would be represented in Prometheus with 145 bins, assuming we know to stop at 29. If we did not know where to stop and used a range of 0 - 100, then 500 bins would be needed. This again shows that the Log-Linear method uses a very similar amount of resources compared to what a histogram might be hand-tuned to after much trial and error in the name of accuracy.

I’d like to propose this method as a best practice for generating bin boundaries for Prometheus histograms. It’s no more expensive than what one would gravitate to for better accuracy, and it functions with very little knowledge of the data before it is observed. Finally, you know the error of your quantile estimations up front.


There are many factors that limit the available bandwidth of a network link from point A to point B. Knowing and expecting something reasonably close to the theoretical maximum bandwidth is one thing. However, the latency of the link can vastly affect available throughput. This is called the Bandwidth Delay Product. It can be thought of as the “memory” of the link as well (although that memory is the send/receive buffers). This came into play when I was setting up a global monitoring system for my client’s geographically diverse data centers.

Warning: Math

The Bandwidth Delay Product, $BDP$, is found by multiplying the theoretical max throughput of the link, $BW$, by the Round Trip Time, $RTT$.

$$BW \times RTT = BDP$$

For example, a 100Mbps link at 80ms latency:

$$\frac{12.5 MiB}{seconds} \times 0.08 seconds = 1 MiB$$

This indicates that the TCP buffers on either side of the link must be able to store 1 MiB of data. There are many, many places in modern code where the default TCP buffer size is set to 64 KiB.

$$\frac{64 KiB}{0.08 seconds} = 800 KiB/s = 6.25 Mbps$$

Ouch. That 64 KiB buffer size really hurt the throughput on the 100 Mbps link. In fact, 64 KiB is really painful for trans-continental links.
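
The arithmetic is simple enough to keep in a small helper. A quick sketch, using the same unit conventions as the example above (100 Mbps treated as 12.5 MiB/s):

def bdp_bytes(bandwidth_bytes_per_sec, rtt_seconds):
    """Bandwidth Delay Product: the buffer needed to keep the link full."""
    return bandwidth_bytes_per_sec * rtt_seconds

def max_throughput_bytes_per_sec(buffer_bytes, rtt_seconds):
    """Best-case throughput when the window is capped by the buffer size."""
    return buffer_bytes / rtt_seconds

print(bdp_bytes(12.5 * 1024 * 1024, 0.08) / 1024 / 1024)       # 1.0 MiB buffer needed
print(max_throughput_bytes_per_sec(64 * 1024, 0.08) / 1024)    # 800.0 KiB/s through 64 KiB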

In Pictures

Below is a logarithmic graph of bandwidth vs latency at 64 KiB buffer sizes. The curve is available bandwidth and the blue vertical line is at the 80ms point.

Solutions

Check your socket code. There may be assumptions, limitations, or a configuration option in the code.

Linux kernel wise, most modern distributions come fairly well tuned for TCP performance. You may need to tune the default and maximum memory that can be allocated to a socket:

  • net.core.rmem_default
  • net.core.rmem_max
  • net.core.wmem_default
  • net.core.wmem_max

Also tune the TCP minimum, default, and maximum buffer sizes. These settings also apply to TCP over IPv6.

  • net.ipv4.tcp_rmem
  • net.ipv4.tcp_wmem
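
On the application side, the socket buffers can also be raised explicitly. A minimal Python sketch; the 4 MiB figure is just an example, the kernel clamps the request at net.core.rmem_max/wmem_max, and Linux typically reports back double the value you set:

import socket

sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
print(sock.getsockopt(socket.SOL_SOCKET, socket.SO_RCVBUF))    # the distribution default

wanted = 4 * 1024 * 1024   # well above the 1 MiB BDP example above
sock.setsockopt(socket.SOL_SOCKET, socket.SO_RCVBUF, wanted)
sock.setsockopt(socket.SOL_SOCKET, socket.SO_SNDBUF, wanted)
print(sock.getsockopt(socket.SOL_SOCKET, socket.SO_RCVBUF))    # what the kernel actually granted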

In Conclusion

I enjoy technical solutions to problems. If there’s a little math involved and a trick or two, I tend to follow that path for a solution. But there is one thing that will affect your TCP throughput even more than the Bandwidth Delay Product. Be absolutely certain that the other side of the connection is reading fast enough. TCP is designed to put back pressure on the writer if the reader isn’t fast enough or purposely slows down.

Armed with this knowledge and a few tuning options, I chased a problem for too long. In reality, there was some network congestion on a trans-continental link (no surprise there) and the tool I was using had a bug. It would block on TCP writes in its event loop, which caused it to slow down on reads, which meant the daemon on the other side would block on TCP writes.

Now, if only finding that bug had been that simple.

Notes

f(x) = (64*1024)/x
set ylabel "Bandwidth (Bytes Per Second)"
set xlabel "Latency (Seconds)"
set title "Throughput with 64K TCP Window / Buffer Size"
set logscale y 2
set key off
set label "100 Mbps" at .2,18107200
set label "1 Gbps" at .2,1.298576e9
set label "80 ms" at 0.1,1.04858e6
set arrow from 0.08,65535 to 0.08,2.097152e9 nohead lc rgb 'blue'
set terminal png size 800,600
set output 'output.png'
plot [0:1] [65536:2.097152e9] f(x), 13107200 with dots, 1.048576e9 with dots


I’ve been experimenting with Cyanite to make my Graphite cluster more reliable. The main problem I face is that when a data node goes down, the Graphite web app more or less stops responding to requests. Cyanite is a daemon written in Clojure that runs on the JVM. The daemon is stateless and stores timeseries data in Cassandra.

I found the documentation a bit lacking, so here’s how to setup Cyanite to build a scalable Graphite storage backend.

  1. Acquire a Cassandra database cluster. You will need at least Cassandra 3.4. The Makefile tests use Cassandra 3.5. I used Cassandra 3.7 in my experiments which is the current release as of this writing. (Note Cassandra’s new Tick-Tock based release cycle.)

    Parts of the documentation indicated that Elasticsearch was required. That is no longer the case. Cyanite must store a searchable index of the metrics it has data points for so that it can resolve glob requests into a list of metrics. Example:

    carbon.agents.*.metricsReceived
    

    This is now done in Cassandra using SASI indexes, which enable CQL SELECT statements to use the LIKE operator. This is the feature that requires a more recent Cassandra version than you may be running in production.

  2. Clone the Cyanite Git repository. There are no tags or releases. However, the rumor at Monitorama 2016 is that Cyanite is a stable and scalable platform. So I just grabbed the master branch.

    git clone https://github.com/pyr/cyanite.git
    
  3. Create a Cassandra user depending on your local policy. Import the schema to initially create the keyspace you will use. The schema is found in the repository:

    doc/schema.cql
    

    Here, I altered the schema to set the replication factor I wanted. So I created my keyspace like this:

    CREATE KEYSPACE IF NOT EXISTS metric WITH replication =
    {'class': 'SimpleStrategy', 'replication_factor': '3'}
    AND durable_writes = true;
    

    I’m only replicating in a Cassandra database that lives in a single data center. No cross data center replication strategies here…yet.

  4. Install Leiningen. This is the build tool used by the Cyanite project. It seems very friendly and installs locally into your home directory. This allows you to build JARs and other distributable versions of the code.

  5. I need to distribute code as Debian packages for Ubuntu. Fortunately, we have a target to build just that.

    $ cd path/to/cyanite/repo
    $ lein fatdeb
    

    This should produce artifacts in the target/ directory.

  6. Install the Cyanite packages. Configure /etc/cyanite.yaml to match your storage schema file (from carbon-cache.py) and with the connection information about your Cassandra cluster.

    An example configuration with additional documentation can be found in the Cyanite repo.

    doc/cyanite.yaml
    

    Here is a sanitized version of my config. This required some parsing of the source to find needed options.

    # Retention rules from storage-schema.conf
    engine:
      rules:
        '^1sec\.*': [ "1s:14d" ]
        '^1min\.*': [ "60s:760d" ]
        '^carbon\..*': [ "60s:30d", "15m:2y" ]
        default: [ "60s:30d" ]

    # IP and PORT where the Cyanite REST API will bind
    api:
      port: 8080
      host: 0.0.0.0

    # An input, carbon line protocol
    input:
      - type: carbon
        port: 2003
        host: 0.0.0.0

    # Store the metric index in Cassandra SASI indexes
    index:
      type: cassandra
      keyspace: 'metric'
      username: XXXXXX
      password: YYYYYY
      cluster:
        - cas-000.foobar.com
        - cas-001.foobar.com
        - cas-002.foobar.com

    # Time drift calculations.  I use / trust NTP.
    drift:
      type: no-op

    # Timeseries are stored in Cassandra
    store:
      keyspace: 'metric'
      username: XXXXXX
      password: YYYYYY
      cluster:
        - cas-000.foobar.com
        - cas-001.foobar.com
        - cas-002.foobar.com

    # Logging configuration.  See: https://github.com/pyr/unilog
    logging:
      level: info
      console: true
      files:
        - "/var/log/cyanite/cyanite.log"
      overrides:
        io.cyanite: "debug"
    
  7. Cyanite should be startable at this point. You can test that it accepts carbon line protocol metrics and that they are returned by the Cyanite REST API.
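
    For a quick smoke test you can push a single data point at the carbon input configured above with a few lines of Python (the host name and metric name here are placeholders):

    import socket
    import time

    # One data point in carbon line protocol: "<metric> <value> <timestamp>\n"
    line = "test.cyanite.smoke 42 %d\n" % int(time.time())
    with socket.create_connection(("cyanite-000.foobar.com", 2003)) as conn:
        conn.sendall(line.encode("ascii"))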

  8. Package and install Graphite-API along with the Cyanite Python module. Graphite-API is a stripped-down version of the Graphite web application that uses pluggable finders to search different storage backends as a Flask application. Python’s Pip can easily find these packages. This is a WSGI application, so deploy it however you normally deploy such applications. I use mod_wsgi with Apache to run this on port 80.

    Here is a sample /etc/graphite-api.yaml that configures Graphite-API to use the Cyanite plugin and query the local Cyanite daemon:

    # Where the graphite-api search index is built
    search_index: /var/tmp/graphite-index

    # Plugins to use to find metrics
    finders:
      - cyanite.CyaniteFinder

    # Additional Graphite functions
    functions:
      - graphite_api.functions.SeriesFunctions
      - graphite_api.functions.PieFunctions

    # Cyanite Specific options
    cyanite:
      urls:
        - http://127.0.0.1:8080

    time_zone: UTC
    

    My plan here is that I can deploy many of these Cyanite / Graphite-API machines in a load balanced fashion to support my query and write loads. They are completely stateless like any good web application so choose your favorite load balancing technique.

At this point you should have a basic Cyanite setup that is able to answer normal Graphite queries and ingest carbon metrics. You might want to use a tool like carbon-c-relay to route metrics into the Cyanite pool. You could point Grafana directly to the load balanced Graphite-API or use the normal Graphite web application (if you like the Graphite composer) and list the Graphite-API load balanced VIP as the single CLUSTER_SERVERS entry.

This should at least get you going with Cyanite as a Graphite storage backend. There will be much tuning and testing to transform this into a scalable system depending on your exact setup. I am just starting down this path and may have more to share in the future. Or it may blow up on me. Time will tell.

Update 2016/07/19: There are several other Graphite storage backends that I’m aware of. All are Cassandra based.

What am I missing?


I’ve updated Buckytools, my suite for managing at scale consistent hashing Graphite clusters, with a few minor changes.

Sparse File Support

The buckyd daemon now supports working with sparse Whisper DB files on disk. In this case it’s assumed that you have carbon-cache.py daemons running with:

WHISPER_SPARSE_CREATE = True

Any new Whisper files that buckyd copies into place will also be checked, in 4KiB blocks, for areas that can be made sparse. Therefore, when running bucky rebalance files that were sparse on one server can be moved to a new server and recreated as sparse files.
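
The idea, roughly, in Python (a sketch of the general technique, not the actual Go code in buckyd): copy block by block and seek over all-zero blocks so the destination file ends up with holes.

BLOCK = 4096

def copy_sparse(src_path, dst_path):
    with open(src_path, "rb") as src, open(dst_path, "wb") as dst:
        while True:
            block = src.read(BLOCK)
            if not block:
                break
            if block == b"\x00" * len(block):
                dst.seek(len(block), 1)   # leave a hole instead of writing zeros
            else:
                dst.write(block)
        dst.truncate()                    # extend the file over any trailing hole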

Using bucky tar works as before, but the generated archives do not have the GNU sparse types set and, if expanded by hand, will not automatically result in sparse files.

The bucky du command works as before and reports the apparent size of the Whisper files on disk. Similar to:

du -hs --apparent-size

Restoring tarballs with bucky restore attempts to create sparse files in the cluster.

To enable support for sparse Whisper DB files run the daemon with the -sparse option:

description "Buckyd, the Buckytools daemon"
author      "Jack Neely <jjneely@42lines.net>"

start on startup
stop on shutdown

setuid graphite

exec /usr/bin/buckyd --sparse \
    graphite010:2104:a \
    graphite011:2104:a \
    graphite012:2104:a

Bucky Restore Bug Fixes

I can tell that we restore tarballs a lot using these tools. Oops! I’ve corrected bucky restore to properly ignore directories in the tarballs rather than create 0 length Whisper DBs in the cluster.

bucky-pickle-relay

This tool, which listens for Graphite’s pickle protocol and emits Graphite’s text protocol, has had some more verbose debugging added. It is on my short list for improvements, such as not storing things in memory as strings; Go’s UTF-8 strings are very resource and memory intensive.


In Go 1.5 we have the beginnings of vendoring support. The easiest way to incorporate other projects into your Git repo is with the following command. Short and dirty, but it works:

$ git subtree add --prefix vendor/gopkg.in/check.v1  \
    https://gopkg.in/check.v1 master --squash

Since I won’t remember the command, it’s now here in the blog. Also, don’t forget to set your environment:

$ export GO15VENDOREXPERIMENT=1


I’ve been researching quite a few algorithms for my client, Bruce, as I continue to scale out certain systems. I thought that getting them on my blog would be very useful for a future version of myself and many others. I suspect, and hope, that most folks that work in Systems Administration / Operations will find these at least familiar.

Flap Detection

Flap detection is the ability to determine rapid state changes so that one can take corrective action. The text book example is BGP route flapping where a new route is advertised and then withdrawn (a flap) multiple times in a short period. In general, this “scores” an event with higher scores the more frequent the event.

Warning: Math

For each event, a “penalty” $P$, a “timestamp” $t$, and a boolean “isFlapping” variable are stored. $P$ is the current “score” for this event and is $0$ for an event that has not happened. $t$ is the timestamp, usually Unix Epoch time, of the last event.

Each time an event occurs we add a penalty value to the current penalty. The penalty decays exponentially over time. A suppress limit is defined: the event is flapping when its penalty is greater than the suppress limit. A lower reuse limit is also defined: a flapping event is considered recovered once its penalty drops below the reuse limit.

Summary

  • Penalty ($P$): the event’s current score.
  • Half-Life ($h$): how much time it takes for half of the penalty to decay.
  • Timestamp ($t$): Unix Epoch time of the last event.
  • Suppress Limit: Penalty > Suppress Limit == Flapping
  • Reuse Limit: Penalty < Reuse Limit && Flapping == Recovery

Calculating for $P$

$$P(t_2) = P(t_1) \times e^{-(t_2 - t_1) \times \ln(2) / h}$$
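
Here is a small sketch of the bookkeeping in Python. The constants are illustrative (they roughly match the plot below); adjust the half-life and limits to taste.

import math
import time

HALF_LIFE = 60.0        # h: seconds for the penalty to decay by half
EVENT_PENALTY = 10.0    # added to P on every event
SUPPRESS_LIMIT = 25.0   # above this the event is flapping
REUSE_LIMIT = 10.0      # a flapping event recovers below this

class FlapDetector:
    def __init__(self):
        self.penalty = 0.0
        self.timestamp = None
        self.is_flapping = False

    def _decay(self, now):
        # P(t2) = P(t1) * e^(-(t2 - t1) * ln(2) / h)
        if self.timestamp is not None:
            self.penalty *= math.exp(-(now - self.timestamp) * math.log(2) / HALF_LIFE)
        self.timestamp = now

    def event(self, now=None):
        """Record an occurrence of the event."""
        now = time.time() if now is None else now
        self._decay(now)
        self.penalty += EVENT_PENALTY
        if self.penalty > SUPPRESS_LIMIT:
            self.is_flapping = True
        return self.is_flapping

    def check(self, now=None):
        """Decay the penalty and clear the flapping state below the reuse limit."""
        now = time.time() if now is None else now
        self._decay(now)
        if self.is_flapping and self.penalty < REUSE_LIMIT:
            self.is_flapping = False
        return self.is_flapping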

In Pictures

Notes

f(x) = 2.71828 ** (-x*log(2) / 60)
set ylabel "Penalty"
set xlabel "Time"
set key off
set terminal png size 800,600
set output 'output.png'
set label "Reuse Limit" at 40,11
set label "Suppress Limit" at 100,26
plot [0:240] x < 10 ? f(x) : x < 20 ? f(x-10)*10 : x < 30 ? f(x-20)*20 : f(x-30)*30, 10 with dots, 25 with dots


This is an update of an old post, so it’s back at the top of the blog. The original posting was 2015-08-19.

I’m considering swapping out Statsd with Bitly’s statsdaemon for better performance. But, because Bitly’s version only accepts integer data, I wanted to analyze our Statsd traffic. I figured I’d use my friend tcpdump to capture some traffic samples and replay them through a test box for analysis. Also, figuring out what our hot metrics are is very handy.

# tcpdump -s0 -w /tmp/statsd.pcap udp port 9125

Wireshark confirmed that this was the traffic I was looking for. A spot check suggests I have good integer data. How do I dump out the traffic data so I can at least run grep and other common Unix tools on the text?

The Tcpreplay tools look very powerful. However, they can’t replay TCP traffic at a server daemon because they cannot synchronize the SYN/ACK numbers with the real client. But this is UDP traffic! UDP does provide checksums for data integrity (covering a pseudo-header that includes the IP addresses), so after changing the IP and MAC address via tcprewrite I had packets that my Linux box dropped because the checksums didn’t match.

Back to my friend Wireshark:

$ tshark -r /tmp/statsd.pcap -T fields -e data > data

This dumps out a newline-separated list of the data field of each packet, which is exactly what I need, except that it’s hex-encoded binary data.

# unhex.py: decode the hex-encoded payloads dumped by tshark (Python 2)
import binascii
import sys

for s in open(sys.argv[1], "r").readlines():
    print binascii.unhexlify(s.strip())

Finally, I have a newline-separated list of the Statsd metrics in the pcap data and can run grep!

$ python unhex.py data | gawk -F: '/.+/ { print $1 }' | sort | uniq -c | sort -n

Now I also have a frequency distribution chart of the packet capture showing me what the most common metrics are.


Life is busy around the holidays with much family and new family additions. If there is one thing I wish I could remind the world it is that Christmas does last 12 days from Christmas Day until January 6th. So, Merry Christmas and a Happy 2016 to all! Here are some updates in no particular order.

Google Jump Hash Support in Buckytools

It looks like my Graphite cluster will be used for long-term storage as we migrate toward Prometheus – which means I get no relief in my ever-scaling cluster. So, I’m adding Google’s Jump Hash to Buckytools. With this hashing algorithm you cannot remove a node in the middle of the hash ring without affecting all nodes after it in the ring, so full support of Graphite’s replication factor will find its way into Buckytools as well. If I’ve not merged yet, take a look at the jump branch. The plan here is to be directly compatible with the FNV1a-Jump hashing now implemented in carbon-c-relay.
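
For reference, the jump consistent hash itself is tiny. Here it is transcribed into Python from the Lamping and Veach paper; as I understand it, carbon-c-relay’s variant first runs the metric name through FNV-1a to produce the 64-bit key.

def jump_hash(key, num_buckets):
    """Map a 64-bit key to a bucket in [0, num_buckets)."""
    b, j = -1, 0
    while j < num_buckets:
        b = j
        key = (key * 2862933555777941757 + 1) & 0xFFFFFFFFFFFFFFFF   # 64-bit LCG step
        j = int((b + 1) * ((1 << 31) / ((key >> 33) + 1)))
    return b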

New Graphite Storage Backend

To support my Graphite work I’m moving toward a new storage backend that uses a Ceres-like approach. One or multiple files per metric using a columnar data format so the files grow based on the number of data points stored. Implementing this in Go will give incredible performance improvements and the new file format will give a marked reduction in storage space used. Some code for this is in my new Journal project.

Also, key to this project is being able to regularly compress older data. There are some interesting choices here that ought to help a lot with my storage requirements, but they make the archives not randomly accessible. Doing this for data past a certain age probably makes sense.

This is probably the most challenging aspect of my plans for better Graphite scaling. Will I get the time? Will a better solution present itself? Will an entirely different method of handling telemetry get in the way of my plans to use Prometheus for our operational telemetry?

Graphite Metric Generator

For quick and dirty scaling tests of Graphite, this small tool will generate random metrics with random data, or just random data for a known list of metric names. The gentestmetrics command is part of Buckytools.

Usage of ./gentestmetrics:
Copyright 2015 42 Lines, Inc.
Original Author: Jack Neely

  -c int
        Number of random metrics to generate. (default 1000)
  -f string
        Use metric names found in this file.  Overrides -c.
  -i int
        Seconds between each timestamp/value pair per metric. (default 60)
  -l int
        Upper limit to randomly generated integer values. (default 100)
  -p string
        Prefix prepended to each metric name. (default "test.")
  -s int
        Time stamp to start -- iterations go BACKWARDS from here (default now)
  -v int
        Use the given value rather than a random integer.
  -x int
        How many timestamp/value pairs for each generated metric name. (default 1)

Metrics are printed to STDOUT. The most interesting usage is generating data for a list of known metrics delimited by newlines in a text file. Say with a 1 minute interval and 5 data points each:

./gentestmetrics -f metrics.txt -i 60 -x 5

The generated metrics are output in chronological order. However, the timestamp given via -s dictates the timestamp of the last data point rather than the first.

Is There Anyone Not Podcasting?

A few friends and I started the Practical Operations Podcast. We talk about the practical side of operations work in small to large scale environments and the not so obvious implications of various tools and how you might use them in real world conditions. Check us out!


I’ve been thinking about the “future” and how I can move my metrics system from what I have now to what I’d like to have. I run a large Graphite cluster. In the 26 million metrics per minute range with a quarter petabyte of provisioned storage. I integrate with a Naemon (Nagios fork) and Merlin setup for alerting.

I’ve been following Prometheus for a year and wondering about what the future might be like. Turns out, my fellow Operations team members and the Developers are also highly interested in Prometheus or a tool that offers Prometheus-like features. Specifically:

  • Support ephemeral hosts: Be smarter about how metrics are managed so that each host adds metric data without polluting the namespace with thousands of host entries.
  • Scale storage: No more Whisper files, storage needs to scale based on the timestamp/value pairs we store rather than a pre-allocated chunk of disk space.
  • Scale to a multi-data center environment: Graphite isn’t designed to make multiple clusters in different data centers or regions work well together, although modern versions of Grafana can really help there. Prometheus handles this style of sharding natively.
  • Ability to tag or label metrics: This makes ephemeral hosts work well combined with storage allocated as needed (rather than allocating all possible storage at once).
  • Support advanced metric based alerting: A strength of Prometheus and we can funnel through our Nagios-based monitoring to deal with pager groups etc.

So, how does one get from a monolithic Graphite setup to something like the above? A good question that I’m still trying to work out. Here’s what I’m thinking.

Global Systems:

  • Keep our Nagios based alerting system. It routes alerts, handles paging time periods, and, most importantly, handles alerts from many different sources via checks. Uses PagerDuty, email, etc. as appropriate.
  • Keep the current check_graphite code we are using to do metric based alerting. It enables us to transition when we can and roll back if needed.
  • Setup a Prometheus / AlertManager instance for any global aggregation and handle routing of alerts from Prometheus metric based checks to Nagios.
  • Upgrade Grafana to 2.5 (or better) to be the global user interface to metrics and be able to pull data from many different sources: Graphite, Prometheus, and Elasticsearch.
  • Scale Graphite storage with some form of black magic.

Sharded Systems: These systems are the infrastructure setup as part of each data center or region.

  • A Prometheus server to scrape metrics from local systems and services. Each Prometheus server maps and forwards data points to Graphite. Perhaps an identical second server for redundancy. Alerts and aggregate metrics flow upward toward the global Prometheus service.
  • A local Graphite/Statsd ingestion service found by service discovery to handle and route old school metrics.

This design gives me a Prometheus system we can use for advanced alerting and short-term monitoring of metrics, with great support for ephemeral hosts and labeling. Graphite still collects unconverted metrics and holds our historical or long-term data. Graphite also serves as a long-term storage option for Prometheus. (See this patch.)

What’s left unsolved? The hard parts:

  • Long term metric storage must scale, and Whisper files aren’t cutting it. I need to spend some time with alternate Graphite backends or write my own. Many of the existing options bring along their own challenges. I am required to keep full-resolution data for years.

I have some ideas here. I had hopes for InfluxDB, but it does not appear stable. But I’m thinking of something far simpler. More to come here.

Will this work? Will it scale to 20 million metrics or more? Perhaps it’s worth finding out.
