LinuxCzar

Engineering Software, Linux, and Observability. The website of Jack Neely.    

Monitorama PDX 2023: Finding Π in Observability

It’s been a long time since I’ve done a presentation in front of real live people. I was honored to be selected as a speaker at Monitorama PDX 2023, which was held last week. I put together a talk on some of the challenges that I, and many others, face in the quest to improve Observability. I wanted to touch on some mathematical concepts that underlie the choices I make, and why the Four Golden Signals really matter.

Read the Full Article

Logging and Eventing with the SEARCH Method

When working with Software Engineering teams to improve their observability, I often find that working with a method helps tremendously. A method with a short and catchy acronym really drives the lessons home. Soon I’ve got management writing notes about my acronyms and including them in planning meetings. Methods have a winning madness. Now let’s use them for logging too.

Read the Full Article

Prometheus Exemplars in Java Spring Boot

Java isn’t my favorite language to work in. However, I realized that to roll out a successful Observability plan I needed good examples, and most of the teams I work with create Spring Boot applications. So off I went to create a simple Java Spring Boot application that demonstrated a structured logging approach, metrics following the 4 Golden Signals pattern, and integration with basic tracing. The goal is to show how to pivot from one tool’s data to another. The problem was working with Prometheus’s Exemplars.

Read the Full Article

What Is Observability: A Practitioner's View

Charity Majors, from Honeycomb.io, recently wrote “Observability: The 5-Year Retrospective.” In it, Charity takes a look back, attempts to define what Observability actually is, and lays out a set of capabilities any Observability platform must have. Charity’s vision helps all of us understand what we strive for in creating better software and services. However, I propose that there is a wide gap between the visionaries (or vendors) and the real Observability requirements for teams on the ground. Observability is a practice and not a product. It is time to define observability from a practitioner’s point of view.

Read the Full Article

Writing Change Management Announcements

  August 23, 2021   Operations   docs

The most important part of a Change Management process is simply being able to tell the stakeholders about an upcoming change. This is especially true in a world where email is often left unread, there is no one “centralized” group or team making changes, and instant messenger is possibly the only real means of communication. Ideally, a Master Station Log (MSL) or Service Status Page is a focal point for communications about planned events and emergent events. Additional policies may define channels used to broadcast those messages to get the attention of the target audience. Whether or not any of that exists, writing that change announcement really is the most critical part.

Read the Full Article

Helm Chart Prometheus Rules

The best thing about using the Prometheus Operator to manage Prometheus in Kubernetes is the CRDs. Alert rules can be managed directly by the application Helm Charts, FluxCD, or your Cloud Native pipeline of choice. For me, for now, that is Helm Charts. The only problem is that both Helm and the Prometheus Alert Rules use Golang Text Templates. The question is: how does one keep the Helm Chart templates from getting ugly and super complex?

Read the Full Article

Thinking About Keyboards, Part The Second

It’s been quiet around LinuxCzar. Mostly because I’ve been thinking about keyboards again. This line of thought closely follows career changes. When opportunity knocks – as they say. I first learned how to program in a computer lab full of IBM PS/2 Model 25 machines, each with its own SSK form factor Model M buckling spring keyboard. I’m amazed at my memory of a lab full of those SSK keyboards (which are better known as the tenkeyless or TKL layouts today).

Read the Full Article

AWS Kinesis Outage

  December 7, 2020   Operations   aws

On Wednesday, November 25, 2020, AWS suffered an outage of their Kinesis service in the us-east-1 region. These RCA (Root Cause Analysis) write-ups are always illuminating. Not just as a peek inside how AWS services work, but as a chance for the industry as a whole to peer review best practices. I had a few key takeaways from my experience in Observability that I want to point out. At 5:15 AM PST, the first alarms began firing for errors on putting and getting Kinesis records.

Read the Full Article

Finding the Golden Signals with Prometheus

I was honored to be selected to speak at All Things Open 2020. I wanted to tell the story of architecting Fitbit’s Prometheus and Thanos solution for metrics and alerting, including the many things I learned and that I think are important to consider as a company scales out its observability platform. Oddly enough, some of this also applies to handling logs and events at scale. The talk was just uploaded to YouTube.

Read the Full Article