LinuxCzar

Engineering Software, Linux, and Observability. The website of Jack Neely.

Cardinality Cloud and Prometheus SLOs

I’d like to announce my new venture, Cardinality Cloud, LLC. I’ve always wanted to take what I’ve done for hyper-growth companies like Palo Alto Networks and Fitbit and make it more available to everyone. Readers of the site will not be surprised that I’m specializing in Site Reliability Engineering (SRE) and Observability consulting with a focus on cost-effective, Open Source setups. I am also launching products aimed at making that even easier. The first one is the Prometheus Alert and SLO Generator! Please take a look and let me know what you think.

Read the Full Article

Monitorama PDX 2023: Finding Π in Observability

It’s been a long time since I’ve done a presentation in front of real live people. I was honored to be selected as a speaker at Monitorama PDX 2023, which was held last week. I put together a talk on some of the challenges that I, and many others, face in the quest to improve Observability. I wanted to touch on some mathematical concepts that underlie the choices I make, and why the Four Golden Signals really matter. I close out the talk with an experiment to do the impossible: aggregating quantiles together to form larger quantiles. Take a listen to find out what happened.

Read the Full Article

Logging and Eventing with the SEARCH Method

When working with Software Engineering teams to improve their observability, I often find that working with a method helps tremendously. A method with a short and catchy acronym really drives the lessons home. Soon I’ve got management writing notes about my acronyms and including them in planning meetings. Methods have a winning madness. Now let’s use them for logging too.

Read the Full Article

Prometheus Exemplars in Java Spring Boot

Java isn’t my favorite language to work in. However, I realized that to roll out a successful Observability plan I needed good examples, and most of the teams I work with create Spring Boot applications. So off I went to create a simple Java Spring Boot application that demonstrates a structured logging approach, metrics following the Four Golden Signals pattern, and integration with basic tracing. The goal is to show how to pivot from one tool’s data to another’s. The problem was working with Prometheus’s Exemplars.

Read the Full Article

What Is Observability: A Practitioner's View

Charity Majors, from Honeycomb.io, recently wrote “Observability: The 5-Year Retrospective.” In it Charity takes a look back, attempts to define what Observability actually is, and lays out a set of capabilities any Observability platform must have. Charity’s vision helps all of us understand what we strive for in creating better software and services. However, I propose that there is a wide gap between the visionaries (or vendors) and the real Observability requirements for teams on the ground. Observability is a practice and not a product. It is time to define observability from a practitioner’s point of view.

Read the Full Article

Helm Chart Prometheus Rules

The best thing about using the Prometheus Operator to manage Prometheus in Kubernetes is the CRDs. Alert rules can be managed directly by the application Helm Charts, FluxCD, or your Cloud Native pipeline of choice. For me, for now, that is Helm Charts. The only problem is that both Helm and the Prometheus Alert Rules use Golang Text Templates. The question is: how does one keep the Helm Chart templates from getting ugly and super complex?

Read the Full Article

Thinking About Keyboards, Part The Second

It’s been quiet around LinuxCzar, mostly because I’ve been thinking about keyboards again. This line of thought closely follows career changes. When opportunity knocks – as they say.

I first learned how to program in a computer lab full of IBM PS/2 Model 25 machines, each with its own SSK form factor Model M buckling spring keyboard. I’m amazed at my memory of a lab full of those SSK keyboards (a layout better known today as tenkeyless, or TKL). Those go for a pretty penny on eBay as of this writing, usually about the same price as an entire Model 25.

Read the Full Article

AWS Kinesis Outage

  December 7, 2020   Operations   aws

On Wednesday, November 25, 2020, AWS suffered an outage of their Kinesis service in the us-east-1 region. These RCA (Root Cause Analysis) write-ups are always illuminating, not just as a peek inside how AWS services work, but as a chance for the industry as a whole to peer review best practices. I had a few key takeaways from my experience in Observability that I want to point out.

At 5:15 AM PST, the first alarms began firing for errors on putting and getting Kinesis records. Teams engaged and began reviewing logs.

Read the Full Article

Finding the Golden Signals with Prometheus

I was honored to be selected to speak at All Things Open 2020. I wanted to tell the story of architecting Fitbit’s Prometheus and Thanos solution for metrics and alerting, including the many things I learned that I think are important to consider as a company scales out its observability platform. Oddly enough, some of this also applies to handling logs and events at scale. The talk was just uploaded to YouTube.

Read the Full Article