Engineering Software, Linux, and Observability. The website of Jack Neely.    

Writing Change Management Announcements

  August 23, 2021   Operations   docs

The most important part of a Change Management process is simply being able to tell the stakeholders about an upcoming change. This is especially true in a world where email is often left unread, there is no one “centralized” group or team making changes, and instant messenger is possibly the only real means of communication. Ideally, a Master Station Log (MSL) or Service Status Page is the focal point for communications about planned and emergent events. Additional policies may define the channels used to broadcast those messages and get the attention of the target audience. Whether or not any of that exists, writing the change announcement really is the most critical part.

Read the Full Article

Helm Chart Prometheus Rules

The best thing about using the Prometheus Operator to manage Prometheus in Kubernetes is the CRDs. Alert rules can be managed directly by the application Helm Charts, FluxCD, or your Cloud Native pipeline of choice. For me, for now, that is Helm Charts. The only problem is that both Helm and the Prometheus Alert Rules use Golang Text Templates. The question is: how does one keep the Helm Chart templates from getting ugly and super complex?

Read the Full Article

Thinking About Keyboards, Part The Second

It’s been quiet around LinuxCzar, mostly because I’ve been thinking about keyboards again. This line of thought closely follows career changes; when opportunity knocks, as they say. I first learned how to program in a computer lab full of IBM PS/2 Model 25 machines, each with its own SSK form factor Model M buckling spring keyboard. I’m amazed at my memory of a lab full of those SSK keyboards (a layout better known today as tenkeyless or TKL).

Read the Full Article

AWS Kinesis Outage

  December 7, 2020   Operations   aws

On Wednesday, November 25, 2020, AWS suffered an outage of their Kinesis service in the us-east-1 region. These RCA (Root Cause Analysis) write-ups are always illuminating, not just as a peek inside how AWS services work, but as a chance for the industry as a whole to peer review best practices. I had a few key takeaways from my experience in Observability that I want to point out. At 5:15 AM PST, the first alarms began firing for errors on putting and getting Kinesis records.

Read the Full Article

Finding the Golden Signals with Prometheus

I was honored to be selected to speak at All Things Open 2020. I wanted to tell the story of architecting Fitbit’s Prometheus and Thanos solution for metrics and alerting, including the many things I learned and that I think are important to consider as a company scales out its observability platform. Oddly enough, some of this also applies to handling logs and events at scale. The talk was just uploaded to YouTube.

Read the Full Article

42 Lines Site Reliability Engineering

In October I had my first foray into webinars. The consulting company I work with, 42 Lines, Inc., gave me the opportunity to help put together some marketing material. We’ve been providing services that we call “DevOps Consulting” for a long time. However, I wanted to push forward with the idea that what we’ve been doing may be best described today with the Site Reliability Engineering terminology. It’s true. For many years I’ve believed in monitoring (now Observability) first, setting measurable goals, using these tools for automation, and encouraging continual improvement from data.

Read the Full Article

Quick and Dirty Sockets

  August 23, 2020   Operations   sockets

An axiom of programming usually found in the Operational, DevOps, and SRE spaces is this: Don’t write a socket server. There’s good reason. There are a lot of edge cases to handle and, for lack of a better phrase, magic sauce to be efficient and secure. Not to mention there are a lot of libraries already written to do this for you. However, this really applies to writing socket clients as well.

Read the Full Article

Calculating the Error of Quantile Estimation with Histograms

I’ve posted about Histograms before. They fascinate me with the prowess they bring to recording telemetry about high volume events. The representations on disk can be quite small, but they can be used to produce a slew of highly accurate summary statistics. Histograms are both Robust and Aggregatable data structures, and it’s rare to have a data type that combines both of those features in the field of Observability, especially with metrics.
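
As a rough illustration of quantile estimation from a histogram, the sketch below uses linear interpolation inside the bucket that contains the target rank. This is one common approach and not necessarily the method the full article uses. The function name, the bucket boundaries, the counts, and the assumption that the first bucket starts at 0 are all hypothetical. Under this approach the estimate's absolute error can be no larger than the width of the bucket it falls in.

```python
from bisect import bisect_left
from itertools import accumulate


def estimate_quantile(q, upper_bounds, counts):
    """Estimate the q-quantile from bucketed counts (hypothetical helper).

    upper_bounds: sorted upper boundary of each bucket; the first bucket
                  is assumed to start at 0.
    counts:       observations per bucket (not cumulative).
    Returns (estimate, bucket_low, bucket_high).
    """
    if not 0 < q <= 1:
        raise ValueError("q must be in (0, 1]")
    cumulative = list(accumulate(counts))
    total = cumulative[-1]
    if total == 0:
        raise ValueError("histogram is empty")

    rank = q * total                   # position of the quantile in the sorted data
    i = bisect_left(cumulative, rank)  # first bucket whose cumulative count reaches rank
    low = upper_bounds[i - 1] if i > 0 else 0.0
    high = upper_bounds[i]
    below = cumulative[i - 1] if i > 0 else 0

    # Assume observations are spread evenly within the bucket.
    fraction = (rank - below) / counts[i]
    return low + (high - low) * fraction, low, high


# Illustrative data only: request latencies in seconds.
bounds = [0.1, 0.5, 1.0, 5.0]
counts = [50, 30, 15, 5]
estimate, low, high = estimate_quantile(0.9, bounds, counts)
max_error = high - low  # the true value lies somewhere inside this bucket
```

Narrower buckets around the quantiles of interest shrink `max_error`, which is the usual trade-off against the size of the histogram.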

Read the Full Article

SRE: How to Count With Logs

  July 30, 2020   Operations   sre

Before an SRE can get into advanced anomaly detection or statistical models, that SRE must first build the skills of recording events and counting them. If you are just starting this journey, focus first on gathering events from a small service. Gather these in the form of logs. A log message is an ordered record of a unique event. Of course, ordered means we apply a time stamp to the log message.
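
Counting timestamped events can be sketched in a few lines. The example below assumes a hypothetical log format in which each line begins with an ISO 8601 time stamp followed by a space. The function name and the sample lines are illustrative and are not taken from the full article.

```python
from collections import Counter
from datetime import datetime


def count_per_minute(lines):
    """Count log events per minute (hypothetical helper).

    Each line is assumed to start with an ISO 8601 time stamp,
    e.g. "2020-07-30T12:00:01 GET /health 200".
    Lines without a parseable time stamp are skipped.
    """
    counts = Counter()
    for line in lines:
        stamp, _, _rest = line.partition(" ")
        try:
            when = datetime.fromisoformat(stamp)
        except ValueError:
            continue  # not an event we can place in time
        # Truncate to the minute so events in the same minute share a key.
        counts[when.replace(second=0, microsecond=0)] += 1
    return counts


# Illustrative data only.
sample = [
    "2020-07-30T12:00:01 GET /health 200",
    "2020-07-30T12:00:42 GET /login 500",
    "2020-07-30T12:01:05 GET /health 200",
    "not a log line",
]
per_minute = count_per_minute(sample)
```

Because every event carries a time stamp, the same loop could bucket by hour or by status code instead of by minute.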

Read the Full Article