LinuxCzar

Engineering Software, Linux, and Operations. The website of Jack Neely.    

Thinking About Keyboards, Part The Second

It’s been quiet around LinuxCzar, mostly because I’ve been thinking about keyboards again. This line of thought closely follows career changes. When opportunity knocks, as they say. I first learned how to program in a computer lab full of IBM PS/2 Model 25 machines, each with its own SSK form factor Model M buckling spring keyboard. I’m amazed at my memory of a lab full of those SSK keyboards (better known today as the tenkeyless or TKL layout).

Read the Full Article

AWS Kinesis Outage

  December 7, 2020   Operations   aws
On Wednesday, November 25, 2020, AWS suffered an outage of their Kinesis service in the us-east-1 region. These RCA (Root Cause Analysis) write-ups are always illuminating, not just as a peek inside how AWS services work, but as a chance for the industry as a whole to peer review best practices. I had a few key takeaways from my experience in Observability that I want to point out. At 5:15 AM PST, the first alarms began firing for errors on putting and getting Kinesis records.

Read the Full Article

Finding the Golden Signals with Prometheus

I was honored to be selected to speak at All Things Open 2020. I wanted to tell the story of architecting Fitbit’s Prometheus and Thanos solution for metrics and alerting, including the many things I learned that I think are important to consider as a company scales out its observability platform. Oddly enough, some of this also applies to handling logs and events at scale. The talk was just uploaded to YouTube.

Read the Full Article

42 Lines Site Reliability Engineering

In October I had my first foray into webinars. The consulting company I work with, 42 Lines, Inc., gave me the opportunity to help put together some marketing material. We’ve been providing services that we call “DevOps Consulting” for a long time. However, I wanted to push forward with the idea that what we’ve been doing may be best described today with the Site Reliability Engineering terminology. It’s true. I’ve believed in monitoring (now Observability) first, setting measurable goals, using these tools for automation, and encouraging continual improvement from data for many years.

Read the Full Article

Quick and Dirty Sockets

  August 23, 2020   Operations   sockets
An axiom of programming usually found in the Operational, DevOps, and SRE spaces is this: Don’t write a socket server. There’s good reason. There are a lot of edge cases to handle and, for lack of a better phrase, magic sauce to be efficient and secure. Not to mention there are a lot of libraries already written to do this for you. However, this really applies to writing socket clients as well.

Read the Full Article

Calculating the Error of Quantile Estimation with Histograms

I’ve posted about Histograms before. They fascinate me with the prowess they bring to recording telemetry about high volume events. The representations on disk can be quite small, but they can be used to produce a slew of highly accurate summary statistics. Histograms are both Robust and Aggregatable data structures, and it’s rare to have a data type that combines both of those features in the field of Observability, especially with metrics.

Read the Full Article

SRE: How to Count With Logs

  July 30, 2020   Operations   sre
Before an SRE can get into advanced anomaly detection or statistical models, that SRE must first master recording events and counting them. If you are just starting this journey, focus first on gathering events from a small service. Gather these in the form of logs. A log message is an ordered record of a unique event. Of course, ordered means we apply a timestamp to the log message.
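As a rough illustration of counting events from timestamped logs, here is a minimal Python sketch. The log format (an ISO 8601 timestamp, a level, then a message), the sample lines, and the `count_by_minute` helper are all assumptions made up for this example, not anything taken from the full article.

```python
from collections import Counter
from datetime import datetime

# Hypothetical log lines: timestamp, level, then a free-form message.
LOG_LINES = [
    "2020-07-30T12:00:01Z INFO request served",
    "2020-07-30T12:00:42Z ERROR upstream timeout",
    "2020-07-30T12:01:05Z INFO request served",
    "2020-07-30T12:01:09Z ERROR upstream timeout",
    "2020-07-30T12:01:30Z ERROR upstream timeout",
]


def count_by_minute(lines, level):
    """Count log events of the given level, bucketed by minute."""
    counts = Counter()
    for line in lines:
        # Split into timestamp, level, and the rest of the message.
        stamp, line_level, _message = line.split(" ", 2)
        if line_level != level:
            continue
        when = datetime.strptime(stamp, "%Y-%m-%dT%H:%M:%SZ")
        counts[when.strftime("%Y-%m-%d %H:%M")] += 1
    return counts


errors = count_by_minute(LOG_LINES, "ERROR")
for minute, total in sorted(errors.items()):
    print(minute, total)
```

The timestamp is what makes the count useful: without it, the events could be totaled but not placed into time buckets.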

Read the Full Article

What is a Site Reliability Engineer?

A Site Reliability Engineer (SRE) is a role within your company. Many of the concepts this role embodies existed before it was popularized by Google. However, Google did write the books and provide the methods we currently think of for an SRE. Site Reliability Engineers work across many teams and disciplines to bring about the DevOps culture we’ve been talking about for over a decade. In contrast, DevOps is very much an organizational and cultural shift to break down silos and communication barriers between teams.

Read the Full Article

Open Observability: SRE Prometheus Tips

I was honored to be asked to help the Open Observability folks kick off the first episode of their vidcast! This talk is an adaptation of the lightning talk I did at Monitorama PDX 2019 with some expanded material for alerting on Service Level Objectives (SLOs) with Prometheus. Also, the slide deck I used is below. A blog post featuring the continuing Open Observability Talks from Logz.

Read the Full Article

A Site Reliability Engineer Series

  June 23, 2020   Operations   sre
Are you a Site Reliability Engineer? Perhaps your team is looking to start an SRE journey or expand their resources on hand. Or maybe you are looking to improve your skill set. We’re all engineers, and it’s about time we actually use mathematics to prove our designs and practices. We didn’t get to the Moon by winging it after mashing together some components into a workable system. Rather, we used math to ensure our success.

Read the Full Article