What Is Observability: A Practitioner's View

Charity Majors, from Honeycomb.io, recently wrote “Observability: A 3-Year Retrospective.” In it Charity takes a look back, attempts to define what Observability actually is, and lays out a set of capabilities any Observability platform must have. Charity’s vision helps all of us understand what we strive for in creating better software and services. However, I propose that there is a wide gap between the visionaries (or vendors) and the real Observability requirements for teams on the ground. Observability is a practice and not a product. It is time to define observability from a practitioner’s point of view.

Liz Fong-Jones, another visionary from Honeycomb quoted in Charity’s article, defines observability incredibly well. As an engineer and a practitioner of observability, I find that her definition rings true, and I take much inspiration from it.

Observability is not a binary state; it requires work over time. It’s the ability to swiftly surface relevant data to solve perf regressions, understand how your system works or your code behaves, and resolve incidents. Observability uses instrumentation to help provide context and insights to aid monitoring. While monitoring can help you discover there is an issue, observability can help you discover why.

Liz Fong-Jones

Let’s dig into this quote a little more closely and take a practitioner’s viewpoint. To wrap up, I’ll come back to the way Charity defined Observability by defining the capabilities that an Observability Platform should have, and I’ll give my own take on those.

As a side note, I absolutely refuse to refer to Observability or similar spaces in the Site Reliability Engineering world as “o11y.” Well, except when it’s really called for.

Over-Designing the Problem

Ever heard that it isn’t the destination but the journey that matters? The same is true with Observability. This isn’t something you can just “turn on.” Especially as things start to scale up. If anyone tells you otherwise then they are definitely selling something.

Observability is a practice. For SRE teams that own Observability, it is the practice of setting up an Observability Platform (whether it is built or bought). For all Software Engineers it’s the practice of writing observable code and microservices. This most likely means working with the SRE team to understand the standards and tools to code against so observability is equally applied and available across teams. It’s an iterative feedback loop.

Many teams reach for the latest hyped o11y trend or start-up firm, or lay out complex plans for a Three Pillar Observability Master System plus extra stuff. Here, danger lies. Observability has turned into a much-hyped, very vendor-focused space. Rather, answer the basic questions first and don’t paint the bike shed.

  1. Can customers reach the site?
  2. Are our application instances up and running?
  3. Is debugging information and stack traces available as forensics during an incident?

If these aren’t somehow met, the team does not need an alien-technology-based advanced tracing and distributed debugger solution. Tracing isn’t first; the customers are. In Charity’s 3-Year Retrospective, one of the best quotes is as follows.

The request is what maps to the user’s real lived experience.

Charity Majors

This is true, especially as complex systems evolve. However, it is quite common that teams cannot satisfy the three conditions above. Having those three conditions in place will go a long way toward handling customer-facing incidents.
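For the first question above, a simple synthetic probe is often all it takes to get started. Below is a minimal sketch using Java’s built-in HttpClient; the URL, timeout values, and failure handling are assumptions for illustration and would be replaced with a real health check or simulated login flow.

```java
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.time.Duration;

// A minimal synthetic probe: can customers reach the site?
// The URL and timeouts are placeholders for illustration only.
public class SiteProbe {
    public static void main(String[] args) throws Exception {
        HttpClient client = HttpClient.newBuilder()
                .connectTimeout(Duration.ofSeconds(5))
                .build();

        HttpRequest request = HttpRequest.newBuilder(URI.create("https://example.com/healthz"))
                .timeout(Duration.ofSeconds(10))
                .GET()
                .build();

        long start = System.nanoTime();
        HttpResponse<String> response = client.send(request, HttpResponse.BodyHandlers.ofString());
        long elapsedMs = (System.nanoTime() - start) / 1_000_000;

        // A real probe would push this result to an alerting or metrics system.
        System.out.printf("status=%d latency_ms=%d%n", response.statusCode(), elapsedMs);
        if (response.statusCode() >= 500) {
            System.exit(1); // treat server errors as a failed probe
        }
    }
}
```

Run this from a few regions on a schedule and the first question is answered continuously rather than anecdotally.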

From here I recommend building Observability Standards around being able to identify and dashboard the Google SRE’s 4 Golden Signals. Those signals are:

  1. Traffic – The unit of work the microservice does.
  2. Errors – The number of work units that failed to process correctly.
  3. Latency – A distribution of how long work units take to process.
  4. Saturation – A measure of capacity or limit of how many work units can be processed.

When these are understood for each service, a human paged for an incident will have a good chance of pinpointing where and what the problem is. Indeed, these are normally represented as metrics in a time-series database with a bit of math sprinkled on top. I believe that metrics are magic. Therefore, a good metrics system (like Prometheus) can and does help teams solve a wide array of incidents effectively. Metrics are quite efficient to store as well, providing obvious cost advantages. This usually hits a sweet spot where understanding is, to further quote the Google SRE Book, “as simple as possible, no simpler.”
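To make the signals concrete, here is a minimal sketch of what instrumenting them can look like, assuming the classic Prometheus simpleclient Java library; the service name, metric names, and port are hypothetical, not a prescribed standard.

```java
import io.prometheus.client.Counter;
import io.prometheus.client.Gauge;
import io.prometheus.client.Histogram;
import io.prometheus.client.exporter.HTTPServer;

// Illustrative 4 Golden Signals instrumentation for a hypothetical "checkout" service.
public class GoldenSignals {
    // Traffic: count every unit of work received.
    static final Counter requests = Counter.build()
            .name("checkout_requests_total").help("Units of work received.").register();

    // Errors: count units of work that failed to process correctly.
    static final Counter errors = Counter.build()
            .name("checkout_errors_total").help("Units of work that failed.").register();

    // Latency: a distribution of how long units of work take.
    static final Histogram latency = Histogram.build()
            .name("checkout_duration_seconds").help("Time spent processing work.").register();

    // Saturation: how full the service is relative to its limit.
    static final Gauge inFlight = Gauge.build()
            .name("checkout_in_flight_requests").help("Work currently in progress.").register();

    static void handleRequest(Runnable work) {
        requests.inc();
        inFlight.inc();
        Histogram.Timer timer = latency.startTimer();
        try {
            work.run();
        } catch (RuntimeException e) {
            errors.inc();
            throw e;
        } finally {
            timer.observeDuration();
            inFlight.dec();
        }
    }

    public static void main(String[] args) throws Exception {
        // Expose /metrics for Prometheus to scrape; the port is an arbitrary choice.
        new HTTPServer(9400);
        handleRequest(() -> { /* real work would go here */ });
        Thread.currentThread().join();
    }
}
```

With counters, a histogram, and a gauge like these, request rates, error ratios, latency quantiles, and saturation trends all fall out of the query language on the metrics side.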

But this doesn’t meet Charity’s definition of Observability! Neither do the Google SRE Books really mention “observability.” Well, they do say you can Make Troubleshooting Easier by whipping out a can of Observability, but the list of ingredients is not mentioned. Indeed, the Google SRE books were published around the time that the use of the word “observability” started taking off.

Understanding Unknown Unknowns

Donald Rumsfeld. I can’t help but think of his quote from February 2002 when I see this phrase. Perhaps this has ruined the concept for me: my brain, which works so well with concrete concepts, starts filtering out the vague or inscrutable concepts that are surely about to follow such a phrase.

However, this is all about objectively classifying concepts. If the brain’s working memory acts like an inverted index, then it contains a list of concepts that link to the deeper understanding of each concept. When a concept is presented, it will either be present in the inverted index or not. If present, the linked understanding either offers an explanation or indicates that the concept is not understood.

A Known Known

The concept is present in the inverted index (known) and links to understanding of that concept (known). I know that if I count the number of HTTP API hits a service receives, I can understand the rate per second of that traffic. If the rate drops, the service isn’t receiving the same level of traffic.

A Known Unknown

In this case the concept is in the inverted index – we are familiar with it (known). However, we know that we do not understand this topic. An HTTP load balancer like an AWS ALB is a known concept, but how it routes HTTP traffic to the (in)correct service may not be understood by a junior SRE.

Or, my favorite: I know that gravity is real, but to truly understand it, a quantum theory of gravity must be formulated to explain how it interacts with other quantum fields.

An Unknown Known

This is experience or intuition. This is the ability to jump straight to understanding of a concept or problem. Sporadic lag or failure patterns appear in a new (unknown) service, and an experienced SRE immediately checks to see if DNS (known) is working and solves the incident. After all, it’s always DNS.

An Unknown Unknown

Here is utter befuddlement. There is no understanding (unknown) of what is broken and no ability (unknown) to leap toward a solution. It is a lack of information – other than the customers screaming about the product. Here be dragons and the stuff nightmares are made of. Progress isn’t made without finding further information, which may take quite some time.
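To make the classification concrete, here is a toy sketch (my own illustration, not from either article) that treats the inverted index as a simple map: awareness of a concept and understanding of it are two independent booleans, which yields the four quadrants above.

```java
import java.util.Map;

// Toy illustration of the four quadrants: the "inverted index" maps a concept
// we are aware of (known) to whether we actually understand it.
public class Quadrants {
    static final Map<String, Boolean> index = Map.of(
            "HTTP request rate", true,   // known known: aware of it and understand it
            "ALB routing rules", false   // known unknown: aware of it, do not understand it
    );

    static String classify(String concept, boolean solvedByIntuition) {
        if (index.containsKey(concept)) {
            return index.get(concept) ? "known known" : "known unknown";
        }
        // Not in the index at all: experience may still leap straight to an answer.
        return solvedByIntuition ? "unknown known" : "unknown unknown";
    }

    public static void main(String[] args) {
        System.out.println(classify("HTTP request rate", false));          // known known
        System.out.println(classify("ALB routing rules", false));          // known unknown
        System.out.println(classify("flaky DNS in a new service", true));  // unknown known
        System.out.println(classify("exemplars silently missing", false)); // unknown unknown
    }
}
```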

I wanted to build a demo of using Prometheus custom metrics in a Java microservice. Something I have done many times. To take it to the next level I integrated the code with the OpenTelemetry Agent and expected the Prometheus client Java library to begin to add exemplars to the Prometheus metrics endpoint. But it did not. The code looked correct to me and others in the Prometheus community, yet exemplars did not work. I had no idea why exemplars were not working and couldn’t come up with an explanation. Even debugging the code by hand didn’t produce insight. I decided I’d better build a different demo.

This is precisely where SREs and Software Engineers want and need tools to discover new information about how the system is working. Better tools reveal this information quickly by fast database queries or linking bits of information together like breadcrumbs. It is this ability to “surface relevant data to solve […] regressions, understand how your system works or your code behaves, and resolve incidents” that transitions into Observability.

Keep in mind the SRE mantra “As Simple as Possible, No Simpler.” These systems are large scale, fun to build, complex, and amazingly useful. But this doesn’t solve the problem of the dragons either.

Monitoring Versus Debugging

Let’s take stock. The “Observability Platform” described so far is very much on the side of metrics and monitoring, with the acknowledgment that there is this concept of an “unknown unknown.” This platform enables basic symptom-based monitoring, where the most meaningful alerts live. In the case of an “unknown unknown” issue, this level notifies the team that a customer-impacting incident is in progress. Also, layered on top is tracking the 4 Golden Signals, which provides our Service Level Indicators (SLIs) for each service or application. When that page comes in, this dashboarding, or better, existing low-priority alerting, can quickly identify the affected service or application in the stack.

At this point these SLIs can be used to build Service Level Objectives (SLOs). This means that building quantitative goals for the microservices that make up the entire stack is possible. We are Software Engineers; isn’t it time we use the tools of the Engineering trade to quantitatively automate and prove our goals? These SLOs build statistical models around:

  • Availability
  • Latency
  • Throughput
  • Correctness

In turn, these statistical models allow the teams to create specific goals and measure (quantitatively) their achievement against them. Not only does this provide a rigorous method to spot availability or latency problems that may be causing a customer-facing incident, but it provides feedback to the company for how to steer teams toward a better product and a better customer experience.
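As a sketch of what “quantitatively” means here, the arithmetic behind a simple availability SLO is tiny: the error budget is one minus the target, and the amount of budget burned is the observed error ratio divided by that budget. The target and request counts below are made-up numbers for illustration.

```java
// Toy availability SLO arithmetic; the target and counts are invented for illustration.
public class SloMath {
    public static void main(String[] args) {
        double target = 0.999;                 // 99.9% availability objective
        double errorBudget = 1.0 - target;     // 0.1% of requests may fail

        long total = 1_200_000;                // requests observed this window
        long failed = 600;                     // requests that failed

        double errorRatio = (double) failed / total;
        double budgetBurned = errorRatio / errorBudget;  // 1.0 means the whole budget is gone

        System.out.printf("error ratio   %.5f%n", errorRatio);
        System.out.printf("budget burned %.1f%%%n", budgetBurned * 100);
    }
}
```

In practice the same calculation is done continuously against the SLI metrics, and alerting fires on the rate at which the budget is being burned rather than on any single failure.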

Clearly, this uses instrumentation not only to provide monitoring, but to be able to accurately measure and create statistical models. This instrumentation provides a rich insight that is integral to the metrics and monitoring side of the Observability problem space. Instrumentation isn’t just to “provide context and insights to aid monitoring.” A standard instrumentation practice is essential to understanding the health and performance of the microservices that make up the whole customer experience. It is only at this point that a team facing an incident understands the “what” and can begin to formulate the question “why?”

Now, who is this platform for? An incident has been identified, a service is seen as the likely culprit, and the change of behavior of that service (over time) is known. The Operations, DevOps, or SRE team (whatever we’re calling it today) has been able to successfully triage the problem using advanced Engineering tactics (well, okay, Statistics). But who is this system for? How does this help the developer team identify the root cause?

The point here is that understanding the health and performance of the microservices that make up the stack is totally and completely essential to success. Using the Scientific Method and math, all of this can be done with a robust metrics and monitoring platform without sampling or filtering data. In fact, this is why I like Prometheus and similar systems so very much. It is a cost-effective and simple system upon which to build Observability. But the drawback here is that this is really for the SREs only. It doesn’t help the application teams other than communicating who is slow in the chain. While one cannot have Observability without the ability to monitor, it becomes Observability when the feedback loop between the customers, SRE teams, and the application developer teams is fully closed. The loop is closed when an engineer on the application team can select an especially laggy request, pivot over to the Events subsystem of the Observability Platform, and identify that a specific customer is able to trigger SQL queries that cause the high latency. An Observability Platform truly is observability when it provides the ability to perform distributed debugging on top of watching for incidents.

This gets to my definition of Observability. It isn’t about how to deal with unknown unknowns. Rather, observability is being able to analyze streaming telemetry data from applications, identify health and performance incidents, define team and company goals, understand the customer experience, and provide data to allow for efficient debugging of the underlying code base. Observability doesn’t replace monitoring, but it stands on the shoulders of advanced monitoring systems to close the feedback loop for all of the teams involved in an incident.

Capabilities

Charity takes to defining Observability by the capabilities an Observability Platform must have. I really like this approach to focusing on what, exactly, Observability is. However, my capabilities vary a bit. This reflects the belief that Observability requires the underlying foundation of monitoring coupled with a strategy for how it will be implemented.

  1. Synthetic User Monitoring: The ability to run HTTPS probes against the application and simulate user actions such as a successful login. This testing is geo-regional in scope and provides insight into the user experience or site failure from around the global Internet.
  2. Metrics: The ability to efficiently store for years the health and status information of infrastructure and the health and performance of applications. This includes recording the 4 Golden Signals in a way that allows year-over-year trend analysis. Schema is present to identify and namespace the source of the data.
  3. Structured Events: Ingest and store short-term (months) raw JSON structures emitted by applications. These structures follow the same basic schema that namespaces and identifies where the data originates and optionally includes Trace IDs to form a directed acyclic graph (DAG) to represent a user request. A sketch of such an event follows this list.
  4. Robust Query Language: A query expression language that allows for slicing and dicing the data along any recorded dimension and provides the ability to do arbitrary mathematical operations. This forms the basis of the advanced statistical analysis that becomes possible. It should support the ability to robustly aggregate Timers and build arbitrary and accurate quantiles.
  5. Real Time Alerting: Alert expressions can be run and generate results near real time. Alerting closer to the application rather than the far reaches of the Observability Platform provides key advantages here.
  6. Pipelines: Building of a telemetry data pipeline from the application containers all the way to the Observability Platform. Schema to identify and namespace telemetry is automatically added in a standard way, and controls exist to filter telemetry traffic at all levels of the pipeline.
  7. Exploratory User Interface: A graphical user interface for creation of 4 Golden Signals dashboards and to understand infrastructure status and application performance over time. Combined with tools to be able to explore the data in an ad-hoc manner using the same query language. This provides a single pane of glass interface to Synthetics, Metrics, Events, and Alerts.
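For the Structured Events capability, the shape of the event matters more than the transport. Here is a hypothetical event a service might emit; the field names and the namespacing scheme are assumptions, but the idea is a flat JSON structure carrying the same schema as the metrics plus a Trace ID so individual requests can be stitched into a DAG.

```java
import java.time.Instant;
import java.util.UUID;

// Hypothetical structured event for a single unit of work. The field names and the
// namespacing scheme ("team", "service") are illustrative assumptions.
public class StructuredEvent {
    public static void main(String[] args) {
        String traceId = UUID.randomUUID().toString();  // normally propagated, not minted here
        String event = String.format(
                "{\"timestamp\":\"%s\"," +
                "\"team\":\"payments\"," +
                "\"service\":\"checkout\"," +
                "\"trace_id\":\"%s\"," +
                "\"span_id\":\"a1b2c3d4\"," +
                "\"http_route\":\"/api/v1/charge\"," +
                "\"status_code\":200," +
                "\"duration_ms\":187," +
                "\"customer_id\":\"c-42\"}",
                Instant.now(), traceId);

        // In practice this would go to the telemetry pipeline, not stdout.
        System.out.println(event);
    }
}
```

An event like this is what lets the engineer in the earlier example pivot from a laggy request on a dashboard to the specific customer and query responsible.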

I believe that this set of capabilities can and should be solution agnostic or made of multiple solutions. Each solution will craft the Observability Platform slightly more in favor of the use case of the business. Using vendors strategically in this Platform also works well to outsource some of the work in a way that doesn’t break the budget. However, it is my experience that no vendor provides a top-notch solution to all of these capabilities that seamlessly fits together.

Maybe I’m wrong. I’ve been very much in the wrong on this blog before and I will be again! However, I don’t have anything to sell you, but I have seen this sort of platform create a cost-efficient and amazingly useful Observability solution for large scale companies. True, this does lean toward the Google SRE path, which may appear a bit dated. However, this Platform puts you in the driver’s seat to create automation and algorithms that watch your telemetry rather than having a human intuit their way through an outage.

Looking for more? Check out Episode 125 of the Practical Operations Podcast!
