# Patching Prometheus for Better Histograms

April 22, 2017

Categories:
Operations

Tags:
foobar

Reading this blog, I hope it’s clear that I’m a fan of using histograms for metrics and visibility. Compared with a StatsD-like approach that offers a series of summary metrics, histograms give us the ability to:

- Actually visualize the distribution, typically with a heatmap. You can see if your distribution is multimodal, for example.
- Aggregate. You can add histograms (with the same bucket boundaries) together and produce summary metrics for an entire service. Remember, if you generate percentiles for each application instance, you cannot aggregate those to get a percentile for the entire service without the raw data.
- Estimate arbitrary quantiles/percentiles. You can even estimate the mean. Accuracy is controlled by the granularity of the bucket widths.
- Store significantly less than the raw data, although a bit more than a small set of summary metrics.
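
To make the aggregation point concrete, here is a minimal Python sketch. The bucket boundaries, instance names, and counts are all made up for illustration; the only real requirement is that every instance uses the same boundaries.

```python
# Hypothetical upper bounds (seconds) shared by every instance.
BOUNDS = [0.5, 1.0, 2.0, 5.0]


def merge_histograms(histograms):
    """Sum per-bucket counts across instances that share bucket boundaries."""
    merged = [0] * len(BOUNDS)
    for counts in histograms:
        if len(counts) != len(BOUNDS):
            raise ValueError("bucket layouts differ; cannot aggregate")
        for i, count in enumerate(counts):
            merged[i] += count
    return merged


# Made-up per-bucket (non-cumulative) counts from two instances.
instance_a = [90, 8, 2, 0]
instance_b = [10, 20, 50, 20]

service = merge_histograms([instance_a, instance_b])
print(service)  # [100, 28, 52, 20]
```

The merged histogram describes the whole service, so any quantile estimated from it reflects every observation. Averaging the two instances' own percentiles would not give you that.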

So, what is a histogram? Let’s say we want to measure response latencies from an application. Take the possible range of latencies, from 0 to 10 seconds for example, and break it into bins or buckets. If we have a bucket width of 2 seconds, then the first bucket counts observations from 0 to 2, the second bucket counts observations greater than 2 and less than or equal to 4, and so on.
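
As a rough sketch of that bucketing in Python (the sample latencies are invented, and dropping values above 10 seconds is a simplification of mine):

```python
import bisect

# Upper bounds for buckets of width 2 covering 0 to 10 seconds.
UPPER_BOUNDS = [2, 4, 6, 8, 10]


def bucket_counts(observations):
    """Count observations per bucket; bucket i holds values in (lower, upper]."""
    counts = [0] * len(UPPER_BOUNDS)
    for value in observations:
        # bisect_left returns the first index whose bound is >= value,
        # so a value of exactly 2 lands in the first bucket.
        index = bisect.bisect_left(UPPER_BOUNDS, value)
        if index < len(UPPER_BOUNDS):  # values above 10 are ignored here
            counts[index] += 1
    return counts


latencies = [0.3, 1.9, 2.0, 2.1, 3.7, 4.0, 9.5]  # made-up sample data
print(bucket_counts(latencies))  # [3, 3, 0, 0, 1]
```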

[ histogram examples here]

Prometheus implements histograms as cumulative histograms. This means the first bucket is a counter of observations less than or equal to 2, the second bucket is a counter of observations less than or equal to 4, etc. Each bucket contains the counts of all prior buckets.
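
A small Python sketch of the relationship between the two forms, using invented counts for the width-2 buckets described above:

```python
from itertools import accumulate

UPPER_BOUNDS = [2, 4, 6, 8, 10]
per_bucket = [3, 3, 0, 0, 1]  # made-up counts, one per bucket

# Each cumulative bucket counts every observation <= its upper bound.
cumulative = list(accumulate(per_bucket))
print(cumulative)  # [3, 6, 6, 6, 7]

# Going back: subtract the previous cumulative bucket from each one.
recovered = [c - p for c, p in zip(cumulative, [0] + cumulative[:-1])]
print(recovered == per_bucket)  # True
```

Prometheus also exposes a final catch-all bucket labeled `le="+Inf"`, which the sketch leaves out for brevity.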

[ cumulative histogram example here]

These are built on Prometheus’ counter metric type and each bucket is its own counter. This handles resets from process restarts, but it also means the histogram accumulates continuously rather than being broken into specific time windows. Next, we need to break this data down into histograms per time window to visualize or build summary metrics over time. We do this by taking the rate of the buckets over 1 minute or any other time range.

```
rate(http_request_latency_bucket[1m])
```

This produces a series of histograms that each contain data from the past 1 minute, one histogram per step interval on the graph. From here you can make use of the `histogram_quantile()` function to generate quantiles over time or a number of other things.
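
To give a feel for how a quantile falls out of cumulative buckets, here is a simplified Python sketch that interpolates linearly inside the bucket containing the target rank. It is in the same spirit as `histogram_quantile()`, but it is my own illustration and not Prometheus' code. The function name `estimate_quantile` and the bucket data are hypothetical.

```python
import math


def estimate_quantile(q, buckets):
    """Estimate the q-quantile from cumulative buckets.

    buckets: list of (upper_bound, cumulative_count) sorted by upper bound,
    with at least one finite bucket followed by a final (math.inf, total).
    """
    if not 0 <= q <= 1:
        raise ValueError("q must be between 0 and 1")
    total = buckets[-1][1]
    if total == 0:
        return math.nan
    rank = q * total
    # Find the first bucket whose cumulative count reaches the rank.
    for i, (upper, count) in enumerate(buckets):
        if count >= rank:
            break
    if math.isinf(upper):
        # Rank falls in the catch-all bucket: report the highest finite bound.
        return buckets[-2][0]
    lower = buckets[i - 1][0] if i > 0 else 0.0
    below = buckets[i - 1][1] if i > 0 else 0
    in_bucket = count - below
    if in_bucket == 0:
        return lower
    # Assume observations are spread evenly inside the bucket.
    return lower + (upper - lower) * (rank - below) / in_bucket


# Made-up cumulative counts for one time window.
buckets = [(1, 10), (2, 30), (5, 40), (math.inf, 40)]
print(estimate_quantile(0.5, buckets))  # 1.5
```

The even-spread assumption inside each bucket is what ties the accuracy of the estimate to the bucket widths.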

This is great, right? We finally have an Open Source time series database tool that gives histograms to the masses! Well, that’s what I thought.

## Knowing Your Distribution

The default histogram buckets are probably less than useful for what you are measuring. However, Prometheus’s implementation requires that you define the histogram bucket boundaries in code, up front, before you generate any data. How do you know what the data looks like to set useful bucket boundaries?

The normal answer is that you do not, and you will adjust your histogram at a later point. With Prometheus’s implementation this basically causes corruption of the histogram data when you query the time window in which the change occurs.

## Accuracy vs Cardinality

The Prometheus Best Practices documents state that the maximum cardinality of a metric should be about 10 unique label/value pairs. Clearly you need to control the number of label/value sets, as each set creates a unique time series in memory and on disk. Metrics that exceed 10 different label/value sets should be limited to a “handful” across your entire system.

This is reinforced in Brian Brazil’s excellent presentation on “Counting with Prometheus” around the 30 minute mark. (The Cloud Native Computing Foundation keeps re-releasing their recordings, so search for the presentation if the link goes dead.) Histograms should be used sparingly.

The default bucket boundaries for a histogram type metric create 10 buckets, which is the supposed maximum cardinality. However, if you want arbitrary quantile estimations to within 1% or 2%, you need hundreds of buckets. See my previous post about using Log Linear histograms with Prometheus.

Prometheus can handle millions of metrics, but think about using a couple of histograms with 100 buckets per REST API endpoint and per status code in a container application with 300 instances in the cluster. Suddenly, you have a million metrics.
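
Here is the back-of-the-envelope arithmetic as a short Python sketch. The endpoint and status code counts are assumptions I picked for illustration; only the 100 buckets and 300 instances come from the scenario above.

```python
histograms_per_endpoint = 2  # "a couple of histograms"
buckets_per_histogram = 100
endpoints = 6                # assumed number of REST API endpoints
status_codes = 3             # assumed status codes seen per endpoint
instances = 300

series = (histograms_per_endpoint * buckets_per_histogram
          * endpoints * status_codes * instances)
print(f"{series:,} bucket time series")  # 1,080,000 bucket time series
```

Each histogram also exports `_sum` and `_count` series, so the real total would be slightly higher.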

The advice that the Prometheus documentation gives is to set bucket boundaries at SLA points or other important numbers. This gives you accurate information about whether your quantile value exceeds your SLA. However, the accuracy of the quantile estimation itself is, at best, misleading. At worst, it is off by more than a couple hundred percent.
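
A rough way to see how large the error can get is to bound it per bucket. The Python sketch below assumes the estimate and the true value fall in the same bucket; the helper name and the boundaries are hypothetical.

```python
def worst_case_relative_error(lower, upper):
    """Upper limit on relative error for a quantile estimated in (lower, upper].

    The true value and the estimate both lie in the bucket, so they can differ
    by up to the bucket width. Dividing by the smallest possible true value
    (just above `lower`) gives the bound.
    """
    if lower <= 0:
        return float("inf")
    return (upper - lower) / lower


# Hypothetical SLA-oriented boundaries, in seconds.
bounds = [0.1, 0.5, 1.0, 5.0]
for lower, upper in zip(bounds, bounds[1:]):
    error = worst_case_relative_error(lower, upper)
    print(f"({lower}, {upper}]: up to {error:.0%}")
```

With these boundaries the bound works out to 400%, 100%, and 400% for the three buckets. For the first bucket, which starts at zero, the relative error is unbounded.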

So the choice here is between stability of the metrics platform and accuracy of quantile estimations. Is there a compromise? Is this better than using StatsD and taking the quantile of quantiles? (That should anger the math nerds out there!)

## The Lack of Scrape Atomicity

## Recording Rules and Federation