Histograms with Prometheus: A Tale of Woe
I’m a big fan of using histograms for metrics and visibility. Compared to a StatsD-like approach that offers only a series of summary metrics, histograms give us the ability to:
- Actually visualize the distribution. You can see if your distribution is multimodal, for example. This is done with a heatmap.
- Aggregate. You can aggregate histograms (with the same bucket boundaries) together and produce summary metrics for an entire service. Remember, if you generate percentiles for each application instance you cannot aggregate those to get a global percentile for the entire service without the raw data.
- Histograms can be used to produce arbitrary quantile/percentile estimations. You can even estimate the mean. Accuracy is controlled by the granularity of the bucket widths. Prometheus even maintains a sum of all observations to produce an exact arithmetic mean.
- Use significantly less storage than the raw data, although a bit more than a small set of summary metrics.
So, what is a histogram? Let’s say we want to measure response latencies from an application. Take the possible range of latencies, from 0 to 10 seconds for example, and break it into bins or buckets. If we have a bucket width of 0.5 seconds, then the first bucket would count observations from 0 to 0.5, the second bucket would count observations greater than 0.5 and less than or equal to 1, etc.
Prometheus implements histograms as cumulative histograms. This means the first bucket is a counter of observations less than or equal to 0.5, the second bucket is a counter of observations less than or equal to 1, etc. Each bucket contains the counts of all prior buckets.
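The difference between the two forms can be sketched in a few lines of Python. The bucket boundaries and latency values here are made up for illustration:

```python
import bisect

# Hypothetical bucket upper bounds (seconds) and observed latencies.
bounds = [0.5, 1.0, 1.5, 2.0]          # the +Inf bucket is implicit
observations = [0.3, 0.4, 0.7, 1.2, 1.6, 3.0]

# Plain histogram: count observations per bucket. bisect_left gives the
# first bound >= the observation, matching Prometheus's "le" semantics.
plain = [0] * (len(bounds) + 1)        # last slot is the +Inf bucket
for obs in observations:
    plain[bisect.bisect_left(bounds, obs)] += 1

# Cumulative histogram (Prometheus style): each bucket counts
# everything less than or equal to its upper bound.
cumulative = []
running = 0
for count in plain:
    running += count
    cumulative.append(running)

print(plain)        # [2, 1, 1, 1, 1]
print(cumulative)   # [2, 3, 4, 5, 6]
```

Note that the last cumulative bucket (+Inf) always equals the total observation count.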
These are built on Prometheus’s counter metric type and each bucket is its own counter. This handles resets from process restarts and provides a histogram that, basically, grows over time as observations are recorded.
Next, we need to break this data down into histograms per time window to visualize or build summary metrics over time. If we want a mean or a percentile we need a specific, uniform time window over which to build that summary metric. For example every 60 seconds.
We do this by taking the rate of the buckets over 1 minute (or your time window of choice). This gives us the change in the counters over the last minute, or, equivalently, a data set of the counts of observations over the last minute. In pseudo-PromQL:

rate(http_request_latency_bucket[1m])
This produces a series of histograms that each contain data from the past 1 minute, one histogram per step interval on the graph. From here you can make use of the histogram_quantile() function to generate quantiles per 1 minute bucket and graph that through time. Or a number of other things.
This is great, right? We finally have an Open Source time series database tool that gives histograms to the masses! Well, that’s what I thought.
Knowing Your Distribution
The default histogram buckets are probably less than useful for what you are measuring. However, Prometheus’s implementation requires that you define the histogram bucket boundaries in code, up front, before you generate any data. How do you know what the data looks like to set useful bucket boundaries?
The normal answer is that you do not, and you will adjust your histogram at a later point. With Prometheus’s implementation this basically causes corruption of the histogram data when you query across the time window where the re-bucketing change happens.
If you use the default histogram buckets, or guess poorly (likely) the first time around, you will probably see a straight line at 10 (or your highest bucket boundary) when you calculate quantiles on this histogram. This happens because the requested quantile, along with most of your observations, falls into the +Inf bucket. No linear approximation saves you here; the only “sane” thing to do is return the lower known bound.
In any case, this doesn’t give you any idea about your observations over time. It doesn’t even help one figure out how to better adjust the bucket boundaries. About the only thing you can do here is use the sum of observations and divide by the count of observations to get a mean of the real data. In pseudo-PromQL:
rate(http_request_latency_sum[1m]) / rate(http_request_latency_count[1m])
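Concretely, with hypothetical counter samples taken one minute apart, that expression works out like this; the window length cancels in the division, leaving total latency divided by total requests:

```python
# Hypothetical _sum and _count counter samples, scraped 60s apart.
sum_t0, sum_t1 = 1042.0, 1102.0      # cumulative seconds of latency
count_t0, count_t1 = 5000, 5200      # cumulative request count

rate_sum = (sum_t1 - sum_t0) / 60    # latency-seconds per second
rate_count = (count_t1 - count_t0) / 60   # requests per second
mean_latency = rate_sum / rate_count
print(mean_latency)                  # 60 / 200 = 0.3 seconds
```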
Accuracy vs Cardinality
The Prometheus Best Practices documents state that the maximum cardinality of a metric should be about 10 unique label/value pairs. Clearly you need to control how many sets of label/value pairs you create, as each set makes a unique time series in memory and on disk. But metrics that exceed 10 different label/value sets should be limited to a “handful” across your entire system.
This is reinforced in Brian Brazil’s excellent presentation on “Counting with Prometheus” around the 30 minute mark. (The Cloud Native Computing Foundation keeps re-releasing their recordings; search for the presentation if the link goes dead.) Histograms should be used sparingly.
The default bucket boundaries for a histogram type metric create 10 buckets, which is the supposed maximum cardinality. However, if you want arbitrary quantile estimations to within 1% or 2%, you need hundreds of buckets. See my previous post about using Log Linear histograms with Prometheus.
Prometheus can handle millions of metrics, but think about using a couple of histograms with 100 buckets each, per REST API endpoint and per status code, in a containerized application with 300 instances in the cluster. Suddenly you have a million time series and scaling issues with your Prometheus service.
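The arithmetic is worth spelling out. With illustrative numbers (the endpoint and status code counts here are assumptions, not from any real deployment):

```python
# Back-of-the-envelope series count for an accurate-histogram setup.
histograms_per_endpoint = 2
buckets_per_histogram = 100   # needed for ~1-2% quantile accuracy
endpoints = 5
status_codes = 4              # e.g. 200, 400, 404, 500
instances = 300

series = (histograms_per_endpoint * buckets_per_histogram
          * endpoints * status_codes * instances)
print(series)                 # 1200000 -- already past a million
```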
The advice that the Prometheus documentation gives is to set bucket boundaries at SLA points or other important numbers. This gives you accurate information about whether your quantile value exceeds your SLA or not. However, the accuracy of the quantile estimation itself is, at best, misleading. At worst, it is off by more than a couple hundred percent.
So the choice here is between stability of the metrics platform or accuracy of quantile estimations. Is there a compromise? Is this better than using StatsD and taking the quantile of quantiles? (That should anger the math nerds out there!)
The Lack of Scrape Atomicity
The scrape operation that Prometheus uses to ingest data from a client has no atomicity guarantees. So, when recording rules or graph expressions execute, they may well operate on partially ingested data from clients. This isn’t very visible when working with individual counters and gauges, but the ramifications on the compound metric types, like histograms, are immense. In fact, when you take the rate(histogram_bucket[1m]), the operation will evaluate histogram metric types that are, randomly, only partially updated with observation counts.
How bad can this be? Take a look at the empirical cumulative distribution function. This shows the ratio of observations that have been observed within $x$ seconds. The following is the CDF function from the example latency data above.
This is very much related to the cumulative histogram. One of the most important properties of how this function describes a sample distribution is that it must always monotonically increase. It can never decrease. You simply cannot have seen fewer observations by waiting longer for them.
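A minimal ECDF sketch makes the monotonicity property concrete. The latency values are hypothetical:

```python
# ECDF: the fraction of observations <= x. It can never decrease as x grows.
observations = sorted([0.3, 0.4, 0.7, 1.2, 1.6, 3.0])
n = len(observations)

def ecdf(x):
    # Fraction of observations less than or equal to x.
    return sum(1 for obs in observations if obs <= x) / n

# Sample the function from 0.0s to 3.4s in 0.1s steps.
points = [ecdf(x / 10) for x in range(0, 35)]

# Monotonicity check: every step is >= the previous one.
assert all(a <= b for a, b in zip(points, points[1:]))
print(ecdf(1.0))   # 3 of 6 observations are <= 1.0 -> 0.5
```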
A CDF function can be approximated with Prometheus data by querying for:
sum(rate(http_request_latency_bucket[1m])) by (le)
Next use a bit of Python or R to graph the le label as $x$ vs the value. This will produce a very similar CDF plot. These examples are of a different distribution than the examples above, but this is actual data that occurred in practice. For example:
Okay, not pretty but normal. The red line indicates the 95th percentile. But back to our problem. With the lack of atomicity of scrapes, and, therefore, an expression evaluator exposed to in-flux data, very fun things are produced.
Now things start to come apart at the seams. Where the red line intersects the CDF plot is how we locate the histogram bucket containing the 95th percentile, as that bucket contains 0.95 times the observation count. Multiple buckets seem to meet this criterion.
To make matters worse, Prometheus uses a binary search algorithm (because the data monotonically increases, right?) to find the correct bucket holding the $q$ percentile. As this search jumps around to efficiently search the array, it could return any one of the buckets that match this criterion. This shows itself on your graphs as large spikes in your percentile data, probably up into the highest boundary you have configured. This obscures the real values/trend of the percentile data, and indicates a false problem.
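A small sketch of why this bites, with made-up counts. A racy scrape can return cumulative buckets where a lower bucket has already counted in-flight observations that the higher buckets have not:

```python
import bisect

# Cumulative buckets for bounds [0.5, 1.0, 1.5, 2.0, +Inf].
# "healthy" is a consistent snapshot; "partial" is what a racy scrape can
# return: a burst of ~0.8s observations reached the le=1.0 bucket but not
# the higher ones yet. All numbers are hypothetical.
healthy = [2, 3, 4, 5, 6]
partial = [2, 6, 4, 5, 6]     # dips after index 1: no longer monotone

rank = 0.95 * partial[-1]     # observation count at the 95th percentile

# On healthy data exactly one bucket first covers the rank:
assert bisect.bisect_left(healthy, rank) == 4

# On partial data, more than one bucket appears to cover the rank.
# A linear scan finds the first such bucket...
linear = next(i for i, c in enumerate(partial) if c >= rank)
# ...while binary search, which assumes monotone data, lands elsewhere.
binary = bisect.bisect_left(partial, rank)
print(linear, binary)         # 1 4
```

Depending on where the dip falls, the reported percentile can jump between wildly different buckets from one evaluation to the next.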
Recording Rules and Federation
Federation is a Prometheus technique to share metric data from one Prometheus server to another. I use it as part of our long-term storage path for important metric data. It also offers some stability for dashboards when the local Prometheus server is an ephemeral Docker / Mesos job. Federation suffers from this lack of atomicity too. So, if you are querying for histogram quantile estimations after the federation step, you have two levels corrupting your histogram data.
As Operations folks, we like to avoid duplication of work, so a usage pattern we developed was to store the rate of the histogram in a recording rule. We could then reference that recording rule in the following recording rules that make use of histogram_quantile() to build up the percentiles desired. This ensured that when we hit atomicity problems, all of our percentiles were corrupted. We’ve stopped doing this, and now we don’t lose all of our percentile calculations at the same time.
The Prometheus folks are discussing these atomicity issues. Other TSDB systems I know of lack similar atomicity guarantees as well, although the ones I am familiar with also do not support compound metric types like Prometheus’s Histogram and Summary types.
We have traditionally avoided these issues by using a StatsD-like approach. Raw data is stored temporarily until a configurable time window expires, and then summary metrics are generated and stored in the TSDB. This produces exact percentiles, means, and other summary statistics. We lose some power of aggregation, which is normally overcome by writing to the same StatsD key rather than a key per application instance. But this sets the bar pretty high.
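That StatsD-style approach can be sketched in a few lines. The class name and the nearest-rank percentile convention here are my own illustration, not any particular StatsD implementation:

```python
# Buffer raw observations until a time window expires, then flush exact
# summary statistics computed from the raw data to the TSDB.
class WindowSummarizer:
    def __init__(self):
        self.samples = []

    def observe(self, value):
        self.samples.append(value)

    def flush(self):
        """Called when the window expires; returns exact summaries
        (nearest-rank percentile) and resets the buffer."""
        data = sorted(self.samples)
        self.samples = []
        n = len(data)
        return {
            "count": n,
            "mean": sum(data) / n,
            "p95": data[max(0, int(0.95 * n) - 1)],
        }

w = WindowSummarizer()
for latency in [0.3, 0.4, 0.7, 1.2, 1.6, 3.0]:
    w.observe(latency)
summary = w.flush()
print(summary)
```

Because the percentile is computed from the raw samples, it is exact for the window, at the cost of holding the raw data in memory until the flush.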
Prometheus has no usable solution for dealing with StatsD-like event data:
- Small histograms produce wildly inaccurate percentile and median estimates.
- Histograms designed for accuracy quickly scale beyond what a Prometheus server can handle.
- Histograms (Summary types too) are potentially always in an invalid or racy state that produce completely erroneous percentile estimates. The more buckets used the more likely one is to hit this problem.
- Federation, used to store data for Grafana dashboards from ephemeral Prometheus nodes, makes these problems worse.
- Summaries produce percentiles that are not aggregatable and this cannot be worked around in a similar fashion to what is commonly done with StatsD.
- Prometheus has no built-in way of visualizing the entire distribution as represented in the histogram.
To help with these problems a colleague and I wrote a PR for Prometheus’s quantile estimation algorithm. This code forces the buckets in the histogram to be monotonically increasing. It has been accepted and is in Prometheus versions 1.6.0 and later. This dampens the effects of a partially scraped histogram, but cannot do much for the accuracy of the percentile estimations. It’s a small step toward a better solution, but the issues here stem from fundamental design problems.
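The idea behind the fix is simple: before searching, force the cumulative counts to be non-decreasing with a running maximum. This standalone sketch captures the spirit of the change; the real Prometheus code operates on its internal bucket structures:

```python
def ensure_monotonic(cumulative):
    """Repair a racy cumulative histogram by taking a running maximum,
    so binary search sees the sorted data it assumes."""
    fixed, running = [], 0
    for count in cumulative:
        running = max(running, count)
        fixed.append(running)
    return fixed

partial = [2, 6, 4, 5, 6]          # racy scrape: dips mid-array
print(ensure_monotonic(partial))   # [2, 6, 6, 6, 6]
```

Note this only makes the search well-defined again; the repaired counts are still not the true counts, which is why the accuracy problem remains.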
The Prometheus authors even say to use histograms “sparingly,” but they are actually useless as they are implemented. Summary types are not much better. This leaves us with a TSDB that only operates well with Counter and Gauge type metric data, which is where we are with Graphite. Graphite at least has StatsD. Histograms are an amazingly powerful way of working with event based metrics. To discover, the hard way, that Prometheus doesn’t scale to handle accurate histograms, that compound metrics are non-functional, and that event based metrics have no solid solution makes Prometheus a poor choice for a TSDB.