One of the killer features of Prometheus is its native support for histograms. The move toward supporting and using histograms in the metrics and monitoring communities has been, frankly, revolutionary. I know I don’t want to look back. If you are still relying on aggregating pre-computed means and percentiles (say, from StatsD) to visualize information like service latencies, you are making decisions based on lies.

I wanted to dig into Prometheus’ use of histograms. I found some good, some bad, and some ugly. Also, some potential best practices that will help you achieve better accuracy in quantile estimations.

# Graphite Latencies

To run these experiments, I needed some test data. I pulled out of ELK the render times for my Graphite cluster. These are the latencies for 10,000 Graphite queries (time to build a graph or return JSON results) in seconds. Prometheus doesn’t have good support for heatmaps or otherwise visualizing the raw histogram so I’ve tossed the data into R. Let’s assume for illustration purposes that this represents 1 minute’s worth of requests.

This isn’t very visually appealing. The latency data (as you might imagine) clusters near zero. There is a very long tail, and most of it falls to the right of the graphed area, which was cropped to keep the histogram readable. Each bin in the histogram is 0.2 seconds wide.

This is our very basic statistical analysis. Here $\mu$ (mean or average) is 0.2902809, $\sigma$ (standard deviation) is 0.7330352, and $q(.95)$ is 1.452352. These are the actual values as calculated by R on the raw data set.

# Prometheus Quantile Estimation

Calculating a quantile from a histogram is really an estimation, where the error depends on the granularity of the histogram’s bin widths. Because the data in a histogram is naturally ordered, you know exactly which bin contains an arbitrary quantile. Prometheus (like many other tools, as it’s about the only method we have) then estimates the correct value by doing linear approximation over the selected bin.

Out of 10,000 samples, the 9,501st falls into the 8th bucket. The 8th bucket holds 368 samples, and the 9,501st sample is the 93rd sample within that bucket.

Around line 107 in promql/quantile.go in the Prometheus source code we find the actual code that does the approximation:

```go
return bucketStart + (bucketEnd-bucketStart)*float64(rank/count)
```


Doing this on paper:

$$(7 * 0.2) + 0.2 * \frac{93}{368} = 1.45054348$$

This gives an error of 0.12% from our known correct value. Pretty good.
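The bucket walk and interpolation can be sketched in a few lines of Python. This is a hypothetical helper, not the actual Prometheus implementation, and it takes per-bucket counts for simplicity:

```python
def estimate_quantile(q, upper_bounds, counts):
    """Estimate quantile q from a histogram.

    upper_bounds[i] is the upper edge of bin i (the lower edge is the
    previous bound, or 0); counts[i] is that bin's observation count.
    """
    total = sum(counts)
    rank = q * total  # target rank among all observations
    seen = 0
    for i, count in enumerate(counts):
        if seen + count >= rank:
            bucket_start = upper_bounds[i - 1] if i > 0 else 0.0
            bucket_end = upper_bounds[i]
            # linear approximation inside the selected bin,
            # mirroring the line from promql/quantile.go
            return bucket_start + (bucket_end - bucket_start) * (rank - seen) / count
        seen += count
    return upper_bounds[-1]
```

Prometheus itself works on cumulative `le` buckets rather than per-bucket counts, but the interpolation step is the same.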

# What Went Right?

This might give the impression that Prometheus’ quantile estimation isn’t horrible, and it isn’t. But a number of factors were working in our favor when modeling this with R.

1. R makes it easy to build a model from the data after the fact. Prometheus requires us to commit to a data model first. How do you decide the bin widths?
2. The bin widths were small compared to the standard deviation.
3. Someone typed in 144 bin boundaries and magically knew to stop at 29 seconds.

# What Might Have Gone Horribly Wrong?

The Prometheus folks generally advise choosing relatively small bin widths covering the expected range, and making sure your SLA is one of the boundaries. These are pretty vague guidelines.

Here the bin boundaries are ${.05, .1, .5, 1, 5, 10, 50, 100}$, and $q(.95)$ is sample 711 of the 1,187 in bin 5.

This matches my common experience coaching folks to get the best data into and out of Prometheus. It’s very common to set up just a few bin boundaries: several where the bulk of the data points are expected, and a few wider bins to encompass the tail. This doesn’t even look like it will produce an accurate estimation, and it doesn’t.

On paper:

$$1 + 4 * \frac{711}{1187} = 3.39595619$$

This produces an estimate that is 234% of the true value, an error of 134%. ‘nough said.
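Working through the coarse-bin arithmetic in Python, using the numbers above:

```python
# q(.95) falls in the bin spanning (1, 5], as the 711th of its 1,187 samples
bucket_start, bucket_end = 1.0, 5.0
estimate = bucket_start + (bucket_end - bucket_start) * 711 / 1187
true_q95 = 1.452352  # the value R computed from the raw data

print(round(estimate, 4))             # 3.396
print(round(estimate / true_q95, 2))  # 2.34: the estimate is ~2.34x the true value
```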

# Log-Linear Histograms

Other histogram implementations use a “log-linear” approach to bin widths, notably HdrHistogram and Circonus’s implementation. The direct advantage is that bin widths do not need to be specified by the user; they are chosen automatically from a fixed number of significant digits plus an exponent. Two significant digits achieves a pretty high level of accuracy, and is what the Circonus implementation uses. The downside is that the number of bins we must keep up with is variable.

Example: If we observe the number 1,234 and add it to a histogram we would increment the total number of observations in the bin defined as $1.2 \times 10^{3}$. This implies that all observations counted in the same bin are within 1%, so this style of histogram will produce quantile estimations that have a worst case error of 1%.
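A sketch of that mapping in Python (`loglinear_bin` is a hypothetical name; real implementations such as Circonus’s do this with careful integer arithmetic rather than floating point):

```python
import math

def loglinear_bin(x):
    """Map a positive value to its two-significant-digit log-linear bin.

    Returns (mantissa, exponent): the bin's lower edge is
    mantissa * 10**(exponent - 1), with mantissa in 10..99.
    """
    exponent = math.floor(math.log10(x))
    mantissa = math.floor(x / 10 ** exponent * 10)  # keep two leading digits
    return mantissa, exponent

loglinear_bin(1234)  # (12, 3): the bin whose lower edge is 1.2e3
```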

This is a visualization of a Log-Linear histogram. It is a logarithmic representation of multiple linear histograms. Each bin with the same exponent has the same width (which is why they appear to become smaller in this logarithmic plot). The latency data is also much better visualized in a logarithmic style. For example, you can tell now that there are multiple modes.

Let’s work our quantile estimation algorithm again and see how it stacks up. $q(.95)$ is now in the 365th bucket as the 85th of 178 samples. R still calculates the raw $q$ value to be $10^{0.1620718} = 1.452352$. Linear approximation gives us:

$$0.146128036 + (0.176091259 - 0.146128036) * \frac{85}{178} = 0.1604363$$ $$10^{0.1604363} = 1.446893$$

The error is 0.4%. Not quite as good as the first “perfect” example, but this is achievable without knowing bin boundaries or much else about the data beforehand. Also, these histograms are perfectly aggregatable because they always use consistent bin widths. (You cannot aggregate histograms in Prometheus if the boundaries differ.)
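The log-space interpolation above can be checked in a couple of lines, using the numbers from this example:

```python
import math

# q(.95) is the 85th of 178 samples in the bin spanning [1.4, 1.5);
# interpolate in log10 space, then convert back to a latency
lo, hi = math.log10(1.4), math.log10(1.5)
estimate = 10 ** (lo + (hi - lo) * 85 / 178)

print(round(estimate, 4))  # 1.4469
```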

# Prometheus and Accurate Quantile Estimations

So, how do we use this information to make better use of histograms in Prometheus? The challenge here is that Prometheus represents histograms in a fixed number of bins – each of which is simply a counter type metric. We cannot arbitrarily add new bins.

What do we know about the data our application will observe into a Prometheus histogram metric? I bet we intuitively have a much better idea of the range, or orders of magnitude, of that data than of its distribution. So I suggest to my users that they use an algorithm to produce the list of histogram bin boundaries rather than hard coding some best-guess boundaries.

```python
def logLinearBuckets(minExp, maxExp):
    # 90 boundaries per decade: mantissas 1.0 .. 9.9 at each order of magnitude
    return [d * 10 ** (e - 1) for e in range(minExp, maxExp + 1) for d in range(10, 100)]
```


Plug in the exponents to represent the range of orders of magnitude for your histogram. The above example used a range of -4 to 2. Full Python source is available here with more docs and unit tests.
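For example, covering $10^{-4}$ through roughly $10^{3}$ seconds (this usage sketch repeats the function above; the boundary list can then be handed to your client library’s histogram constructor, e.g. the `buckets` argument of `prometheus_client`’s `Histogram`):

```python
def logLinearBuckets(minExp, maxExp):
    return [d * 10 ** (e - 1) for e in range(minExp, maxExp + 1) for d in range(10, 100)]

buckets = logLinearBuckets(-4, 2)

len(buckets)   # 7 decades x 90 = 630 boundaries (the +Inf bucket makes 631 bins)
buckets[0]     # 1e-4
buckets[-1]    # 990
```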

# Closing Thoughts

The final log-linear histogram would be represented in Prometheus with 631 bins, which is far more than common usage. I’ve been asked how this impacts performance and, yes, there is a cost. But we are still well within the range of expectations of the Prometheus authors, so the very little bit of extra storage and query compute time is well worth knowing the error of our quantile estimations up front.

The first “perfect” example would be represented in Prometheus with 145 bins, assuming we knew to stop at 29. If we did not know where to stop and covered the range 0 - 100, then 500 bins would be needed. This again shows that the log-linear method uses a very similar amount of resources compared to what a histogram might be hand-tuned to after much trial and error in the name of accuracy.

I’d like to propose this method as a best practice for generating bin boundaries for Prometheus histograms. It’s no more expensive than what one would gravitate toward for better accuracy, and it functions with very little knowledge of the data before it is observed. Finally, you know the error of your quantile estimations up front.