Calculating the Error of Quantile Estimation with Histograms
I’ve posted about Histograms before. They fascinate me with the prowess they bring to recording telemetry about high-volume events. Their on-disk representations can be quite small, yet they can produce a slew of highly accurate summary statistics. Histograms are both robust and aggregatable data structures, and it’s rare to have a data type that combines both of those features in the field of Observability, especially with metrics. Yet they are often misunderstood and poorly implemented in time series databases. Even so, my fascination continues.
Debated here previously is the error involved when we use a Histogram to estimate a Quantile or Percentile. This is really the measure that matters. Just because your Observability solution says it’s using Histograms doesn’t mean it’s using the same algorithms or producing high quality quantile estimates. Being able to tell the good from the bad makes you more aware of the quality of your data, and higher quality data leads your business to better decisions.
Here I’m going to spend some time talking about how Histograms are used to estimate Quantiles and what the expected error can be. Let’s look at a few definitions just to get started.
I’m going to use the Estimation Error to judge these algorithms. It is, by far, the simplest to understand and work with: expressed as a percentage of the true value, it’s really just how far we missed the mark, or how far we could potentially be wrong.
$$ Error = \frac{\mid TrueValue - Estimation \mid}{TrueValue} $$
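The formula above is a one-liner in code. Here is a minimal sketch (the function name and example values are my own, chosen for illustration):

```python
def estimation_error(true_value, estimate):
    """Relative estimation error, expressed as a fraction of the true value."""
    return abs(true_value - estimate) / true_value

# If the true Q(0.95) is 1.00s and the histogram estimates 1.08s,
# the estimation error is roughly 0.08, i.e. 8%.
print(estimation_error(1.00, 1.08))
```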
The Histogram
A Histogram represents a set, $S$, of sample values. These values are categorized into bins that represent a range of possible values. The bin’s value is the count of the number of samples it contains. Bin ranges monotonically increase, which ensures that all sample values in bin $b$ are less than sample values in bin $b+1$. Therefore, bins are ordered. The following example has bins representing every 0.2 seconds.
Within a specific bin the actual sample values are unknown. However, it is known that there are $n$ samples as that is the value of the bin. The lower and upper bounding values are known as well. Implementations vary here on how the bounding values are handled, but this doesn’t have much impact on the quantile error. In most cases the subset of $S$ which is categorized into bin $b$ can be defined as the following. All values $x$ are included in the bin if they are greater than or equal to the lower bound and less than the upper bound.
$$ \{ x \mid LowerBound \le x \lt UpperBound \} $$
Further, the values in $b$ can be represented as an ordered set even if they are not observed in this order.
$$ \{ x_1 \le x_2 \le x_3 \le \cdots \le x_{n-1} \le x_n \} $$
Therefore, we can operate on the entire set $S$ that makes up all the data recorded in the histogram as an ordered set. When calculating a Quantile, Percentile, Median, or any summary statistic in that family, the first task is ensuring the set of sample values is ordered.
Defining Quantiles
We define a Quantile, or $Q(p)$, of set $S$ with the length of $S$ known as $n$ as a cut point value where $p \cdot n$ values of the set are less than or equal to the cut point value. For an ordered set of $n = 100$ the cut point value representing $Q(0.95)$ is the value found at the 95th element in the set. But what happens when $p \cdot n$ has a fractional part? Say $n = 107$ and we need to find the 101.65th element?
There are at least 9 different algorithms for computing Quantiles on a sample set commonly used in Data Science and telemetry calculations. The 1996 paper Sample Quantiles in Statistical Packages by Hyndman and Fan explores these 9 types. It’s a really concise paper as well. Fans of the R language might recognize the authors as the same people behind R’s quantile function.
As we explore Histograms used with telemetry platforms, I’m going to stick with Definition 1, sometimes referred to as Type 1 Quantiles. Here, when the index falls between two elements, as with $n = 107$, we return the value of the 102nd element to satisfy $Q(0.95)$. This is the ceiling of $p \cdot n$. Using $k$ to denote the $k$th sample in $S$:
$$ \lceil p \cdot n \rceil = k$$
For the example with 107 elements in the set, the location of the sample value returned for $Q(0.95)$ is as follows.
$$ \lceil 0.95 \cdot 107 \rceil = 102$$
$$ Q(0.95) = S[102]$$
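The Type 1 definition above can be sketched in a few lines of Python. This is my own illustration, not any platform’s implementation; the guard for $p = 0$ is an assumption to keep the index valid:

```python
import math

def type1_quantile(samples, p):
    """Hyndman & Fan Type 1 quantile: the sample at 1-based rank ceil(p * n)."""
    s = sorted(samples)
    n = len(s)
    k = max(math.ceil(p * n), 1)  # guard against p = 0
    return s[k - 1]

# With n = 107, Q(0.95) is the 102nd smallest sample: ceil(0.95 * 107) = 102.
data = list(range(1, 108))         # the values 1..107, distinct for clarity
print(type1_quantile(data, 0.95))  # 102
```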
Estimating Quantiles and Bin Width
With this we can show which sample and which bin reflects an arbitrary quantile in a histogram. No knowledge of the actual sample values is needed! This includes the set minimum, maximum, and median, as they are all “special” quantiles (0, 1, and 0.5 respectively). This gives a range of valid guesses or estimates for arbitrary quantile values of the set. If the histogram bins are set using Service Level Objective values, we can prove whether an arbitrary quantile is less than or greater than a given SLO threshold. Obviously, bin widths matter. The more bins the histogram has, the smaller the range of valid guesses or estimates will be. Whether these boundaries are generated for us or specified by us is implementation specific. This plays a really big role in controlling the error of quantile estimations.
The next question is what interesting statistical tricks can we play to narrow down the error of the estimate within a bin. The initial Graphite Retrieval Times histogram above has a $Q(0.95)$ at the 9,501st sample, which is in the 8th bin, and is the 93rd sample out of the 368 samples in that bin. The method of choice here is, again, implementation specific and does a lot to fine tune the expected error.
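Finding the target bin is just a walk over cumulative bin counts. The sketch below uses hypothetical bin counts of my own invention (not the actual Graphite data), chosen so the totals reproduce the example in the text:

```python
import math

def locate_quantile_bin(bin_counts, p):
    """Return (bin_index, rank_in_bin) of the Type 1 quantile sample.

    bin_counts: per-bin sample counts, in bin order.
    Both returned values are 1-based, matching the prose above.
    """
    n = sum(bin_counts)
    k = max(math.ceil(p * n), 1)  # global 1-based rank of the quantile sample
    cumulative = 0
    for i, count in enumerate(bin_counts, start=1):
        if cumulative + count >= k:
            return i, k - cumulative
        cumulative += count
    raise ValueError("p out of range")

# Hypothetical counts for a 10-bin histogram of 10,001 samples.
counts = [4000, 2500, 1400, 700, 400, 250, 158, 368, 200, 25]
print(locate_quantile_bin(counts, 0.95))  # (8, 93): the 9,501st sample
                                          # is the 93rd of 368 in the 8th bin
```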
Statistically speaking, the worst case for error happens when bins are sparsely packed, and specifically when there’s only 1 sample (the quantile value we seek) in the target bin. Depending on the algorithm that generates bin boundaries, we can pretty safely assume that the more samples are present in the bin, the smaller the estimation error will be. The sample values simply get closer together as they form their own distribution that we can model across the bin width. A single sample could be very close to the upper or lower bound of the bin, and the estimation algorithm may choose the other extreme!
A slight tangent here. A common question I’ve seen is how to estimate the average of the samples in a histogram. I’ll caveat with the fact that some implementations of histograms for telemetry will also record the sum of all observations. In that case, it is simple to divide that sum by the count of all samples to get an exact average. However, some implementations do not keep track of the sum, and the average must be estimated. Again, this assumes an understanding of the distribution of values throughout each bin. Assuming values are distributed symmetrically around each bin’s midpoint, we can take the midpoint of each bin and multiply by the bin count, sum this across all bins, and finally divide by the total sample count.
$$ \frac{\sum_{i=1}^{bins} MidPoint_i \cdot Count_i}{\sum_{i=1}^{bins} Count_i} $$
Remember, averages can be misleading!
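The midpoint-weighted average above translates directly to code. A minimal sketch with made-up latency bins (the `(lower, upper, count)` tuple representation is my own convention, not any particular platform’s):

```python
def estimated_mean(bins):
    """Estimate the mean from (lower, upper, count) bins using bin midpoints.

    Assumes samples sit symmetrically around each bin's midpoint;
    implementations that track an exact sum don't need this.
    """
    total = sum(count for _, _, count in bins)
    weighted = sum((lower + upper) / 2 * count for lower, upper, count in bins)
    return weighted / total

# Hypothetical latency histogram with 0.2s-wide bins.
bins = [(0.0, 0.2, 10), (0.2, 0.4, 30), (0.4, 0.6, 40), (0.6, 0.8, 20)]
print(estimated_mean(bins))  # approximately 0.44
```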
Expected Error of Various Histogram Implementations
Wow, that was a lot of backstory. I bet you came here because you just wanted to see how accurate histograms can be and how some of the modern metric-based platforms stack up. Well, here we are. For each platform, I’m going to cover several specific points. Hopefully, this helps make a more apples-to-apples comparison.
- Algorithm for choosing bin widths
- Maximum estimation error
- Distribution used to model samples within a bin
- Expected estimation error
- Support for exact averages
Let’s set up some common variables here.
| Variable | Description |
|----------|-------------|
| $p$ | Quantile value: $0 \le p \le 1$ |
| $n$ | Count of samples or observations in a histogram bin. |
| $k$ | The $k$th sample in the bin, where $k \ge 1$. |
| $lower$ | The bin lower bound. |
| $upper$ | The bin upper bound. |
| $width$ | The bin width: $upper - lower$ |
Prometheus
I really like Prometheus a lot. It’s an incredibly powerful tool. However, as you may guess from reading this blog, it doesn’t win a lot of points here.

**Algorithm for choosing bin widths.** Histogram bin sizes are chosen by the developer that integrates Prometheus instrumentation with an application. They can set whatever bucket values they like. Usually a human isn’t going to type out more than 10 or so, and using more than that can significantly increase the time series footprint in Prometheus. Humans usually don’t choose great boundaries, so quantiles aren’t accurate at all. But we can tell if a quantile is greater than SLO thresholds if those are used as bin boundaries. I no longer recommend Prometheus’s histogram implementation for SLO style measurements.

**Maximum estimation error.** All of it. Maximum estimation error is not limited or controlled. This is the problem.

**Distribution used to model samples within a bin.** Prometheus uses a linear distribution model to interpolate the location of the sample value in the bin. Samples are assumed to be equally distant from each other. The worst case is one sample in the bin: the interpolated value will be the upper bound of the bucket, leaving the full bucket width as possible error spread.
$$ Q(p) = lower + width \frac{k}{n} $$
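The interpolation formula above is straightforward to sketch. This is my own illustration of the linear model, not Prometheus’s actual code:

```python
def linear_bin_quantile(lower, width, k, n):
    """Linear interpolation within a bin: samples are assumed evenly
    spaced, so the k-th of n samples sits at fraction k/n of the width."""
    return lower + width * k / n

# One sample alone in the [0.8s, 1.0s) bin interpolates to the upper bound,
# so the estimate can be off by a full bin width.
print(linear_bin_quantile(0.8, 0.2, 1, 1))  # 1.0
```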

**Expected estimation error.** Uncontrolled.

**Support for exact averages.** Prometheus does keep a sum of all samples, so exact averages can be found. This can be the saving grace when histograms are misconfigured, as the average will at least let you know the order of magnitude your data lives in when it’s outside the histogram bins.
Circonus
Circonus is a paid product, although they have quite a lot of code under a BSD license on GitHub. Using excellent data structures and highly accurate math is where Circonus shines. They are the gold standard here.

**Algorithm for choosing bin widths.** A loglinear algorithm automatically creates bins on the fly. This is very similar to HDRHistogram set to 2 significant figures. Empty bins are omitted from the data structure and no tuning by the developer is needed. Each base 10 order of magnitude has 90 bins, including ones that may have zero samples and are thus omitted from the data structure.

**Maximum estimation error.** As the loglinear algorithm here uses 2 significant figures for bucketing, there is nominally a 10% maximum estimation error. This isn’t seen in practice, and it would take an exceedingly contrived data set to push the estimation error over 5%.

**Distribution used to model samples within a bin.** Circonus uses a Beta Distribution to model sample data within the bin. This is defined as $Beta(k, n+1-k)$.
$$ Q(p) = lower + width \frac{k}{n+1} $$
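The formula above follows from the mean of $Beta(k, n+1-k)$, which is $k/(n+1)$: the expected position of the $k$th of $n$ samples. A minimal sketch of my own (not Circonus’s actual code):

```python
def beta_bin_quantile(lower, width, k, n):
    """Beta-distribution (midrank) estimate: the expected position of the
    k-th of n samples is k/(n+1) of the way through the bin."""
    return lower + width * k / (n + 1)

# A single sample in the [0.8s, 1.0s) bin is estimated at the bin
# midpoint (~0.9s), capping the error at half a bin width.
print(beta_bin_quantile(0.8, 0.2, 1, 1))
```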

**Expected estimation error.** The worst case, again, is when the bin contains only one sample. The Beta Distribution will then pick the bin midpoint as the value of the estimate. This sets the expected error at < 5%. With more samples in the bin, the probabilities are that they will be closer to the estimation. With a reasonably sized sample set in the bin, this can produce really accurate estimations.

**Support for exact averages.** Sums of observations are not kept, but the data structure can generate an estimated average using the same algorithm as above.
Victoria Metrics
Victoria Metrics has expanded PromQL slightly and improved histogram usability while doing so. Of course, only the Victoria Metrics client library can generate these histograms, and the Victoria Metrics telemetry engine is required to parse them and build quantile estimates from the modified data structure. However, the query is what you’d expect from PromQL.
**Algorithm for choosing bin widths.** It’s a loglinear algorithm where each order of magnitude is divided into 18 evenly spaced bins. The bin boundaries follow this geometric pattern.
$$ \{ 1.0 \times 10^x, 1.5 \times 10^x, 2.0 \times 10^x, 2.5 \times 10^x, \cdots, 9.0 \times 10^x, 9.5 \times 10^x, 1.0 \times 10^{x+1} \} $$
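The boundary pattern above is easy to generate, and it makes the worst case concrete: the first bin of each decade spans $[1.0, 1.5)$, so a value near the lower bound estimated at the upper bound is off by 50%. A sketch of my own (the function name and rounding are assumptions, not Victoria Metrics code):

```python
def vm_style_buckets(exponent):
    """Boundaries for one base-10 decade split into 18 evenly spaced bins:
    1.0e x, 1.5e x, 2.0e x, ..., 9.5e x, 1.0e(x+1)."""
    scale = 10.0 ** exponent
    return [round((1.0 + 0.5 * i) * scale, 12) for i in range(19)]

buckets = vm_style_buckets(0)
print(buckets[:4])  # [1.0, 1.5, 2.0, 2.5]

# Worst case inside the first bin [1.0, 1.5): true value 1.0 but the
# estimate is pushed to 1.5 -> |1.0 - 1.5| / 1.0 = 50% error.
lower, upper = buckets[0], buckets[1]
print(abs(lower - upper) / lower)  # 0.5
```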

**Maximum estimation error.** The maximum Estimation Error inside a bin is 50%.

**Distribution used to model samples within a bin.** Like Prometheus, Victoria Metrics assumes a linear distribution of samples in each bin. The code is functionally equivalent here. I will note that they convert the modified histogram structure to the standard `le` based structure and then use the same quantile estimation algorithm.

```go
vv := lePrev + (le-lePrev)*(vReq-vPrev)/(v-vPrev)
```

**Expected estimation error.** The distribution model still allows the worst case of one sample in the bin to return an estimation at either the lower or upper bound of the bin, so the estimate could be off by an entire bin width. The worst case is still 50%, which is miles better than Prometheus’s native histogram type. Because bin widths are handled by a loglinear style algorithm, in practice the error should be much lower, especially when bins have many samples or are not sparse.

**Support for exact averages.** Like Prometheus, the sum of all samples is tracked, and this structure can produce exact averages of the data in the histogram.
Final Thoughts
I really appreciate the Victoria Metrics folks pushing on the Prometheus ecosystem to start fixing the histogram implementation problems. There’s a lot of credit to be given there. Also, their solution should be generally useful without having to plan for Prometheus’s caveats. With enough samples I would think seeing a 10% error or less would be normal.
Are there Open Source (or even commercial) products that I’ve left out? I know OpenTSDB has histogram support and TDigests. However, I have not had the opportunity (yet) to master this tech stack. Wavefront also implements histograms reasonably well using TDigests. They are another SaaS solution for metric data that I would highly recommend due to their focus on good mathematics.
I took a lot of math courses in college and really enjoyed them. That’s no surprise if you are still reading. But I was only required to take 1 Statistics course for my Engineering degree. I had no idea that I’d spend so much of my professional career in Statistics. That said, I’m still learning. The conclusion I came to after doing this research is that the Estimation Error really isn’t a great description of error for these sketches. Perhaps a more advanced understanding here would help me better master these concepts. Knowing the level of error in these sketches is critical to good decision making, and this error can be difficult to reason about.
If you have suggestions, please leave them below. If you would like this kind of expertise on your team, please send me an email.