Prometheus Histograms Part 3: Using Something Else
Telemetry and monitoring with a Time Series Database (TSDB) is a set of 3 overlapping problem spaces. Those are:
- Ingestion and Retrieval of Raw Data Points – Operational Intelligence
- Aggregation and Roll Ups of Data for Long Term Retention and Trending – Business Intelligence
- Efficient Storage and Retrieval of High Velocity Event Data as a Statistical Model – Histograms
Most existing solutions are really only aware of two of these problems, and their implementation usually focuses heavily on just one of them. This, I believe, is why there are so few products that handle TSDB data at scale and provide meaningful value, and why everyone on the Internet seems to write their own TSDB to solve what another product didn't.
Where does Prometheus land in this space? Prometheus clearly focuses on #1, Operational Intelligence. The documentation and authors are pretty clear about keeping Prometheus's focus narrowly on Operational Intelligence, which means that #2 is specifically not a problem Prometheus aims to solve. Thanos steps in here to solve some of these issues, and there are also a number of commercial solutions that ingest from Prometheus and give some coverage here.
The data model that Prometheus uses (a pull-based model) allows it to handle high velocity data to a degree. The Prometheus client libraries keep Counters in memory and increment them per event, and these summaries of the events can be scraped at a normal scrape interval. Individual events don't create network traffic to report raw data points to Prometheus, which solves the ingestion issue. However, Prometheus is beholden to whatever summary metrics the client presents, which is problematic when clients offer different summary metrics or do the math differently. The result is a pile of apples and oranges.
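To make that model concrete, here is a minimal sketch using the Python prometheus_client library (the myapp_events_total metric and handle_event() function are made up for illustration): the Counter lives in process memory, each event just increments it, and Prometheus scrapes the running total over HTTP on its own schedule.

import time
from prometheus_client import Counter, start_http_server

# The Counter lives in application memory; incrementing it is a local operation.
EVENTS = Counter("myapp_events_total", "Events handled by this process")

def handle_event():
    # ... do the real work for one event ...
    EVENTS.inc()  # no per-event network traffic, just a bump to the in-memory total

if __name__ == "__main__":
    # Prometheus scrapes http://localhost:8000/metrics at its normal interval
    # and sees whatever totals are in memory at scrape time.
    start_http_server(8000)
    while True:
        handle_event()
        time.sleep(1)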
Prometheus takes advantage of its data model to implement some level of support for #3. With clever use of Counter metrics, Prometheus has created compound metrics: the Summary type and the Histogram type. I've blogged about Histograms in Prometheus before (see Part 1 and Part 2), and my findings are basically that Prometheus's Histograms are next to useless, which, as powerful as Histograms can be, is definitely a disappointment. I really hate it when my metrics lie to me (and you should too), and that's what Prometheus's Histograms are doing.
Commenters have asked that I give an update on using Prometheus Histograms, and it's about time I did! What are the goals we are trying to achieve with Histograms? Prometheus only exposes the arithmetic mean of the observations and presents functions to estimate quantiles. We will aim for those same goals by calculating some basic Service Level Objectives (SLOs) that are handy for monitoring and alerting.
First, make sure your SLO values are parameters to your application. If you have used Prometheus Histograms in the past you should already be doing this to set bucket boundaries equal to your SLO values. You can have multiple SLO values too. Don’t forget to export them as metrics! They’ll be handy:
foobar_pipeline_slo_seconds 10
This example states that our service level objective is to handle events in the pipeline within 10 seconds.
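If your application happens to be in Python, exporting that parameter is a couple of lines with prometheus_client. This is a minimal sketch; the SLO_SECONDS variable is just a stand-in for wherever your configuration actually lives, and the value is exported here as a Gauge.

from prometheus_client import Gauge

# Stand-in for a value read from your application's configuration.
SLO_SECONDS = 10.0

# Export the SLO itself so queries and dashboards can reference it later.
SLO = Gauge("foobar_pipeline_slo_seconds", "Pipeline event processing SLO in seconds")
SLO.set(SLO_SECONDS)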
Next, use 4 Counters for the pipeline of events. Track the total number of events, the total number of errors, and the total number of events that took longer than the SLO time to process. Also keep a running sum of the event processing durations in a Counter. You can use labels a touch more freely here, as this example does not have the extra cardinality dimension of a Histogram. Astute readers will note this is still a superset of a Summary metric.
foobar_pipeline_events_total{handler="foo", client="Chrome"} 42
foobar_pipeline_events_errors_total{handler="foo", client="Chrome"} 3
foobar_pipeline_events_over_slo_total{handler="foo", client="Chrome"} 1
foobar_pipeline_events_duration_seconds{handler="foo", client="Chrome"} 138.4
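Wiring those four Counters up in Python might look like the sketch below. The record_event() helper and its arguments are illustrative, not part of any library, and note that the Python client appends _total to Counter names, so the duration sum would actually be exposed as foobar_pipeline_events_duration_seconds_total.

from prometheus_client import Counter

LABELS = ["handler", "client"]

EVENTS_TOTAL = Counter("foobar_pipeline_events_total",
                       "Pipeline events processed", LABELS)
EVENTS_ERRORS = Counter("foobar_pipeline_events_errors_total",
                        "Pipeline events that ended in an error", LABELS)
EVENTS_OVER_SLO = Counter("foobar_pipeline_events_over_slo_total",
                          "Pipeline events that took longer than the SLO", LABELS)
# The Python client exposes this Counter as foobar_pipeline_events_duration_seconds_total;
# adjust the metric name here or in your queries to taste.
EVENTS_DURATION = Counter("foobar_pipeline_events_duration_seconds",
                          "Running sum of pipeline event processing time", LABELS)

def record_event(handler, client, duration_seconds, error, slo_seconds=10.0):
    # Record one processed event against all four Counters.
    labels = {"handler": handler, "client": client}
    EVENTS_TOTAL.labels(**labels).inc()
    EVENTS_DURATION.labels(**labels).inc(duration_seconds)  # running sum of durations
    if error:
        EVENTS_ERRORS.labels(**labels).inc()
    if duration_seconds > slo_seconds:
        EVENTS_OVER_SLO.labels(**labels).inc()

Call record_event() once per pipeline event, passing the measured duration and whether it failed.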
Your application is now instrumented! Let’s build an aggregated average like what a Histogram would give us.
sum(rate(foobar_pipeline_events_duration_seconds[5m]))
/
sum(rate(foobar_pipeline_events_total[5m]))
What about a ratio of event errors to total?
sum(rate(foobar_pipeline_events_errors_total[5m]))
/
sum(rate(foobar_pipeline_events_total[5m]))
Now for a “real” SLO! Above we have stated that we want to process events in under 10 seconds. Let’s finish up the SLO by expanding that to processing 95% of events in the last 24 hours in under 10 seconds.
1 - rate(foobar_pipeline_events_over_slo_total[1d])
/
rate(foobar_pipeline_events_total[1d])
< 0.95
It's not a real SLO measurement, because Prometheus doesn't handle Business Intelligence data easily, but it's close enough for most folks. Shucks, let's look at a Google SRE Burn Rate for the past hour. This is much more suitable for Prometheus.
sum(rate(foobar_pipeline_events_errors_total[1h]))
/
sum(rate(foobar_pipeline_events_total[1h]))
/
(1 - 0.95) > 14.4
In this Burn Rate example, where we measure how quickly we are burning away our error budget, we are working with a slightly different goal, something much closer to a real SLO. In this case we have an SLO to handle 95% of events in under 10 seconds over 30 days. (If latency misses rather than errors are what violate your objective, substitute foobar_pipeline_events_over_slo_total in the numerator.) When the burn rate exceeds 14.4 for an hour, we know that we have burned 2% of our error budget in that hour.
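To see where those numbers come from: a 30-day SLO window is 720 hours, so sustaining a burn rate of 14.4 for one hour consumes 14.4 × (1 / 720) = 0.02, or 2%, of the total error budget.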
The best part of these methods? They are exact. These are straight ratios of real event counts, absolutely not the estimations you get from histogram_quantile().