I’ve been thinking about the “future” and how I can move my metrics system from what I have now to what I’d like to have. I run a large Graphite cluster, in the range of 26 million metrics per minute with a quarter petabyte of provisioned storage. I integrate with a Naemon (Nagios fork) and Merlin setup for alerting.
I’ve been following Prometheus for a year and wondering about what the future might be like. Turns out, my fellow Operations team members and the Developers are also highly interested in Prometheus or a tool that offers Prometheus-like features. Specifically:
- Support ephemeral hosts: Be smarter about how metrics are managed so that each host adds metric data without polluting the namespace with thousands of host entries.
- Scale storage: No more Whisper files. Storage needs to scale based on the timestamp/value pairs we store rather than a pre-allocated chunk of disk space.
- Scale to a multi-data center environment: Graphite isn’t designed to make multiple clusters in different data centers or regions work well together, although modern versions of Grafana can really help there. Prometheus handles this style of sharding natively.
- Ability to tag or label metrics: This makes ephemeral hosts work well combined with storage allocated as needed (rather than allocating all possible storage at once).
- Support advanced metric-based alerting: This is a strength of Prometheus, and we can funnel its alerts through our Nagios-based monitoring to deal with pager groups, etc.
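To put a number on the pre-allocation complaint, here is a rough sketch of how much disk a single Whisper file reserves up front. The byte sizes come from the Whisper file layout (16-byte metadata, 12 bytes per archive header, 12 bytes per point). The retention schedule and the one-hour host lifetime are made-up values for illustration, not my production settings.

```python
# Whisper on-disk layout constants.
METADATA_SIZE = 16      # aggregation type, max retention, xFilesFactor, archive count
ARCHIVE_INFO_SIZE = 12  # offset, seconds per point, number of points
POINT_SIZE = 12         # 4-byte timestamp + 8-byte double value


def whisper_file_size(archives):
    """Return the bytes pre-allocated for one metric.

    archives is a list of (seconds_per_point, retention_seconds) tuples.
    """
    header = METADATA_SIZE + ARCHIVE_INFO_SIZE * len(archives)
    points = sum(retention // step for step, retention in archives)
    return header + points * POINT_SIZE


YEAR = 365 * 24 * 3600

# Hypothetical schedule: one-minute resolution kept for two years.
allocated = whisper_file_size([(60, 2 * YEAR)])

# Hypothetical ephemeral host that lives one hour and reports once a minute.
written = 60 * POINT_SIZE

print(allocated, written)
```

With those assumptions each metric reserves roughly 12.6 MB on creation, while the short-lived host only ever writes 720 bytes into it. Multiply that by every metric on every ephemeral host and the appeal of storing only the pairs you actually receive becomes obvious.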
So, how does one get from a monolithic Graphite setup to something like the above? A good question that I’m still trying to work out. Here’s what I’m thinking.
- Keep our Nagios based alerting system. It routes alerts, handles paging time periods, and, most importantly, handles alerts from many different sources via checks. Uses PagerDuty, email, etc. as appropriate.
- Keep the current check_graphite code we are using to do metric based alerting. It enables us to transition when we can and roll back if needed.
- Set up a Prometheus / AlertManager instance for any global aggregation and to route alerts from Prometheus metric-based checks to Nagios.
- Upgrade Grafana to 2.5 (or better) to be the global user interface to metrics and be able to pull data from many different sources: Graphite, Prometheus, and Elasticsearch.
- Scale Graphite storage with some form of black magic.
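For readers who haven’t seen a metric-based Nagios check, the idea behind keeping check_graphite looks roughly like the sketch below. This is not our actual check_graphite code. The function name, thresholds, and sample payload are hypothetical. It only assumes the JSON shape Graphite’s render API returns with format=json (a list of series, each with a list of [value, timestamp] datapoints) and the standard Nagios plugin exit codes.

```python
import json

# Standard Nagios plugin exit codes.
OK, WARNING, CRITICAL, UNKNOWN = 0, 1, 2, 3


def check_series(render_json, warn, crit):
    """Average the non-null datapoints and compare against thresholds."""
    series = json.loads(render_json)
    values = [v for s in series for v, _ts in s["datapoints"] if v is not None]
    if not values:
        return UNKNOWN, "UNKNOWN: no datapoints returned"
    avg = sum(values) / len(values)
    if avg >= crit:
        return CRITICAL, f"CRITICAL: avg={avg:.2f}"
    if avg >= warn:
        return WARNING, f"WARNING: avg={avg:.2f}"
    return OK, f"OK: avg={avg:.2f}"


# Hypothetical response body for a single target; null marks a missing point.
sample = (
    '[{"target": "servers.web01.load", '
    '"datapoints": [[1.0, 1450000000], [null, 1450000060], [3.0, 1450000120]]}]'
)

status, message = check_series(sample, warn=1.5, crit=4.0)
print(status, message)
```

A real plugin would fetch the JSON from the render endpoint and exit with the status code. Because the check only depends on that interface, it can keep running against Graphite while the same alert is being built in Prometheus, which is what makes the rollback story work.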
Sharded Systems: These systems are the infrastructure set up as part of each data center or region.
- A Prometheus server to scrape metrics from local systems and services. Each Prometheus server maps and forwards data points to Graphite. Perhaps an identical second server for redundancy. Alerts and aggregate metrics flow upward toward the global Prometheus service.
- A local Graphite/Statsd ingestion service found by service discovery to handle and route old school metrics.
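The “maps and forwards” step is the interesting part: a labeled Prometheus sample has to be flattened into a dotted Graphite path. Here is one way that mapping could look. The naming scheme, the "prometheus" prefix, and the function names are my own assumptions for illustration, not necessarily how Prometheus does it. The only fixed piece is Graphite’s plaintext protocol line format of path, value, timestamp.

```python
import re


def sanitize(value):
    """Replace characters that would break a dotted Graphite path."""
    return re.sub(r"[^A-Za-z0-9_-]", "_", value)


def graphite_path(name, labels, prefix="prometheus"):
    """Flatten a metric name and its labels into one dotted path.

    Labels are sorted by key so the same series always maps to the same path.
    """
    parts = [prefix, name]
    for key in sorted(labels):
        parts.append(key)
        parts.append(sanitize(labels[key]))
    return ".".join(parts)


def plaintext_line(name, labels, value, timestamp):
    """Format one sample for Graphite's plaintext protocol."""
    return f"{graphite_path(name, labels)} {value} {int(timestamp)}\n"


# Hypothetical sample scraped from an ephemeral host.
line = plaintext_line(
    "http_requests_total",
    {"job": "api", "instance": "10.0.0.5:9100"},
    42,
    1450000000.7,
)
print(line, end="")
```

Note the trade-off: any label that identifies a host (like instance here) still turns into a path component on the Graphite side, so the namespace pollution problem follows the data into long-term storage unless those labels are dropped or aggregated away first.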
This design gives me a Prometheus system we can use for advanced alerting and short-term monitoring of metrics, with great support for ephemeral hosts and labeling. Graphite still collects unconverted metrics and holds our historical or long-term data. Graphite also serves as a long-term storage option for Prometheus. (See this patch.)
What’s left unsolved? The hard parts:
- Long-term metric storage must scale, and Whisper files aren’t cutting it. I need to spend some time with alternate Graphite backends or writing one. Many of the existing options bring along their own challenges. I am required to keep full-resolution data for years.
I have some ideas here. I had hopes for InfluxDB but it does not appear stable. But, I’m thinking something far simpler. More to come here.
Will this work? Will this scale to 20 million metrics or more? Perhaps it’s worth finding out.