If a job in Linux System Administration / Operations can teach you one thing, it’s how to keep up with the ever-changing landscape that is Open Source. I’ve been working with Linux for 20+ years, and with that comes, hopefully, some wisdom of experience. Linux distributions and Open Source are divergent in terms of change: the more things change, the more things there are to change.
I was able to attend Monitorama PDX 2017 this past week and had a blast. If you are interested in monitoring, metrics, related data analysis, alerting, and of course logs, then this is the conference for you. It struck me at Monitorama that many of us came of age in the world before microservices (Service Oriented Architecture, or SOA). But services in a SOA environment are different and should be monitored differently from what we may be used to.
Here’s my take on the basic tenets of monitoring infrastructure and best practices for Service Oriented Architectures. I’m an Operations Engineer doing visibility work for a fairly large client, so this comes from the viewpoint of the caretaker of monitoring services. If you are a developer and don’t agree, let me know!
Designing systems at scale often means building some sort of “cluster.” These are common and important questions to answer:
One of the killer features of Prometheus is its native support for histograms. The move toward supporting and using histograms in the metrics- and data-based monitoring communities has been, frankly, revolutionary. I know I don’t want to look back. If you are still relying on somehow aggregating means and percentiles (say, from StatsD) to visualize information about a service, such as latencies, you are making decisions based on lies.
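To make the "lies" concrete, here is a minimal sketch of what goes wrong when per-host percentiles are averaged. The two hosts, their latency values, and the nearest-rank `percentile` helper are all made up for illustration; this is not how StatsD or Prometheus compute anything internally.

```python
import math


def percentile(values, q):
    """Nearest-rank percentile: the smallest sample such that at least
    q percent of all samples are less than or equal to it."""
    ordered = sorted(values)
    rank = max(1, math.ceil(q * len(ordered) / 100))
    return ordered[rank - 1]


# Hypothetical request latencies in milliseconds.
host_a = [10] * 990 + [500] * 10  # busy host: 1000 requests, 1% slow
host_b = [10] * 8 + [500] * 2     # quiet host: 10 requests, 20% slow

# What you get when each host reports its own p95 and you average them.
p95_a = percentile(host_a, 95)
p95_b = percentile(host_b, 95)
averaged_p95 = (p95_a + p95_b) / 2

# What the service as a whole actually experienced.
true_p95 = percentile(host_a + host_b, 95)

print(f"p95 host A:       {p95_a} ms")
print(f"p95 host B:       {p95_b} ms")
print(f"average of p95s:  {averaged_p95} ms")
print(f"true overall p95: {true_p95} ms")
```

The averaged figure gives the quiet host's ten requests the same weight as the busy host's thousand, so it lands far from the percentile of the combined traffic. Histogram bucket counts, by contrast, can be summed across hosts before any quantile is estimated.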
I wanted to dig into Prometheus’ use of histograms. I found some good, some bad, and some ugly, along with some potential best practices that will help you achieve better accuracy in quantile estimations.
I’ve updated Buckytools, my suite for managing consistent-hashing Graphite clusters at scale, with a few minor changes.