LinuxCzar

Code, Linux, and Operations. The website of Jack Neely.

I’m a big fan of using histograms for metrics and visibility. Over a StatsD-like approach that offers a series of summary metrics, histograms give us the ability to: Actually visualize the distribution. You can see if your distribution is multimodal, for example. This is done with a heatmap. Aggregation. You can aggregate histograms (with the same bucket boundaries) together and produce summary metrics for an entire service. Remember, if you generate percentiles for each application instance you cannot aggregate those to get a global percentile for the entire service without the raw data.

Full Article...

I was able to attend Monitorama PDX 2017 this past week and had a blast. If you are interesting in monitoring, metrics, related data analysis, alerting, and of course logs then this is the conference for you. It struck me at Monitorama that many of us came of age in the pre-microservices (Service Oriented Architecture or SOA) world. But services in a SOA environment are different and should be monitored differently than what we may be used to.

Here’s my take on the basic tenets of monitoring infrastructure and best practices for Service Oriented Architectures. I’m an Operations Engineer doing visibility work for a fairly large client so this comes from the viewpoint of the caretaker of monitoring services. If you are a developer and don’t agree, let me know!

Full Article...

Designing systems at scale often means building some sort of “cluster.” These are common and important questions to answer:

Full Article...

One of the killer app features of Prometheus is its native support of histograms. The move toward supporting and using histograms in metrics and data based monitoring communities has been, frankly, revolutionary. I know I don’t want to look back. If you are still relying on somehow aggregating means and percentiles (say from StatsD) to visualize information like latencies about a service, you are making decisions based on lies.

I wanted to dig into Prometheus’ use of histograms. I found some good, some bad, and some ugly. Also, some potential best practices that will help you achieve better accuracy in quantile estimations.

Full Article...

There are many factors that limit the available bandwidth of a network link from point A to point B. Knowing and expecting something reasonably close to the theoretical maximum bandwidth is one thing. However, the latency of the link can vastly affect available throughput. This is called the Bandwidth Delay Product. It can be thought of as the “memory” of the link as well (although that memory is the send/receive buffers).

Full Article...

I’ve been experimenting with Cyanite to make my Graphite cluster more reliable. The main problem I face is when a data node goes down the Graphite web app, more or less, stops responding to requests. Cyanite is a daemon written in Clojure that runs on the JVM. The daemon is stateless and stores timeseries data in Cassandra. I found the documentation a bit lacking, so here’s how to setup Cyanite to build a scalable Graphite storage backend.

Full Article...

I’ve been researching quite a few algorithms for my client, Bruce, as I continue to scale out certain systems. I thought that getting them on my blog would be very useful for a future version of myself and many others. I suspect, and hope, that most folks that work in Systems Administration / Operations will find these at least familiar. Flap Detection Flap detection is the ability to determine rapid state changes so that one can take corrective action.

Full Article...

This is a update of an old post so its back at the top of the blog. Original posting was 2015-08-19. I’m considering swapping out Statsd with Bitly’s statsdaemon for better performance. But, because Bitly’s version only accepts integer data I wanted to analyze our Statsd traffic. I figured I’d use my friend tcpdump to capture some trafic samples and replay them through a test box for analysis. Also, figuring out what are our hot metrics is very handy.

Full Article...