# LinuxCzar

Code, Linux, and Operations. The website of Jack Neely.

I originally wrote this post in October of 2008. Where did those 9 years go? I think it's time for an update. As the Linux Czar, I'm regularly asked to interview folks applying to various jobs that require some Linux skills. Interviewing isn't really my strong point, and I always struggle to come up with good questions that will lead a candidate to talk about themselves and their skills in a helpful way.
I'm a big fan of using histograms for metrics and visibility. Compared to a StatsD-like approach that offers a series of summary metrics, histograms give us the ability to:

- Actually visualize the distribution, typically with a heatmap. You can see if your distribution is multimodal, for example.
- Aggregate. You can combine histograms (with the same bucket boundaries) and produce summary metrics for an entire service. Remember, if you generate percentiles for each application instance, you cannot aggregate those into a global percentile for the entire service without the raw data.
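To make that concrete: merging two instances' histograms is just element-wise addition of bucket counts, which is never valid for precomputed percentiles. A minimal sketch (the bucket boundaries and counts below are hypothetical):

```go
package main

import "fmt"

// merge sums per-bucket counts of two histograms that share the same
// bucket boundaries. The result is the exact histogram you would have
// gotten by observing both instances' raw data together.
func merge(a, b []uint64) []uint64 {
	out := make([]uint64, len(a))
	for i := range a {
		out[i] = a[i] + b[i]
	}
	return out
}

func main() {
	// Latency bucket counts from two hypothetical app instances,
	// sharing upper bounds of 5, 10, 25, 50, 100, 250 ms.
	inst1 := []uint64{10, 20, 40, 20, 8, 2}
	inst2 := []uint64{5, 15, 50, 25, 4, 1}
	fmt.Println(merge(inst1, inst2)) // service-wide bucket counts
}
```

From the merged counts you can then estimate service-wide percentiles, which you could never recover from two per-instance p99 values.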

I was able to attend Monitorama PDX 2017 this past week and had a blast. If you are interested in monitoring, metrics, related data analysis, alerting, and of course logs, then this is the conference for you. It struck me at Monitorama that many of us came of age in the pre-microservices (Service Oriented Architecture or SOA) world. But services in a SOA environment are different and should be monitored differently than what we may be used to.

Here’s my take on the basic tenets of monitoring infrastructure and best practices for Service Oriented Architectures. I’m an Operations Engineer doing visibility work for a fairly large client so this comes from the viewpoint of the caretaker of monitoring services. If you are a developer and don’t agree, let me know!

Designing systems at scale often means building some sort of “cluster.” These are common and important questions to answer:

One of the killer app features of Prometheus is its native support of histograms. The move toward supporting and using histograms in the metrics and data-based monitoring communities has been, frankly, revolutionary. I know I don't want to look back. If you are still relying on somehow aggregating means and percentiles (say from StatsD) to visualize information like latencies about a service, you are making decisions based on lies.

I wanted to dig into Prometheus’ use of histograms. I found some good, some bad, and some ugly. Also, some potential best practices that will help you achieve better accuracy in quantile estimations.
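To show where that accuracy comes from (and where it goes wrong), here is a simplified sketch of the linear-interpolation scheme Prometheus's `histogram_quantile()` applies to cumulative bucket counts. This is an illustration, not Prometheus's actual code: the real function also handles the `+Inf` bucket and rated counters, and the bucket bounds and counts below are made up.

```go
package main

import "fmt"

// quantile estimates the q-th quantile from a histogram with cumulative
// bucket counts (Prometheus-style "le" upper bounds). It finds the
// bucket containing the target rank, then interpolates linearly within
// that bucket. The estimate is only as good as the bucket layout:
// everything inside a bucket is assumed uniformly distributed.
func quantile(q float64, bounds []float64, cum []uint64) float64 {
	total := float64(cum[len(cum)-1])
	rank := q * total
	lower, prev := 0.0, 0.0
	for i, c := range cum {
		if float64(c) >= rank {
			frac := (rank - prev) / (float64(c) - prev)
			return lower + (bounds[i]-lower)*frac
		}
		lower, prev = bounds[i], float64(c)
	}
	return bounds[len(bounds)-1]
}

func main() {
	bounds := []float64{100, 250, 500} // upper bounds in ms (hypothetical)
	cum := []uint64{50, 90, 100}       // cumulative observation counts
	fmt.Println(quantile(0.50, bounds, cum))
	fmt.Println(quantile(0.90, bounds, cum))
}
```

Note how the p90 estimate snaps to a bucket boundary: choosing bucket boundaries near the quantiles you care about is what drives estimation accuracy.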

There are many factors that limit the available bandwidth of a network link from point A to point B. Knowing and expecting something reasonably close to the theoretical maximum bandwidth is one thing. However, the latency of the link can vastly affect available throughput. The product of the link's bandwidth and its round-trip latency is called the Bandwidth Delay Product. It can be thought of as the "memory" of the link as well (although that memory is really the send/receive buffers).
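The arithmetic is simple enough to sketch. The 1 Gbit/s bandwidth and 50 ms round-trip time below are hypothetical example values:

```go
package main

import "fmt"

// bdpBytes returns the Bandwidth Delay Product in bytes: the amount of
// data that can be "in flight" on the link at once. TCP send/receive
// buffers need to be at least this large to keep the pipe full.
func bdpBytes(bandwidthBitsPerSec, rttSeconds float64) float64 {
	return bandwidthBitsPerSec * rttSeconds / 8
}

func main() {
	// Hypothetical 1 Gbit/s link with a 50 ms round-trip time.
	bdp := bdpBytes(1e9, 0.050)
	fmt.Printf("BDP: %.0f bytes (~%.2f MB)\n", bdp, bdp/1e6)
}
```

With default socket buffers far smaller than that ~6 MB figure, a sender on such a link stalls waiting for ACKs long before it approaches the theoretical maximum bandwidth.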
I've been experimenting with Cyanite to make my Graphite cluster more reliable. The main problem I face is that when a data node goes down, the Graphite web app more or less stops responding to requests. Cyanite is a daemon written in Clojure that runs on the JVM. The daemon is stateless and stores timeseries data in Cassandra. I found the documentation a bit lacking, so here's how to set up Cyanite to build a scalable Graphite storage backend.

I’ve updated Buckytools, my suite for managing at scale consistent hashing Graphite clusters, with a few minor changes.

In Go 1.5 we have the beginnings of vendoring support. The easiest way to incorporate other projects into your Git repo is with the following command. Short and dirty, but it works:

```
$ git subtree add --prefix vendor/gopkg.in/check.v1 \
    https://gopkg.in/check.v1 master --squash
```


I've been researching quite a few algorithms for my client, Bruce, as I continue to scale out certain systems. I thought that getting them on my blog would be very useful for a future version of myself and many others. I suspect, and hope, that most folks who work in Systems Administration / Operations will find these at least familiar.

## Flap Detection

Flap detection is the ability to detect rapid state changes so that one can take corrective action.
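One common approach is to score the weighted fraction of state transitions over a sliding window of check results, in the spirit of Nagios-style flap detection. The window size, linear weighting, and example thresholds below are illustrative assumptions, not any particular tool's actual values:

```go
package main

import "fmt"

// flapScore returns the weighted fraction of state changes in a window
// of check results (true = up, false = down). Recent transitions are
// weighted more heavily, so a service that flapped an hour ago scores
// lower than one flapping right now. A score near 0 means a stable (or
// cleanly transitioned) state; a score near 1 means constant flapping.
func flapScore(states []bool) float64 {
	if len(states) < 2 {
		return 0
	}
	var score, total float64
	for i := 1; i < len(states); i++ {
		// Weight grows linearly toward the most recent transition.
		w := float64(i) / float64(len(states)-1)
		total += w
		if states[i] != states[i-1] {
			score += w
		}
	}
	return score / total
}

func main() {
	steady := []bool{true, true, true, true, false, false, false, false}
	flappy := []bool{true, false, true, false, true, false, true, false}
	fmt.Printf("steady: %.2f flappy: %.2f\n",
		flapScore(steady), flapScore(flappy)) // steady: 0.14 flappy: 1.00
}
```

A monitoring system would then suppress notifications while the score sits above some threshold, and resume them once it decays back below a lower one (hysteresis keeps the flap detector itself from flapping).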