LinuxCzar

Code, Linux, and Operations. The website of Jack Neely.

Rule #4 states that failure will happen; therefore, you should plan for that eventuality. The Linux workstations I build and use (if I have any say about it) have at least 2 hard drives in some mirrored or otherwise redundant fashion. My current pattern is to build workstations with a small (120G or thereabouts) SSD as the boot drive that contains my OS install and swap space. /home, scratch, and possibly other areas are mounted from a 2-disk mirrored array of spinning rust.

Full Article...

If a job in Linux System Administration / Operations can teach you one thing, it's how to keep up with the ever-changing landscape that Open Source is. I’ve been working with Linux for 20+ years, and with that comes some, hopefully, wisdom of experience. Linux distributions and Open Source are divergent in terms of change. The more things change, the more things there are to change.

Full Article...

I originally wrote this post in October of 2008. Where did those 9 years go? I think it's time for an update. As the Linux Czar, I’m regularly asked to interview folks that are applying to various jobs that require some Linux skills. Interviewing isn’t really my strong point, and I always struggle to come up with good questions that will lead candidates to talk about themselves and their skills in a helpful way.

Full Article...

I’m a big fan of using histograms for metrics and visibility. Over a StatsD-like approach that offers a series of summary metrics, histograms give us the ability to:

- Actually visualize the distribution. You can see if your distribution is multimodal, for example. This is done with a heatmap.
- Aggregation. You can aggregate histograms (with the same bucket boundaries) together and produce summary metrics for an entire service. Remember, if you generate percentiles for each application instance you cannot aggregate those to get a global percentile for the entire service without the raw data.
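The aggregation point above can be sketched in a few lines. This is a minimal illustration, not code from the article; the bucket boundaries and counts are made-up values, and it only works because every instance shares the same boundaries:

```python
# Upper bounds (seconds) shared by every instance's histogram.
# Illustrative values only.
BOUNDS = [0.01, 0.05, 0.1, 0.5, 1.0]

def merge(histograms):
    """Element-wise sum of bucket counts. Only valid when all
    histograms were recorded with identical bucket boundaries."""
    merged = [0] * len(BOUNDS)
    for counts in histograms:
        for i, c in enumerate(counts):
            merged[i] += c
    return merged

# Two application instances reporting per-bucket latency counts.
instance_a = [5, 20, 40, 10, 2]
instance_b = [8, 15, 30, 20, 5]

service = merge([instance_a, instance_b])
print(service)  # [13, 35, 70, 30, 7]
```

Contrast this with per-instance percentiles: two p99 values cannot be combined into a service-wide p99, but two bucket-count vectors sum cleanly.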

Full Article...

I was able to attend Monitorama PDX 2017 this past week and had a blast. If you are interested in monitoring, metrics, related data analysis, alerting, and of course logs, then this is the conference for you. It struck me at Monitorama that many of us came of age in the pre-microservices (Service Oriented Architecture or SOA) world. But services in a SOA environment are different and should be monitored differently than what we may be used to.

Here’s my take on the basic tenets of monitoring infrastructure and best practices for Service Oriented Architectures. I’m an Operations Engineer doing visibility work for a fairly large client so this comes from the viewpoint of the caretaker of monitoring services. If you are a developer and don’t agree, let me know!

Full Article...

Designing systems at scale often means building some sort of “cluster.” These are common and important questions to answer:

Full Article...

One of the killer app features of Prometheus is its native support of histograms. The move toward supporting and using histograms in the metrics and data-based monitoring communities has been, frankly, revolutionary. I know I don’t want to look back. If you are still relying on somehow aggregating means and percentiles (say, from StatsD) to visualize information like latencies about a service, you are making decisions based on lies.

I wanted to dig into Prometheus’ use of histograms. I found some good, some bad, and some ugly. Also, some potential best practices that will help you achieve better accuracy in quantile estimations.
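To make the accuracy point concrete, here is a hedged sketch of how bucket-based quantile estimation (in the style of Prometheus's `histogram_quantile()`) works: find the bucket containing the target rank, then interpolate linearly within it. The bucket data is illustrative, not from the article:

```python
def estimate_quantile(q, bounds, counts):
    """Estimate the q-th quantile from a histogram.
    bounds: ascending per-bucket upper bounds; counts: per-bucket counts."""
    total = sum(counts)
    rank = q * total
    cum = 0
    lower = 0.0
    for upper, c in zip(bounds, counts):
        if c and cum + c >= rank:
            # Linear interpolation inside the bucket: assumes observations
            # are uniformly spread within it. This assumption is the source
            # of the estimation error, which is why bucket boundary choice
            # matters so much for accuracy.
            return lower + (upper - lower) * (rank - cum) / c
        cum += c
        lower = upper
    return bounds[-1]

bounds = [0.1, 0.25, 0.5, 1.0]   # seconds, illustrative
counts = [10, 40, 40, 10]
print(estimate_quantile(0.9, bounds, counts))  # 0.5
```

The closer your bucket boundaries bracket the quantiles you care about, the smaller the interpolation error.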

Full Article...

There are many factors that limit the available bandwidth of a network link from point A to point B. Knowing and expecting something reasonably close to the theoretical maximum bandwidth is one thing. However, the latency of the link can vastly affect available throughput. This is the Bandwidth Delay Product. It can be thought of as the “memory” of the link as well (that memory being the send/receive buffers).
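The calculation itself is simple: bandwidth times round-trip time gives the amount of data in flight on the link, which the send/receive buffers must be able to hold to keep the pipe full. A quick sketch with illustrative numbers:

```python
def bdp_bytes(bandwidth_bits_per_sec, rtt_seconds):
    """Bandwidth Delay Product: bandwidth * RTT, converted to bytes.
    This is the buffer size needed to keep the link saturated."""
    return bandwidth_bits_per_sec * rtt_seconds / 8

# A 1 Gbit/s link with an 80 ms RTT holds ~10 MB in flight,
# so buffers smaller than that cap the achievable throughput.
print(bdp_bytes(1_000_000_000, 0.080))  # 10000000.0
```

Flip the formula around and you get the ceiling a fixed buffer imposes: with the default-sized TCP buffers on a high-latency link, throughput tops out at buffer size divided by RTT, no matter how fat the pipe is.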

Full Article...

I’ve been experimenting with Cyanite to make my Graphite cluster more reliable. The main problem I face is when a data node goes down the Graphite web app, more or less, stops responding to requests. Cyanite is a daemon written in Clojure that runs on the JVM. The daemon is stateless and stores timeseries data in Cassandra. I found the documentation a bit lacking, so here’s how to setup Cyanite to build a scalable Graphite storage backend.

Full Article...