While working with a large, consistent-hashing Graphite cluster I came across
corrupt files. Corrupt files prevent
carbon-cache.py from storing data
in that specific metric database file. The backlog was starting to tank
the cluster. I whipped out
find and removed all zero-length files, as
that is a common corruption case.
find /opt/graphite/storage/whisper -depth -name '*.wsp' -size 0c -type f -delete
However, I had a few more cases that were not zero-length files. A quick
bit of Googling did not turn up much. Usually, reading the header
of the WSP file is enough to have the Whisper code throw an exception, so
using that I wrote Whisper-FSCK.
It will scan your tree of Whisper files and look for corrupted ones. With
the -f argument it will move those files out of the way.
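For example, a run might look like this (the script name and invocation here are illustrative; only the -f behavior is described above):

whisper-fsck.py /opt/graphite/storage/whisper         # report corrupt files
whisper-fsck.py -f /opt/graphite/storage/whisper      # move corrupt files out of the way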
Pull requests welcome!
Introducing StatsRelay, a proxy
daemon for Statsd style metrics
written in Go.
What does it do?
StatsRelay is designed to help you scale out your ingestion of Statsd
metrics. It is a simple proxy that you send your Statsd metrics to. It
will then forward your metrics to a list of backend Statsd daemons. A
consistent hashing function is used with each metric name to determine
which of the Statsd backends will receive the metric. This ensures that
only one Statsd backend daemon is responsible for a specific metric.
This prevents Graphite or your upstream time series database from
recording partial results.
Why would you use it?
Do you have an application tier of multiple machines that send updates
for the same metric into Statsd?
When you need to engineer a scalable Statsd ingestion service you need a
way to balance between more than one Statsd daemon. StatsRelay provides
that functionality. You can also use multiple StatsRelay daemons behind
a UDP load balancer like LVS to further scale out your infrastructure.
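As a rough sketch of that topology, an LVS virtual service for UDP might be configured like this with ipvsadm; the addresses, port, and source-hashing scheduler are my assumptions, not anything prescribed by StatsRelay:

# Virtual UDP service on the load balancer
ipvsadm -A -u 10.0.0.100:8125 -s sh
# Two StatsRelay real servers behind it, NAT mode
ipvsadm -a -u 10.0.0.100:8125 -r 10.0.1.11:8125 -m
ipvsadm -a -u 10.0.0.100:8125 -r 10.0.1.12:8125 -m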
StatsRelay is designed to be fast, which is the primary reason it is
written in Go. The StatsRelay daemon has been benchmarked at handling
200,000 UDP packets per second. It batches the metrics it receives into
larger UDP packets before sending them off to the Statsd backends. As
the string processing is faster than system calls, this further
increases the number of metrics that each Statsd daemon is able to handle.
When shouldn’t you use StatsRelay?
In many cases you might want to run Statsd on each client machine and
let it aggregate and report metrics to Graphite from that point. If each
client only produces unique metric names this is the approach you
should use. This doesn’t work, however, when you have multiple machines
that need to increment the same counter, for example.
What’s wrong with Statsd?
Etsy’s Statsd tool is really quite excellent. It’s written in NodeJS
which, event driven as it may be, is not what I would call fast. The daemon
is a single process which only scales so far. Testing showed that the
daemon would drop packets as it approached 40,000 packets per second as
it would peg the CPU core it ran on at 100%. I needed a solution for an
order of magnitude more traffic.
But, Hey! Statsd comes with a proxy tool!
New versions of Etsy’s Statsd distribution do come with a NodeJS proxy
implementation that does much the same thing. Similar to the Statsd
daemon, the code, in single process mode, would top out around 40,000
packets per second and 100% CPU. Testing showed that the underlying
Statsd daemons were not getting all of that traffic either.
I checked back on this proxy after it had been developed further to find
that it had a
forkCount configuration parameter and what looked like a
good start at a multi-process mode. I tested it again with my statsd
load generator which produced about 175,000 packets per second, which
was well inside the packets per second I needed to support in
production. Setting the
forkCount to 4, I found 4 processes each
consuming 200% CPU and 2G of memory. The code was still dropping packets.
At about 175,000 packets per second this Go implementation uses about
10M of memory and about 60% CPU. No packets lost.
Fork the StatsRelay repository
and submit a pull request on GitHub.
Things that need work:
- Add health checking of the underlying Statsd daemons
- Profile and tune for speed and packet throughput
Documentation is every IT professional’s job. I keep a 3x5 notecard with
a generic layout I use to write documentation and all of my IT related
stuff follows this pattern. I’m actually tired of keeping up with the
card, so I’m going to put it here.
I lifted and simplified this layout from Tom
Limoncelli’s Ops Report
Card section on documentation.
Overview or Summary
- A summary of what this is.
- Where does this service live?
- Why do we need it?
- Upstream documentation
- Design (Perhaps its own section outright)
- Diagrams of logic or data flow
- Other moving parts that make the whole
- Subject Matter Expert contacts
Common Tasks or Process
- Common tasks needed for care and feeding.
- If this is a documented process, that process, step by step, goes here
Deployment or Building
- Do we build the software locally? How do we do it?
- How do we deploy more of these machines or replace busted ones
- Where are our configs in Puppet/Chef/Ansible
- Hardware Requirements
Failure Modes
- How might the system fail?
- What does failure mean?
- What risks do we run?
- What to do to restore each service or part
- What side effects happen when specific parts are down or degraded
Disaster Recovery Plans
- How is (or isn’t) this system recoverable from a disaster situation?
- What disasters have we planned for?
- HA plans can fit here too
- Steps that need to happen to recover
- Service Level Agreement (either real/legal or social)
Notes
- Any notes about the service
- Things that don’t fit well above
- Future to do or improvements
- Uncommonly needed tasks
Docker encourages its users to build
containers that log to standard out or standard error. In fact, it’s now
considered a best practice to do so. Your process controller
should combine the logs from all processes and do something useful with
them. Like write them all to a file without your app having to worry
about locking. Or, maybe even to Syslog.
Docker supports this practice and collects logs for us, in JSON, to add
missing timestamps and to work well with
LogStash. But what is a show-stopping issue for
us is that these files grow boundlessly. You cannot use the logrotate
utility with them because the Docker daemon will not re-open the file.
Well, unless you stop/start the specific container. Docker logging
issues are an ongoing discussion,
and this is clearly an area where Docker will improve in the future.
There are two other widely accepted ways of working around this:
- Bind mount in
/dev/log and off-load logs to the host’s Syslog
- Mount a volume from the host or a different container where logs
will be processed.
The second point is out. Same problem of not being able to easily tell
the app to re-open files for log rotation without restarting the
container. Bind mounting
/dev/log and off-loading logs to the system’s log daemon sounds
like a good idea. The Docker host can provide this service arbitrarily
to all containers. Containers need not deal with (much) logging
complexity inside them.
However, this approach has multiple problems.
Off loading logs to the host’s Syslog most likely means that you want to
add some additional configuration to rsyslog which requires a restart of
the rsyslog daemon. (Say, you want to stick your logs in a specific,
app-specific file.) The first thing rsyslog does when it starts is
re-create its
/dev/log socket. At this point, any running Docker
container that has already bind mounted
/dev/log now has the old socket,
not the newly created one. In any case, rsyslog is no longer listening
to any of the currently running containers for logs. Full stop. This
method doesn’t pass the smoke test.
What ended up working for me was using the network, but it added
complexity to the Docker host. I’m managing Docker hosts with
Ansible so this wasn’t a huge problem.
I’d rather tune my Docker hosts than alter each image and container. I
set the network range on the
docker0 bridge interface to a specific
and private IP range. Now, my Docker hosts always have a known IP
address that my Docker containers can make connections to. In the
Docker daemon’s options:
DOCKER_OPTS="--ip 127.0.0.1 --bip 172.30.0.1/16"
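Since Ansible manages these hosts, pinning that setting is a single task. A minimal sketch, where the defaults file path and the handler name are my assumptions:

- name: Pin docker0 to a known bridge address
  lineinfile:
    dest: /etc/default/docker
    regexp: '^DOCKER_OPTS='
    line: 'DOCKER_OPTS="--ip 127.0.0.1 --bip 172.30.0.1/16"'
  notify: restart docker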
I configured rsyslog on the host to listen for UDP traffic and bind only
to this private address:
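# /etc/rsyslog.d/10-docker.conf -- a minimal sketch; drop-in file name is my choice
$ModLoad imudp
$UDPServerAddress 172.30.0.1
$UDPServerRun 514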
I then built my image to run the process with its output piped to
logger using the
-n option to specify my syslog server. Guess what.
util-linux in Ubuntu Trusty (and other releases) is 2.20 which
dates from 2011-ish. The
logger utility has known bugs.
Specifically, the
-n option is ignored silently unless you also
specify a local UNIX socket to write to. This version of util-linux
also does not have the
nsenter command which is very handy when
working with Docker containers either. (See
nsenter.) This is a
pretty big frustration.
The final solution was to make my incantations in my Dockerfiles
slightly more complex for apps that do not directly support Syslog. But,
it works:
CMD foobar-server --options 2>&1 \
| logger -n 172.30.0.1 -u /dev/null -t foobar-server -p local0.notice
I promise I’m not logging to /dev/null.
I’ve been thinking about and wanting to write about packages for a long
time. DEBs. RPMs. Pip for Python. CPAN for Perl. Galaxy for Ansible.
The Registry for Docker. Puppet modules from Puppet Forge. Vagrant Boxes.
Every technology comes with its own distribution format and tooling, it seems.
My recent transition from RHEL to Ubuntu has made one thing very clear.
This mess of packages is intractable. No package format is aware of the
others yet they usually have dependencies that interconnect different
package types. Pip has no knowledge of C libraries required by many
Python packages to build. We SAs usually end up crossing the streams to
produce a working environment. Or we spend hours building packages of
one specific type. (Only to spend even more time on them later.) The end
result is often different package management systems stepping on each
other and producing an unmaintainable and unreproducible system.
I’ve spent, probably, years of my career doing nothing but packaging.
The advantages of packages are still just as relevant today as they were
in the past. It’s a core skill set for running large infrastructures.
Recently, I’ve just about given up trying to deal with packages.
Throw-away VMs. Isolation environments. Images. Advanced configuration
management tools. Applications with conflicting requirements. Does
maintaining a well managed server even matter any more?
I believe it does. A well managed host system keeps things simple and
the SAs sane. However, I believe that there should be a line drawn in
the sand to keep the OS – and tools that manage the OS – separate from
the applications running on that machine or VM. On the OS side of the
line, RPMs or DEBs rule. Configuration management has an iron fist. Your
configuration management and automation should also deploy your
application containers. But now we find the line in the sand.
Your applications, their crazy requirements, as well as whatever
abominable package management scheme needed to get the job done should
live in Docker containers. Here, your configuration management is a git
repo where you can easily rebuild your images. Here, we can use the
tools we need that work the best for the situation at hand without
causing harm to the host system or another application.
Perhaps Docker “packages” are, finally, the one packaging system to rule
them all. There’s just one thing that itches. I know Fedora outright bans
bundled libraries.
Packaging libraries with your applications means that when OpenSSL has a
security vulnerability, you have to patch your OS – and find everywhere
else that library has been stuffed. Itch. Docker containers seem
reasonable about this, but it still means rebuilding and restarting all
of your containers.
The last several months have been a deep dive into
Ansible. Deterministic. Simplistic.
Ideally push based. Uses standard YAML. (I’ve never been much for
inventing your own language.) Most of this work has been with Amazon’s
EC2 service and Ansible’s included dynamic inventory plugin for EC2 has
been invaluable.
However, there’s a lot more this inventory plugin could do. I’ve
developed a series of patches, some of which I’ve submitted pull
requests for. All of which can be had by cloning the
master branch of my
GitHub Ansible fork.
- Do you have more than one AWS account? Need to manage machines in
all of them? One branch extends
ec2.py to query multiple accounts given the proper credentials.
- Making groups from various tags and attributes in AWS is handy. But
I wanted a way to just make arbitrary groups happen. Another
branch supports reading groups from the
ansible_groups AWS tag on
your EC2 instances (see the example after this list).
- Need additional information about the EBS stores mapped to your
instances? A third branch exposes this as inventory variables.
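As an illustration of the ansible_groups tag above, tagging an instance with the AWS CLI might look like this (the instance ID and group name are placeholders):

aws ec2 create-tags --resources i-0123456789abcdef0 \
    --tags Key=ansible_groups,Value=webservers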
One thing I’ve always had difficulty keeping straight is what the
free command in Linux tells you about buffers and cache memory. Is
that buffers and cache used and free (making you do the math of how much
real free memory you have) or does the
free command do the math for
you? In tuning a Nagios setup, I researched it to make sure I had it right.
Misunderstandings about how Linux uses RAM are common and those
misconceptions can lead to quite a few false positives in monitoring
machines for memory pressure conditions. Linux doesn’t eat
RAM, although it will try to use as much
RAM as possible to cache block devices, which makes your machine much
faster.
free does the math for you. Its goal is to inform you about
free memory, not make you do the math yourself:
$ free -m
             total       used       free     shared    buffers     cached
Mem:        257948     256570       1377          0       3336     238963
-/+ buffers/cache:      14270     243677
Swap:       262127          1     262126
The -/+ buffers/cache line’s two values are:
- Buffer/cache memory used subtracted from total used memory.
- Buffer/cache memory used added to total free memory.
So, in the above example, the value
243677 is the value you want to
monitor with Nagios /
Graphite and what us humans would like
to see as the free memory metric.
More precisely, the two values above are calculated like so:
1. Total Used Memory - Buffers - Cached
2. Total RAM - Value obtained in #1
See /proc/meminfo for the gory details.
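If you want to reproduce that math straight from the kernel’s counters, a quick sketch with awk:

awk '$1 ~ /^(MemTotal|MemFree|Buffers|Cached):$/ {m[$1]=$2}
     END {print (m["MemTotal:"] - m["MemFree:"] - m["Buffers:"] - m["Cached:"]) / 1024, "MiB really used"}' /proc/meminfo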
I’ve been thinking about my keyboards of late. This is mine.
I’ve had a collection of IBM Model M keyboards since before college and
I’ve picked up a few more since. My keyboards are old enough to drink.
They click at each other in the ABC store. They feel like home.
Fortunately, what I have known for years (that mechanical keyboards are
superior for us that code or work in operations all day) has gone
mainstream. There are actually quite a few options for new keyboards. A
friend of mine recently got a Das
Keyboard (the completely blank and black
one). I have been looking at the more expensive CODE
Keyboard. The backlit keys and minimalistic
design strike me.
Are they as good as ye olde Model Ms? Less key travel to actuation
perhaps? Not quite as loud? (I’m on the headset a lot with the new job.)
Or are buckling springs the best for an
RSI-free life?
As I’ve been thinking about this through the last week…my mouse button died.
Nothing ever quite stays the same. The best SAs understand this almost
unconsciously. I no longer work for NC State University and have started
with a small company called 42 Lines. It’s been a challenge, and I’ve
been drinking from the fire-hose. Instead of teaching and advising, I’ve
been building.
Clearly, this calls for a website make-over. So, welcome to the new
LinuxCzar powered by Pelican. I’ve been
wanting to move to something other than Wordpress for a number of
reasons. It’s time to get back to Python, lose the ever annoying comment
spam, and make one less Wordpress install to maintain. Best of all, the
website can live in Git as all things should.
For a Python fan, I am using the Octopress theme
ported to Pelican.
This theme has stood out to me as fantastic design for a long time.
Google fonts and the works. However, it does require Ruby tools to
modify the CSS and doesn’t seem to have support for a lot of
reStructuredText directives. So it may not last. Know of a better theme?
However, there is nothing better than being able to write in
reStructuredText.
I don’t like partitions. MS-DOS partition tables or GUID Partition
Tables (GPT) alike. We use them because you always partition a disk,
right? We use them without understanding the ramifications. The MS-DOS
partition table was designed in 1981. We still blindly use it on almost
every machine today. Do we still surf the Internet with a Commodore 64?
MS-DOS partition tables cannot handle drives of more than 2TiB. Intel
designed GPT in the 1990s to deal with this and add more than 4 (count
‘em, 4) primary partitions. It’s better, and is now part of the EFI
standard. For booting your hardware, a partition table can be quite
handy. Or even required.
However, I’ve long advised folks not to wrap partition tables around
their dedicated storage arrays. Growing storage arrays comes up multiple
times a week for me. An outage is required if there are partition
tables in use. A longer outage is needed if we have to convert the
array to GPT to cross the 2 TiB barrier. I have always advised folks
to use raw LVM physical volumes on their storage arrays. LVM handles all
the manipulation for us in a smooth and consistent way in this case. No
reboots either as long as you instruct Linux to rescan the SCSI bus
after a resize.
echo 1 > /sys/class/scsi_device/DEVICE/rescan
Then grow the LVM physical volume:
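# pvresize picks up the new size of the underlying device;
# substitute your PV's device node for sdX
pvresize /dev/sdX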
Finally, you may resize the logical volumes and your file systems:
lvresize -l+100%FREE -r /dev/mapper/Volume00-foobar
No outage to handle growth in your dedicated storage array.
With so much infrastructure based on VMs (virtual machines) like KVM or
VMWare we can take these concepts further. Growing a root file system
on a VM isn’t hard and shouldn’t require a Linux expert to handle. Many
cloud providers have tools that do this for their customers. Take into
account:
- VMWare and other VM tools are fully capable of doing all of our
storage virtualization that us Linux folks commonly do with LVM.
- Certain enterprise Linux distributions turn off the kernel’s feature
to re-read partition tables for safety. In any case, the only way
to safely grok a modified partition table is to reboot. We can
manipulate file systems without an outage, so why do we need an outage
for a partition table?
- Depending on the VM’s partitioning and layout, it can require a
large amount of skill to move and resize partitions to extend file
systems. If the VM uses LVM that helps in this case.
- What if, by using a standard method of deployment, your worst case
scenario is your VM guys extend the file space, and the customer is
told to run one or two commands as root? What if by using standard
methods of deployment you could automate this process?
I’ve been attempting to build a better way for us to deploy VMs. Each
of these VMs has 3 different virtual disks:
- 512MiB for /boot. This is partitioned with an MS-DOS table so the
machine can boot and the MBR is protected for the bootstrap
procedure. Rarely does /boot need to become larger.
- Swap. 2GiB. Vary according to your needs. Not partitioned! Raw
swap. Depending on memory load, resizing swap can be done online (see
the sketch after this list).
- 30GiB, more, or less depending on your needs. Not partitioned.
This is a raw LVM physical volume used to build your logical volumes
for how you would like your file system separated out.
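Resizing that raw swap device online is a short off/on cycle once the virtual disk has been grown. A sketch, assuming the vmswap label used in the Kickstart snippets below:

swapoff -L vmswap
mkswap -L vmswap /dev/vdb    # re-create the swap signature at the new size
swapon -L vmswap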
The Red Hat installer doesn’t support this layout. So creating and using an
image as a template for your VM farm can be very handy. But whether you
use images or, like me, use Kickstart, you need to get the Installer to
actually install this layout. Below are the relevant Kickstart snippets
that will install into the above configuration – at least with RHEL 6.
But once the above can be reached, it’s a simple matter to grow the third
virtual HD and use
lvresize to extend the native
filesystem without an outage to your systems. (My devices here are
vda, vdb, and vdc.)
clearpart --drives vda --all
part /boot --size 1 --grow --ondisk vda
volgroup Volume00 vdc --useexisting
logvol / --size 8704 --fstype ext4 --vgname=Volume00 --name=root
logvol /tmp --size 2048 --fstype ext4 --vgname=Volume00 --name=tmp
logvol /var --size 7168 --fstype ext4 --vgname=Volume00 --name=var
# Clean up any possible left overs...
vgremove -v -f Volume00
# Wipe any partition table
dd if=/dev/zero of=/dev/vdb bs=512 count=1
dd if=/dev/zero of=/dev/vdc bs=512 count=1
# Create an LVM Volume Group
pvcreate -y /dev/vdc
vgcreate Volume00 /dev/vdc
# Create swap device
mkswap /dev/vdb -L vmswap
swapon -L vmswap
# Setup swap space in fstab
echo "LABEL=vmswap swap swap defaults 0 0" >> /etc/fstab