LinuxCzar

Go, Python, Linux, and Operations. The website of Jack Neely.

I’m just going to call my current client Bruce.

The Graphite cluster I’m working on with Bruce grows almost faster than I can keep up. One of the tools I’ve been transitioning to is carbon-c-relay to do consistent hashing and full metric routing in C. It works very, very well.

However, most of the data Bruce tosses at me utilizes Graphite’s Pickle protocol which carbon-c-relay doesn’t support. I needed an efficient and very fast daemon that could decode Graphite’s Pickle protocol and pass it to carbon-c-relay.

A new tool in the Buckytools project has been born! bucky-pickle-relay listens for Graphite’s Pickle protocol, decodes it using the og-rek Go library and retransmits it to a target relay.
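For context, Graphite’s pickle protocol frames each payload as a 4-byte big-endian length header followed by a pickled list of (metric, (timestamp, value)) tuples. Here is a minimal Python sender sketch (not part of bucky-pickle-relay itself), assuming the relay listens on port 2004 as in the Upstart configuration below:

import pickle
import socket
import struct
import time

# Each datapoint is (metric, (timestamp, value)); the payload is a pickled
# list of these tuples, prefixed with a 4-byte big-endian length header.
datapoints = [("example.metric.count", (int(time.time()), 42))]
payload = pickle.dumps(datapoints, protocol=2)
header = struct.pack("!L", len(payload))

sock = socket.create_connection(("localhost", 2004))
sock.sendall(header + payload)
sock.close()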

An Upstart configuration for it:

description "bucky-picle-relay accepts Graphite pickle data and outputs text data."
author      "Jack Neely"

start on started carbon-c-relay
stop on stopping carbon-c-relay

setuid graphite
setgid graphite

limit nofile 32768 32768

exec /usr/bin/bucky-pickle-relay -b 0.0.0.0:2004 localhost:2003

Comments and patches (especially patches) welcome.


I found a StackOverflow post that helped save my day. I’ve been dealing with a bug in Graphite’s carbon-cache.py daemon (version 0.9.12). It hit me badly during a cluster expansion and I know I have hit it before. I had to find it.

This StackOverflow post gave me the magic I needed to signal the process to dump out a Python traceback of all running threads. That was the catch, dumping all threads.

import threading, sys, traceback
import signal

def dumpstacks(signal, frame):
    # Map thread IDs back to their human-readable names
    id2name = dict([(th.ident, th.name) for th in threading.enumerate()])
    code = []
    for threadId, stack in sys._current_frames().items():
        code.append("\n# Thread: %s(%d)" % (id2name.get(threadId,""), threadId))
        for filename, lineno, name, line in traceback.extract_stack(stack):
            code.append('File: "%s", line %d, in %s' % (filename, lineno, name))
            if line:
                code.append("  %s" % (line.strip()))
    print "\n".join(code)

# Dump every thread's stack when the process receives SIGQUIT
signal.signal(signal.SIGQUIT, dumpstacks)

I changed the print statement to output to Graphite’s log files.

What I knew and what I had:

  • My bug happened after the writer thread hit an unexpected exception in the whisper.update_many() function.
  • I had figured it crashed or somehow hung up the entire writer thread.
  • The number of data points in the cache skyrocketed when the bug manifested and all writes to WSP files stopped.
  • Tracing through the code I knew that unexpected exceptions in whisper.file_update_many() bubbled up past whisper.update_many() which skipped closing of that file.
  • I saw via lsof that the process would hold open the bad Whisper file.
  • This happened under pretty heavy load. About 250,000 metrics per minute.
  • I had a really bad load tester.
  • I had beefy test machines.

It took a while, but I figured out how to reproduce the conditions and hit the bug. I had some extra logging in writer.py to help figure out what was going on. But I knew I needed a stack trace of that particular thread when this happened to nail down exactly what was really going on.

So, I reset my test and added the above code to writer.py. A few minutes later I had triggered the bug again.

# kill -sigquit 3460

Suddenly, I had a traceback in carbon-cache-a/console.log.

# Thread: PoolThread-twisted.internet.reactor-1(140363856660224)
File: "/usr/lib/python2.7/threading.py", line 524, in __bootstrap
  self.__bootstrap_inner()
File: "/usr/lib/python2.7/threading.py", line 551, in __bootstrap_inner
  self.run()
File: "/usr/lib/python2.7/threading.py", line 504, in run
  self.__target(*self.__args, **self.__kwargs)
File: "/usr/local/lib/python2.7/dist-packages/twisted/python/threadpool.py", line 191, in _worker
  result = context.call(ctx, function, *args, **kwargs)
File: "/usr/local/lib/python2.7/dist-packages/twisted/python/context.py", line 118, in callWithContext
  return self.currentContext().callWithContext(ctx, func, *args, **kw)
File: "/usr/local/lib/python2.7/dist-packages/twisted/python/context.py", line 81, in callWithContext
  return func(*args,**kw)
File: "/opt/graphite/lib/carbon/writer.py", line 168, in writeForever
  writeCachedDataPoints()
File: "/opt/graphite/lib/carbon/writer.py", line 145, in writeCachedDataPoints
  whisper.update_many(dbFilePath, datapoints)
File: "/opt/graphite/lib/whisper.py", line 577, in update_many
  return file_update_many(fh, points)
File: "/opt/graphite/lib/whisper.py", line 582, in file_update_many
  fcntl.flock( fh.fileno(), fcntl.LOCK_EX )

Also, a graph to further show that I had hit the bug I was looking for.

Finally, I knew what was happening:

  1. I have locking enabled for safety.
  2. On error, the Whisper file wasn’t closed.
  3. The Python GC never got around to closing the file – which means it was never unlocked.
  4. The metric came around in the cache again to update.
  5. Boom! The writer thread was deadlocked.

The moral of the story: Always close your files.
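The fix boils down to guaranteeing the close. Here is a minimal sketch of the pattern, using a hypothetical safe_update_many() wrapper rather than the actual carbon patch:

import whisper

def safe_update_many(path, datapoints):
    # Open the file ourselves so we control its lifetime;
    # whisper.file_update_many() takes the flock() on this handle.
    fh = open(path, "r+b")
    try:
        whisper.file_update_many(fh, datapoints)
    finally:
        # Closing the handle releases the lock even if the update raised,
        # so the next pass over this metric cannot deadlock on a stale lock.
        fh.close()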


After a bit of Graphite maintenance I had several complaints about graphs being “weird.” Upon inspection they had unusual spikes, data gaps, and negative results where negative numbers made no sense. “L”-shaped spikes describe them best, I think.

This turned out to be a counter metric, like the number of packets your NIC has received. So, a look at the raw data showed the following:

This is the same data but I’ve removed the nonNegativeDerivative() function. To everyone’s surprise, this counter occasionally decreases! This led us to a problem with the client reporting the metrics.

Let’s review what nonNegativeDerivative() does:

Same as the derivative function above, but ignores datapoints that trend down. Useful for counters that increase for a long time, then wrap or reset. (Such as if a network interface is destroyed and recreated by unloading and re-loading a kernel module, common with USB / WiFi cards.)

So data that trended down was removed from the top graph making it look like it had missing data points. Of course, the bogus data was only amplified by the derivative function.
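Roughly, and ignoring time normalization and the optional maxValue counter-wrap handling, the per-point logic works like this illustrative Python re-implementation (not Graphite’s actual code):

def non_negative_derivative(series):
    # Emit the delta between consecutive points, but None when the
    # counter decreases: a drop becomes a gap, never a negative value.
    out = []
    prev = None
    for value in series:
        if prev is None or value is None or value < prev:
            out.append(None)
        else:
            out.append(value - prev)
        prev = value
    return out

# A counter that mistakenly decreases produces gaps in the derived series:
print(non_negative_derivative([100, 110, 125, 90, 130]))  # [None, 10, 15, None, 40]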

So, because we had bogus data coming in, our normal method of showing the rate of change produced very unusual behavior. This is, however, not a fault or error in Graphite. The real question I ended up being left with is how a restart of the load balancer for the Graphite cluster caused a client or two to misbehave.


One of my favorite services to run is NTP. The math that makes it work is elegant, graphing NTP’s performance produces beautiful graphs, and NTP is usually a low maintenance service. Most importantly, accurate time synchronization is crucial to everyday IT functions. This should be something in every Operations Engineer’s tool bag.

Yet, it’s always difficult – socially – to make changes to an NTP infrastructure. Every client I’ve worked with has been hesitant to allow changes to their NTP configuration. Many assume that NTP is “simple” and “working,” so why should it be changed?

Don’t assume that NTP is “simple” and ignore it. Like everything else, one needs a good understanding of how it works to have accurate time synchronization. Here are some tips for running an NTP infrastructure that maintains accurate synchronization and works on all of your machines – including those often drifting virtual machines. At the bottom you’ll find an example NTP configuration.

UTC Is Your Friend

The first tip for setting up your infrastructure to have reasonably accurate time is to set your BIOS clock to UTC. UTC doesn’t have Daylight Saving Time or other weird time changes. Do not keep your BIOS clock in local time.

Hierarchy of NTP

NTP servers build a hierarchy and the level you are on in that hierarchy is called your stratum. The smaller the stratum, the more accurate your synchronization should be. However, there are many thousands (or more) of these hierarchies. You can use more than one (and you should). They may overlap upstream from your servers. This is used to create a large amount of redundancy. But, it can also cause hidden single points of failure. Do some work to identify your upstream sources.

For one or a handful of servers, VMs, or workstations directly connected to the Internet, using NTP sources from a pool is best. Your distro most likely has a good starting point for your NTP configuration in /etc/ntp.conf.

For more machines or machines on a backend network you want to set up your own bit of NTP hierarchy. Normal machines will synchronize to your internal NTP servers. You’ll save bandwidth and not abuse upstream resources.

Upstream Sources

3 to 5 upstream NTP sources provide optimum synchronization and protection against failure scenarios. If you are using pool servers, be sure to use a pool whose members are geographically close to your machines. The vendor pool your Linux distro comes with is not necessarily geographically close to you. For example, I might use:

server 0.us.pool.ntp.org
server 1.us.pool.ntp.org
server 2.us.pool.ntp.org
server 3.us.pool.ntp.org

When building your internal infrastructure your machines should sync with 3 (perhaps more in some situations) of your own internal NTP servers. Your internal NTP servers should sync with 5 upstream sources.

Depending on your resources, one or more of these may be a GPS or atomic time source that your place of business can acquire and install. Remember, you need at least 3 sources, and buying and installing 3 time sources in different data centers (use different brands and sources – not all identical GPS units) does get expensive. A really good compromise is to buy one time source and use 4 other upstream sources on different Internet networks.

Never, ever use only 2 upstream sources. Let’s look at the failure conditions by number of upstream sources. Remember, the most common failure condition is that an NTP source is sending the wrong time. It’s easy to assume that the most common failure situation is a non-responding upstream – that’s probably the second most common.

  1. One upstream is a single point of failure. You are guaranteeing that your NTP infrastructure will just simply have the wrong time.
  2. Worst case. The NTP algorithm (or humans for that matter) cannot look at two time sources that differ and reliably choose the correct one. Its a 50% guess.
  3. Minimal reliable configuration.
  4. Can tolerate the loss of an NTP server and still have sufficient data to detect falsetickers.

Your business’s NTP servers should use stratum 1 or 2 servers from different networks. You should include your ISP’s NTP server if available to better withstand network outages. If you have your own reference source then, of course, you should have that in your NTP server’s configuration. If you do have your own stratum 1 source you might set your NTP servers to prefer that source. Do not let ordinary machines sync directly from your stratum 1 source.

Like any service you need to monitor your NTP servers for health. Pool members change, and NTP servers become overwhelmed. You may need to periodically evaluate your NTP servers’ upstream sources.

Drift Files

Your NTP clients and servers should have the drift file configured. This records the average drift of your machine’s internal clock compared to the upstream time sources. It is used if your machine cannot reach any NTP servers. NTP also uses this when the daemon first starts. This does help in the failure condition of no reachable NTP servers. Hopefully, a short-lived failure condition.

driftfile /var/lib/ntp/ntp.drift

A common issue I see is that the specified directory is missing or the NTP daemon does not have permission to write here. The NTP user should own this directory. Your configuration management system of choice should enforce this.

NTP in VMs, Laptops, and Other Time Stealing Tech

There is a lot of misinformation about how to keep your VMs synchronized, a lot of confusion, and a lot of drifting VMs. Of course, VMs will never be a quality time source – that’s not our goal. Our goal is to reduce the amount that the VM’s internal clock is stepped. (Or completely reset due to a large time difference.) We want our adjustments to slew the clock – this makes a specific second take slightly longer or shorter. Stepping the clock can adversely affect some applications. But in any case we want our VMs to continuously move toward synchronization and not be outside 500 to 1000 milliseconds of sync.

I use NTP on my VMs. VMWare recommends it. Amazon EC2 (Xen HVM) recommends it. Vendors that say they can sync your VM for you are fewer and fewer. Actually, I use the same NTP configuration on all of my servers and keep them identical in this case. With any time-stealing technology you need to instruct your NTP daemon not to panic when it discovers large time differences. NTP has a “safety feature” that causes much of the pain with keeping VMs in sync, and I turn it off.

tinker panic 0

See What Happens If the Reference Time Changes. This is what VMs commonly experience and what I turn off.

A Basic Configuration

So, a basic NTP configuration that I might use on a machine not part of a larger infrastructure is below. This will work on VMs. This trusts the time sources (which has its own article’s worth of ramifications).

# General options
tinker panic 0
## Make sure this directory is owned by NTP
driftfile /var/lib/ntp/drift

# NTP Server Infrastructure
server 0.us.pool.ntp.org
server 1.us.pool.ntp.org
server 2.us.pool.ntp.org
server 3.us.pool.ntp.org

# Access restrictions for this machine
restrict -4 default kod notrap nomodify nopeer noquery
restrict -6 default kod notrap nomodify nopeer noquery
restrict 127.0.0.1
restrict ::1

A Final Note

If you got here, you should really be reading the NTP documentation:

http://www.ntp.org/


Guilty as charged. I enjoy changing my websites and playing with different technologies more than writing actual content. Things have been very busy, and will be busier yet. In brief, here are some things that need their very own write-up.

Website, Powered By Hugo

I love Python’s reStructuredText markup language which is what I used for my Pelican-based website. I, however, was less enthused when none of the themes had any support for reStructuredText’s more “advanced” features. Or anything beyond what Markdown can do. Nor did I want to dig into the Sass to do more in-depth work on the theme.

The last 9 months or so I’ve been very enthralled by Go. Simplicity and efficiency make it a winning choice when working at larger scales. I also encountered Hugo and was very interested in the power and flexibility it had for maintaining a website. This led me to re-design the website with Hugo 0.13 and Bootstrap 3.3.2. It’s also completely hosted on AWS S3. The only negative I have so far is that I’ve lost my IPv6 presence.

Projects

Git repositories once hosted at http://linuxczar.net/git/ now live in my repositories at GitHub. At least, the still relevant ones.

StatsRelay

StatsRelay, my first real Go project, has been remarkably stable and efficient. With it I’m able to handle more than 350,000 packets/metrics per second to my Statsd service. In testing, I’ve been up toward 800,000 packets per second. I haven’t even rebuilt it with Go 1.4.

Whisper-Backup

How do you back up large Graphite clusters? I know folks that run a secondary cluster to mirror data to. That would have been incredibly expensive for me. So why not use OpenStack Swift or Amazon S3? You get compression, retention, high speed, locking, and other fine features, and the storage format allows for manual restores in an emergency. Check out Whisper-Backup.

Carbontools

Carbontools is just an idea and some bad code right now and probably not its final name either. The biggest problem I have with my Graphite cluster is manipulating data in a sane amount of time. The Python implementation of whisper-fill gets really slow when you need to operate on a few million WSP files.

  • Can I make a whisper-fill that’s an order of magnitude faster?
  • In a rebalance or expansion routine I want a near-atomic method of moving a WSP file. This would be faster and decrease the query strangeness that happens during those operations.
  • Perform basic metric manipulations: tar archives, deletes, restores, build the WebUI search index, etc…across large consistent hashing clusters.

In Go, of course.

Today I’m doing these with some Fabric tasks. I’ve far exceeded what Fabric can really do, and the Python/SSH/Python setup at my scale is quite slow.

Family Time

My wife and I expect a baby girl very soon. Very soon. Surely that will add exciting blog posts. Surely.


Working with a large and consistent hashing Graphite cluster I came across corrupt files. Corrupt files prevent carbon-cache.py from storing data to that specific metric database file. The backlog was starting to tank the cluster. I whipped out find and removed all zero-length files, as that is a common corruption case.

find /opt/graphite/storage/whisper -depth -name "*.wsp" -size 0c -type f -delete

However, I had a few more cases that were not zero-length files. A quick bit of Googling did not find much. Usually, reading the header of the WSP file is enough to have the Whisper code throw an exception, so using that I wrote Whisper-FSCK.

It will scan your tree of Whisper files and look for corrupted ones. With the optional -f argument it will move those files out of the way.
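The check itself is simple. A rough Python sketch of the idea, using a hypothetical find_corrupt() helper and letting whisper.info() validate each header, which is roughly what Whisper-FSCK does:

import os
import whisper

def find_corrupt(root):
    # Walk the tree and let whisper's own header parsing flag bad files.
    for dirpath, dirnames, filenames in os.walk(root):
        for name in filenames:
            if not name.endswith(".wsp"):
                continue
            path = os.path.join(dirpath, name)
            try:
                whisper.info(path)  # reads and validates the WSP header
            except Exception as e:
                yield path, e

for path, err in find_corrupt("/opt/graphite/storage/whisper"):
    print("Corrupt: %s (%s)" % (path, err))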

Pull requests welcome!


Introducing StatsRelay, a proxy daemon for Statsd style metrics written in Go.

What does it do?

StatsRelay is designed to help you scale out your ingestion of Statsd metrics. It is a simple proxy that you send your Statsd metrics to. It will then forward your metrics to a list of backend Statsd daemons. A consistent hashing function is used with each metric name to determine which of the Statsd backends will receive the metric. This ensures that only one Statsd backend daemon is responsible for a specific metric. This prevents Graphite or your upstream time series database from recording partial results.
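To illustrate the idea (this is not StatsRelay’s actual ring implementation), a minimal consistent hash ring in Python might look like this:

import bisect
import hashlib

class HashRing(object):
    """Each backend gets several points on a ring; a metric maps to the
    first point at or after its own hash, so adding or removing a backend
    only remaps a fraction of the metric names."""

    def __init__(self, backends, replicas=100):
        self.ring = []
        for backend in backends:
            for i in range(replicas):
                self.ring.append((self._hash("%s:%d" % (backend, i)), backend))
        self.ring.sort()
        self.points = [point for point, _ in self.ring]

    @staticmethod
    def _hash(key):
        return int(hashlib.md5(key.encode()).hexdigest()[:8], 16)

    def get(self, metric):
        idx = bisect.bisect(self.points, self._hash(metric)) % len(self.ring)
        return self.ring[idx][1]

ring = HashRing(["statsd-a:8125", "statsd-b:8125", "statsd-c:8125"])
print(ring.get("app.requests.count"))  # always the same backend for this name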

Why would you use it?

Do you have an application tier of multiple machines that send updates for the same metric into Statsd?

When you need to engineer a scalable Statsd ingestion service you need a way to balance between more than one Statsd daemon. StatsRelay provides that functionality. You can also use multiple StatsRelay daemons behind a UDP load balancer like LVS to further scale out your infrastructure.

StatsRelay is designed to be fast, which is the primary reason it is written in Go. The StatsRelay daemon has been benchmarked at handling 200,000 UDP packets per second. It batches the metrics it receives into larger UDP packets before sending them off to the Statsd backends. As string processing is faster than system calls, this further increases the number of metrics that each Statsd daemon is able to handle.
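A rough Python sketch of the batching idea, with a hypothetical send_batched() helper and a placeholder backend address, just to show the technique:

import socket

MAX_PACKET = 1432  # stay under a typical Ethernet MTU after IP/UDP headers

def send_batched(metrics, addr=("statsd-a.example.com", 8125)):
    # Pack as many newline-separated Statsd metrics as fit into one
    # datagram: one sendto() per batch instead of one per metric.
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    buf = b""
    for m in metrics:
        line = m.encode()
        if buf and len(buf) + 1 + len(line) > MAX_PACKET:
            sock.sendto(buf, addr)
            buf = b""
        buf = buf + b"\n" + line if buf else line
    if buf:
        sock.sendto(buf, addr)

send_batched(["app.requests:1|c", "app.latency:23|ms"])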

When shouldn’t you use StatsRelay?

In many cases you might want to run Statsd on each client machine and let it aggregate and report metrics to Graphite from that point. If each client only produces unique metric names this is the approach you should use. This doesn’t work, however, when you have multiple machines that need to increment the same counter, for example.

What’s wrong with Statsd?

Etsy’s Statsd tool is really quite excellent. It’s written in NodeJS which, event driven though it may be, is not what I would call fast. The daemon is a single process which only scales so far. Testing showed that the daemon would drop packets as it approached 40,000 packets per second, pegging the CPU core it ran on at 100%. I needed a solution for an order of magnitude more traffic.

But, Hey! Statsd comes with a proxy tool!

New versions of Etsy’s Statsd distribution do come with a NodeJS proxy implementation that does much the same thing. Similar to the Statsd daemon, the code, in single-process mode, would top out around 40,000 packets per second and 100% CPU. Testing showed that the underlying Statsd daemons were not getting all of that traffic either.

I checked back on this proxy after it had been developed further to find that it had a forkCount configuration parameter and what looked like a good start at a multi-process mode. I tested it again with my statsd load generator, which produced about 175,000 packets per second, well inside what I needed to support in production. Setting forkCount to 4, I found 4 processes, each consuming 200% CPU and 2G of memory. The code was still dropping packets.

At about 175,000 packets per second this Go implementation uses about 10M of memory and about 60% CPU. No packets lost.

Contributing

Fork the StatsRelay repository and submit a pull request on GitHub.

Things that need work:

  • Add health checking of the underlying Statsd daemons
  • Profile and tune for speed and packet throughput


Docker encourages its users to build containers that log to standard out or standard error. In fact, it’s now common practice to do so. Your process controller (uWSGI/Gunicorn) should combine the logs from all processes and do something useful with them. Like write them all to a file without your app having to worry about locking. Or, maybe even to Syslog.

Docker supports this practice and collects logs for us, in JSON, to add missing timestamps and to work well with LogStash. But the show-stopping issue for us is that these files grow boundlessly. You cannot use the logrotate utility with them because the Docker daemon will not re-open the file. Well, unless you stop/start the specific container. Docker logging issues are an ongoing topic and this is clearly an area where Docker will improve in the future.

There are two other widely accepted ways of working around this:

  • Bind mount in /dev/log and off load logs to the host’s Syslog
  • Mount a volume from the host or a different container where logs will be processed.

The second option is out. It has the same problem of not being able to easily tell the app to re-open files for log rotation without restarting the container.

Using /dev/log and off loading logs to the system’s log daemon sounds like a good idea. The Docker host can provide this service arbitrarily to all containers. Containers need not deal with (much) logging complexity inside them.

This approach has multiple problems.

Off loading logs to the host’s Syslog most likely means that you want to add some additional configuration to rsyslog, which requires a restart of the rsyslog daemon. (Say, you want to stick your logs in an app-specific file.) The first thing rsyslog does when it starts is (re-)create the /dev/log socket. At this point, any running Docker container that has already bind mounted /dev/log holds the old socket, not the newly created one. In any case, rsyslog is no longer listening to any of the currently running containers for logs. Full stop. This method doesn’t pass the smoke test.

What ended up working for me was using the network, but it added complexity to the Docker host. I’m managing Docker hosts with Ansible so this wasn’t a huge problem. I’d rather tune my Docker hosts than alter each image and container. I set the network range on the docker0 bridge interface to a specific and private IP range. Now, my Docker hosts always have a known IP address that my Docker containers can make connections to. In /etc/default/docker:

DOCKER_OPTS="--ip 127.0.0.1 --bip 172.30.0.1/16"

I configured rsyslog on the host to listen for UDP traffic and bind only to this private address:

$ModLoad imudp
# The listen address must be set before $UDPServerRun starts the listener
$UDPServerAddress 172.30.0.1
$UDPServerRun 514

I then built my image to run the process with its output piped to logger using the -n option to specify my syslog server. Guess what. No logs.

The util-linux in Ubuntu Trusty (and other releases) is 2.20, which dates from 2011-ish. The logger utility has known bugs. Specifically, the -n option is silently ignored unless you also specify a local UNIX socket to write to. This version of util-linux also lacks the nsenter command, which is very handy when working with Docker containers. (See here for nsenter.) This is a pretty big frustration.

The final solution was to make my incantations in my Dockerfiles slightly more complex for apps that do not directly support Syslog. But, it works:

CMD foobar-server --options 2>&1 \
    | logger -n 172.30.0.1 -u /dev/null -t foobar-server -p local0.notice

I promise I’m not logging to /dev/null.
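For apps that can speak Syslog natively, the logger workaround isn’t needed at all. A minimal Python sketch, assuming the 172.30.0.1 bridge address from above:

import logging
import logging.handlers

# Send logs straight to the host's rsyslog over UDP via the docker0 bridge.
handler = logging.handlers.SysLogHandler(
    address=("172.30.0.1", 514),
    facility=logging.handlers.SysLogHandler.LOG_LOCAL0,
)
handler.setFormatter(logging.Formatter("foobar-server: %(message)s"))

log = logging.getLogger("foobar-server")
log.addHandler(handler)
log.setLevel(logging.INFO)
log.info("startup complete")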


I’ve been thinking about and wanting to write about packages for a long time. DEBs. RPMs. Pip for Python. CPAN for Perl. Galaxy for Ansible. Registry and Docker. Puppet modules from Puppet Forge. Vagrant Boxes. Every technology comes with its own distribution format and tool it seems.

My recent transition from RHEL to Ubuntu has made one thing very clear. This mess of packages is intractable. No package format is aware of the others yet they usually have dependencies that interconnect different package types. Pip has no knowledge of C libraries required by many Python packages to build. We SAs usually end up crossing the streams to produce a working environment. Or we spend hours building packages of one specific type. (Only to spend even more time on them later.) The end result is often different package management systems stepping on each other and producing an unmaintainable and unreproducible system.

I’ve spent, probably, years of my career doing nothing but packaging. The advantages of packages are still just as relevant today as they were in the past. It’s a core skill set for running large infrastructures.

Recently, I’ve just about given up trying to deal with packages. Throw-away VMs. Isolation environments. Images. Advanced configuration management tools. Applications with conflicting requirements. Does maintaining a well managed server even matter any more?

I believe it does. A well managed host system keeps things simple and the SAs sane. However, I believe that there should be a line drawn in the sand to keep the OS – and tools that manage the OS – separate from the applications running on that machine or VM. On the OS side of the line, RPMs or DEBs rule. Configuration management has an iron fist. Your configuration management and automation should also deploy your application containers. But now we find the line in the sand.

Your applications, their crazy requirements, and whatever abominable package management scheme is needed to get the job done should live in Docker containers. Here, your configuration management is a git repo where you can easily rebuild your images. Here, we can use the tools we need that work the best for the situation at hand without causing harm to the host system or another application.

Perhaps Docker “packages” are, finally, the one packaging system to rule them all.

There’s just one thing that itches. I know Fedora outright bans it. Packaging libraries with your applications means that when OpenSSL has a security vulnerability, you have to patch your OS – and find everywhere else that library has been stuffed. Itch. Docker containers seem reasonable about this, but it still means rebuilding and restarting all containers. Itch.


The last several months have been a deep dive into Ansible. Deterministic. Simplistic. Ideally push based. Uses standard YAML. (I’ve never been much for inventing your own language.) Most of this work has been with Amazon’s EC2 service and Ansible’s included dynamic inventory plugin for EC2 has been indispensable.

However, there’s a lot more this inventory plugin could do. I’ve developed a series of patches, some of which I’ve submitted pull requests for. All of which can be had by cloning the master branch of my GitHub Ansible fork.

  • Do you have more than one AWS account? Need to manage machines in all of them? The multiple_aws_accounts branch teaches ec2.py to query multiple accounts given the proper IAM credentials.
  • Making groups from various tags and attributes in AWS is handy. But I wanted a way to just make arbitrary groups happen. The arbitrary_groups branch supports reading groups from the ansible_groups AWS tag on your EC2 instances.
  • Need additional information about the EBS stores mapped to your instances? The block_device_mapping branch exposes this as inventory variables from ec2.py.
