I’m considering swapping out Statsd with Bitly’s statsdaemon for
better performance. But, because Bitly’s version only accepts integer
data I wanted to analyze our Statsd traffic. I figured I’d use my friend
tcpdump to capture some traffic samples and replay them through a test
box for analysis.
# tcpdump -s0 -w /tmp/statsd.pcap udp port 9125
Wireshark confirmed that this was the traffic I was looking for. A spot
check suggests I have good integer data. Now, how do I dump out the traffic
data so I can at least run grep on it?
The Tcpreplay tools look very powerful. However, it can’t replay TCP
traffic at a server daemon because it cannot synchronize the SYN/ACK numbers
with the real client. But this is UDP traffic! UDP does provide checksums
for data integrity, though, and the checksum covers a pseudo-header that
includes the IP addresses. So after changing the IP and MAC addresses via
tcprewrite, I had packets that my Linux box dropped because the checksums
no longer matched.
Back to my friend Wireshark:
$ tshark -r /tmp/statsd.pcap -T fields -e data > data
This dumps out a newline-separated list of the data field of each packet,
which is exactly what I need. It is just hex-encoded binary data, so a few
lines of Python decode it:
import binascii
import sys

for s in open(sys.argv[1], "r").readlines():
    print binascii.unhexlify(s.strip().replace(":", ""))
Finally, I have a newline-separated list of the Statsd metrics from the pcap
data and can run grep!
The most difficult bit about running a Graphite cluster is handling queries and
graph rendering during a cluster rebalance, or after a partitioning event when
you use replication in your consistent hashing cluster. Suddenly, graphs under
report, have partial data, or might even be completely different when you
reload the graph. Generally, your Graphite cluster becomes useless until
sanity is restored.
I upgraded my Graphite setup in May to Graphite 0.9.13-ish. It’s very close to
the top of the 0.9.x branch of the Git repo. This has a bulk-fetch patch
that drastically speeds up queries and rendering. It also changes how the
webapp decides which metric TimeSeries to use if it gets more than one.
Getting more than one answer for a specific metric is what causes all the pain.
This is caused by duplicate Whisper files for the same metric that do not have
identical data in them. Exactly what happens during a rebalance. It also
happens with replication set higher than 1, but without an outage the Whisper
DBs are identical.
In these cases, instead of choosing the “most complete” TimeSeries to use
(which causes partial results or under reported results) why not merge
them together? Why hasn’t this been done before?
I patched the bulk-fetch CRDT query resolver to do just this. Now I
wonder if I can continue to scale Graphite into the petabytes without
having to replace the backend with a Cassandra or Riak database?
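The idea itself is small. Here is a minimal sketch of merging duplicate
TimeSeries at render time (illustrative only, not the actual graphite-web
patch; merge_series is a made-up name): wherever the preferred copy has a
None, take the point from another copy.

# Illustrative sketch only; not the actual graphite-web patch.
def merge_series(series_list):
    # All copies cover the same time window with the same step; prefer the
    # first non-None value at each slot.
    merged = list(series_list[0])
    for other in series_list[1:]:
        for i, value in enumerate(other):
            if merged[i] is None:
                merged[i] = value
    return merged

# Two partial copies of the same metric after a rebalance:
a = [1, None, 3, None]
b = [None, 2, None, 4]
print merge_series([a, b])   # [1, 2, 3, 4]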
This was written on request of one of my clients in June.
We use Graphite, Grafana, and Statsd extensively.
We monitor everything from basic server health and detailed MySQL debugging
information to web application transactions and statistics about the
communication of users’ devices. The number of metrics we generate has grown
exponentially – and sometimes beyond our capacity to plan for them.
In The Beginning
Early on we had a Graphite cluster that grew from an initial machine to a
handful. Many had different attached storage sizes and we used Graphite’s
relay-rules.conf to set up regular expressions matching specific groups of
metrics to a storage location. We grew the cluster by adding “sub-clusters” of
2 or more machines that would use Graphite’s consistent hashing to essentially
build a storage pool. With all the differences, even though we built the entire
system with Puppet, it was difficult to get right.
Our initial problems included bad entries in the CLUSTER_SERVERS setting in the
Graphite web application. Graphs took forever to render as the webapp had to
time out on non-existent Graphite servers. We constantly did battle with which
server would fill up next and how to further break down the metrics into more
groups and add storage and hardware.
Let’s not forget our UDP friend, Statsd. We were receiving upwards of 700,000
UDP packets per second at the one Statsd daemon on the main Graphite relay,
while the Node.js daemon only handled about 40,000 packets per second. The
other 640,000 metrics per second were being dropped. Testing further showed
that the Node.js code would drop a significant number of metrics as the rate
neared 40,000 packets per second. We knew that Statsd was under-reporting but
we didn’t know how badly until we broke out our network best friends:
Wireshark and tcpdump.
Other components of our systems work best on bare metal,
and our hosting provider excels at bare metal as a service. The load balancers
available were hardware solutions and charged by the connection – which just
wasn’t an option for Graphite and Statsd TCP and UDP streams. I also had to be
able to load balance Statsd UDP traffic. That ruled out HAProxy, which is one
of our primary load balancing tools.
However, I have worked with LVS/IPVS load balancing setups for years and I knew
it had the horsepower for the job, even on reasonably priced hardware. Of
course, if I send the same Statsd metric to multiple daemons via a simple
round-robin load balancing algorithm, all the daemons will in turn report that
metric to Graphite. With Graphite, when you send duplicate timestamps for the
same metric, the last write wins – which makes your Statsd data useless. Like
carbon-relay.py I needed a Statsd proxy that could run a consistent hash
algorithm on the metric and always direct it to the right Statsd daemon.
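For flavor, here is a rough Python sketch of that kind of consistent hash
ring (an illustration, not StatsRelay’s or carbon-relay.py’s exact code; the
backend names and replica count are made up). The point is that the same
metric name always lands on the same backend:

# Illustrative sketch of consistent hashing for Statsd metric routing.
import bisect
from hashlib import md5

class HashRing(object):
    def __init__(self, backends, replicas=100):
        self.ring = []          # sorted list of (position, backend)
        for b in backends:
            for i in range(replicas):
                bisect.insort(self.ring, (self.position("%s:%d" % (b, i)), b))

    def position(self, key):
        # 16-bit ring position, similar in spirit to carbon's 64k-slot ring
        return int(md5(key).hexdigest()[:4], 16)

    def get_backend(self, metric):
        i = bisect.bisect_left(self.ring, (self.position(metric),))
        return self.ring[i % len(self.ring)][1]

ring = HashRing(["statsd-1:8125", "statsd-2:8125", "statsd-3:8125"])
print ring.get_backend("deploy.app.requests")   # always the same backend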
I used a tiered load balancing approach to solve these problems. I load
balanced across proxies which, in turn, knew the right host to send data
toward by application-specific inspection. This worked out well because the
proxies could not handle the full traffic load on their own, and the end
points that stored the data turned out to have other issues limiting their
rate of ingestion.
Uh, what Statsd proxy? The Statsd team had begun to work on a proxy.js to do
just this. Great! However, testing showed the same Node.js limitations with a
maximum of 40,000 packets per second. Scaling to 200,000 packets per second
consumed 8 cores and 8G of RAM and continued to drop packets. A lot of packets.
This was not going to scale.
I have found Go to be a really fantastic language for dealing with operations
tasks. Type safety does wonders to keep bugs out of the code, it’s nearly as
fast as C, and it produces very efficient programs. Scaling isn’t just about
the horizontal or the vertical: you also need efficiency.
I created the StatsRelay project to fill this need. Written in Go, it is
benchmarked at consuming 250,000 UDP packets per second with no packet loss,
using 10MiB of RAM and about half a CPU core. To help make the underlying
Statsd daemons handle more metrics with less packet loss StatsRelay sends
multiple metrics per packet. String operations are always cheaper than
making system calls. Now this we can work with.
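A rough sketch of that batching idea (illustrative only, not StatsRelay’s
actual Go code; the Batcher name and the 1400-byte limit are assumptions):
buffer metric lines and send one datagram when the next line would push it
past a safe size.

# Illustrative sketch of batching several Statsd metrics per UDP datagram.
import socket

MAX_PACKET = 1400   # assumed safe size under a typical Ethernet MTU

class Batcher(object):
    def __init__(self, host, port):
        self.addr = (host, port)
        self.sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
        self.buf = []
        self.size = 0

    def add(self, metric):
        # +1 accounts for the newline joining metrics inside the datagram
        if self.size + len(metric) + 1 > MAX_PACKET:
            self.flush()
        self.buf.append(metric)
        self.size += len(metric) + 1

    def flush(self):
        if self.buf:
            self.sock.sendto("\n".join(self.buf), self.addr)
            self.buf = []
            self.size = 0

b = Batcher("127.0.0.1", 8125)
for i in range(1000):
    b.add("deploy.app.requests:1|c")
b.flush()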
This graph shows the processed Statsd metrics per second, flatlined at the
beginning. The green line is the rate handled by StatsRelay and shows that
we’ve scaled quite a bit from the capabilities of a single Statsd daemon.
It also helped that we shut down an unintentionally metric heavy source that
wasn’t providing a lot of value.
Graphite Migration #1
When looking at our historical storage consumption, I calculated that we would
continue to grow at about 1.5x that rate. With this information I built a cluster
using these techniques that I estimated would last 2 years. This was October
and November of 2014. Find that on the metric ingestion graph above for the
full humor value. I built some Fabric tasks to use a set of tools called
Carbonate to migrate the data over. Things went smoothly and successfully.
By the end of January the new “two-year” cluster was full. I had roughly
30 TiB of Whisper files.
Graphite Migration #2
There’s no problem that scaling horizontally won’t solve! (You know, the
complete anti-theme of this article.) We ordered more servers. Got them
configured and prepared for the cluster expansion. I tuned my Fabric tasks
to run the Carbonate tools with as much parallelism as I could. (There was
one lunch I remember coming back from to find one of our admin hosts nearly
dead from a test run of the Fabric task.) Tests worked in about 2 days.
Finally, I ran the full rebalancing.
The cluster rebalance took more than 7 days during which query results were
mostly useless. I knew query results would be “weird” until the rebalance
completed – but 7 days was much longer than my worst case time expectations.
My tests had purposely left the new nodes populated. This had the permissions
set correctly and was an attempt to lessen the amount of query weirdness. The
Python implementation of whisper-fill.py took much longer than simply
rsync’ing the data over. 2 to 3 seconds per metric times roughly 2 million
metrics in flight with 4 machines running in parallel is 11 to 17 days. Oops!
Available Whisper storage: 60TiB.
I was having other issues with my Fabric tasks. Wrapping everything in shell
and piping through SSH was starting to break down. The Python implementation
of the fill algorithm was way too slow. The local disk space requirements
to move around metrics were becoming painful. I needed an efficient client /
server API with a high degree of concurrency. Sounds like Go.
I began work on a new project called Buckytools. The server end is a REST
API that can lay down new Whisper files, retrieve Whisper files, run the
fill operation on an existing Whisper file with new data, list metrics on disk,
and be knowledgeable of the consistent hash ring. The client compares hash
rings to make sure they are identical, creates tar archives, restores tar
archives, finds metrics in the wrong locations, and rebalances the cluster.
Also, lots of other administrative tasks for working with a large consistent
hashing Graphite cluster.
There are two key pieces to this project. First, it implements the Graphite
consistent hashing algorithm in Go. Secondly, it also implements the fill
algorithm in Go. With good use of goroutines and Go’s amazing net/http
library, this turned out to be significantly faster than the previous tools.
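The fill operation is conceptually simple: copy points from the source only
where the destination has gaps, never overwriting data the destination
already holds. A rough Python sketch of the idea using whisper’s public
fetch/update calls (the real whisper-fill and Buckytools implementations
work archive by archive, so treat this as illustration only):

# Rough sketch of the fill idea using the whisper library's public API.
import whisper

def fill(src_path, dst_path, from_time, until_time):
    (start, end, step), src_values = whisper.fetch(src_path, from_time, until_time)
    _, dst_values = whisper.fetch(dst_path, from_time, until_time)

    missing = []
    for i, dst_value in enumerate(dst_values):
        # Only copy points the destination does not already have.
        if dst_value is None and src_values[i] is not None:
            missing.append((start + i * step, src_values[i]))

    if missing:
        whisper.update_many(dst_path, missing)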
Graphite Migration #3
It’s May, 2015. A new tier of application servers has pushed the cluster into
the red for storage space. Instead of waiting for a planned Summer overhaul,
I needed more space now. This was the moment I had been prepping for with
Buckytools. Similar to migration #2, I added another 4 storage nodes and
needed to rebalance about 2.3 million metrics. Also, learning from that
migration, I made sure to clear off any data left on the new nodes from my tests.
GOMAXPROCS=4 bucky rebalance -w 100
Speed. 2.3 million metrics in 21 hours. The balance process removed old
metrics on a successful heal resulting in much less query weirdness with
partial data. The migration was going wonderfully well…
Until carbon-cache.py daemons started crashing due to ever increasing
memory usage. Once I restarted them all I realized that much of the data
received during the 21 hour window had vanished. I examined metric after
metric to see large gaps with the occasional string of data points for this
time window. I was…let’s say I was upset.
Available Whisper storage: 88TiB.
You Wouldn’t Like Me When I’m Angry
This smelled familiar. We’ve encountered this bug before on the Graphite
cluster and chalked it up to a fluke. This time it manifested in every
carbon-cache.py daemon on the new data nodes. Future migrations would
come, so this bug had to go.
It turns out that whisper.py, which manages the Whisper files on disk, doesn’t
ensure that files will be closed. (Versions 0.9.12 and 0.9.13.) We also have
Graphite configured to lock files via a flock() call. When an exception
occurred in whisper.update_many() (called from carbon-cache.py’s
writer thread) the Whisper file would not be closed and in most cases the
Python garbage collector would come along and close the out of scope file
descriptor and unlock the file. However, with enough pressure on the
carbon-cache.py daemon I hit the condition where an exception occurred
but the file descriptor had not been garbage collected before
carbon-cache.py needed to write to that metric again. At this point the
daemon attempted to obtain an exclusive lock on the file and the writer
thread deadlocked. The thread had two open file descriptors to the same
Whisper file: the abandoned one still holding the lock, and a new one blocked
waiting on a lock that would never be released.
This is fixed in the master branch of the Whisper project and I’m working on
a backport to 0.9.x. After we’ve stress tested a new Graphite cluster setup
I’ll get this in a pull request.
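The shape of the fix is simple: guarantee the descriptor is closed, and
therefore the flock() released, no matter what the update does. A minimal
sketch of that shape (not the actual Whisper patch; safe_update_many is a
made-up name):

# Minimal sketch of the fix's shape; the real patch lives in the Whisper
# project.
import whisper

def safe_update_many(path, points):
    # The finally clause guarantees the descriptor, and its flock(), is
    # released even when file_update_many() raises.
    fh = open(path, 'r+b')
    try:
        return whisper.file_update_many(fh, points)
    finally:
        fh.close()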
Our Graphite cluster is growing very quickly and we know that adding 4 nodes at
a time every couple of months won’t scale for long. I’m planning to rebuild the
cluster to scale past the holiday shopping season which is our busiest time of
the year. So, besides just adding more machines, what are we looking at doing?
Statsd: Pull this off the Graphite storage nodes so it can be scaled
independently. We’re running almost 30 daemons to keep up with load and
due to consistent hash ring unevenness, and Node.js being what it is, there
is still packet loss. Use a more efficient implementation of Statsd. Bitly’s
Statsdaemon looks very promising.
Dedicated Query/Render machines: The user experience has different
scaling requirements than the rest of the cluster. Especially caching.
Evaluate carbon-c-relay: A much more efficient consistent hashing metric
router. Pull all metric routing off of storage nodes. This allows me
to not have storage nodes on the LVS/IPVS subnets since I’m going to have
more nodes than IPs there.
Alternate consistent hashing algorithms: A feature of carbon-c-relay. I’ve
noticed we have quite a few collisions in the hash ring, since Graphite’s
consistent hash ring implementation only has 64k slots.
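A quick back-of-the-envelope check shows why. Using the same 16-bit ring
position idea sketched earlier, and assuming something like 100 replica
points per node (the node names here are made up), even a modest cluster
produces colliding positions:

# Back-of-the-envelope check of ring position collisions on a 64k-slot ring.
from hashlib import md5

def ring_position(key):
    return int(md5(key).hexdigest()[:4], 16)   # only 65536 possible slots

nodes = ["graphite-%02d" % i for i in range(8)]
positions = [ring_position("%s:%d" % (n, i)) for n in nodes for i in range(100)]
print len(positions) - len(set(positions)), "colliding ring positions"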
Those are some of our thoughts for the immediate future. There is lots of room
to grow and scale and more challenges and bugs to come. At some point we will
need to consider alternate technologies for storing time series data and how
to port all of our existing line and pickle protocol based tools. Currently,
the Graphite ecosystem is very popular here and gives us much value. We also
think that, with our data patterns, Whisper is a fairly efficient data store
for us. We expect to be scaling Graphite into the future, and writing a few
more articles about it.
The Graphite cluster I’m working on with Bruce grows almost faster than I can
keep up. One of the tools I’ve been transitioning to is carbon-c-relay
to do consistent hashing and full metric routing in C. It works very, very
well. However, most of the data Bruce tosses at me utilizes Graphite’s Pickle
protocol, which carbon-c-relay doesn’t support. I needed an efficient and
very fast daemon that could decode Graphite’s Pickle protocol and pass it
along as the plain text protocol.
A new tool in the Buckytools project has been born! bucky-pickle-relay
listens for Graphite’s Pickle protocol, decodes it using the og-rek
Go library and retransmits it to a target relay.
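For context, the pickle protocol is a stream of length-prefixed batches: a
4-byte big-endian length, then a pickled list of (metric, (timestamp, value))
tuples. A rough Python sketch of turning one batch back into the plain text
line protocol (bucky-pickle-relay does this job in Go with og-rek; the
function below is only an illustration):

# Rough sketch of decoding one Graphite pickle-protocol batch and
# re-emitting it as the plain text line protocol.
import pickle
import struct

def pickle_batch_to_lines(stream):
    # 4-byte big-endian length header, then the pickled payload.
    (length,) = struct.unpack("!L", stream.read(4))
    batch = pickle.loads(stream.read(length))
    lines = []
    for metric, (timestamp, value) in batch:
        lines.append("%s %s %d" % (metric, value, timestamp))
    return "\n".join(lines) + "\n"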
An Upstart configuration for it:
description "bucky-pickle-relay accepts Graphite pickle data and outputs text data."
author "Jack Neely"
start on started carbon-c-relay
stop on stopping carbon-c-relay
limit nofile 32768 32768
exec /usr/bin/bucky-pickle-relay -b 0.0.0.0:2004 localhost:2003
Comments and patches (especially patches) welcome.
I found a StackOverflow post that helped save my day. I’ve been dealing
with a bug in Graphite’s carbon-cache.py daemon (version 0.9.12). It hit
me badly during a cluster expansion and I know I have hit it before. I
had to find it.
This StackOverflow post gave me the magic I needed to signal the process
to dump out a Python traceback of all running threads. That was the catch:
dumping all threads.
import signal, sys, threading, traceback

def dumpstacks(sig, frame):
    id2name = dict([(th.ident, th.name) for th in threading.enumerate()])
    code = []
    for threadId, stack in sys._current_frames().items():
        code.append("\n# Thread: %s(%d)" % (id2name.get(threadId, ""), threadId))
        for filename, lineno, name, line in traceback.extract_stack(stack):
            code.append('File: "%s", line %d, in %s' % (filename, lineno, name))
            code.append("  %s" % (line.strip()))
    print "\n".join(code)

signal.signal(signal.SIGQUIT, dumpstacks)
I changed the print statement to output to Graphite’s log files.
What I knew and what I had:
My bug happened after the writer thread hit an unexpected exception
in the whisper.update_many() function.
I had figured it crashed or somehow hung up the entire writer thread.
The number of data points in the cache skyrocketed when the bug
manifested and all writes to WSP files stopped.
Tracing through the code I knew that unexpected exceptions in
whisper.file_update_many() bubbled up past whisper.update_many()
which skipped closing of that file.
I saw via lsof that the process would hold open the bad Whisper file.
This happened under pretty heavy load. About 250,000 metrics per second.
It took a while but I figured out how to reproduce the conditions and hit
the bug. I had some extra logging in writer.py to help figure out what
was going on. But I knew I needed a stack trace of that particular
thread when this happened to nail down exactly what was really going on.
So, I reset my test and added the above code to writer.py. A few minutes
later I had triggered the bug again.
# kill -sigquit 3460
Suddenly, I had a traceback in carbon-cache-a/console.log.
# Thread: PoolThread-twisted.internet.reactor-1(140363856660224)
File: "/usr/lib/python2.7/threading.py", line 524, in __bootstrap
File: "/usr/lib/python2.7/threading.py", line 551, in __bootstrap_inner
File: "/usr/lib/python2.7/threading.py", line 504, in run
File: "/usr/local/lib/python2.7/dist-packages/twisted/python/threadpool.py", line 191, in _worker
result = context.call(ctx, function, *args, **kwargs)
File: "/usr/local/lib/python2.7/dist-packages/twisted/python/context.py", line 118, in callWithContext
return self.currentContext().callWithContext(ctx, func, *args, **kw)
File: "/usr/local/lib/python2.7/dist-packages/twisted/python/context.py", line 81, in callWithContext
File: "/opt/graphite/lib/carbon/writer.py", line 168, in writeForever
File: "/opt/graphite/lib/carbon/writer.py", line 145, in writeCachedDataPoints
File: "/opt/graphite/lib/whisper.py", line 577, in update_many
return file_update_many(fh, points)
File: "/opt/graphite/lib/whisper.py", line 582, in file_update_many
fcntl.flock( fh.fileno(), fcntl.LOCK_EX )
Also, a graph to further show that I had hit the bug I was looking for.
Finally, I knew what was happening:
I have locking enabled for safety.
On error, the Whisper file wasn’t closed.
The Python GC never got around to closing the file, which means it was
still locked.
The metric came around in the cache again to update, and the writer thread
deadlocked trying to take the lock a second time.
After a bit of Graphite maintenance I had several complaints about graphs
being “weird.” Upon inspection they had unusual spikes, data gaps, and
negative results where negative numbers made no sense. “L”-shaped spikes
describe them best, I think.
This turned out to be a counter metric, like the number of packets
your NIC has received. So, a look at the raw data showed the following:
This is the same data but I’ve removed the nonNegativeDerivative()
function. To everyone’s surprise, this counter occasionally decreases! This
lead us to a problem with the client reporting the metrics.
Let’s review what nonNegativeDerivative() does:
Same as the derivative function above, but ignores datapoints that trend down.
Useful for counters that increase for a long time, then wrap or reset. (Such as
if a network interface is destroyed and recreated by unloading and re-loading a
kernel module, common with USB / WiFi cards.)
So data that trended down was removed from the top graph making it look like
it had missing data points. Of course, the bogus data was only amplified
by the derivative function.
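To make that concrete, here is a simplified sketch of what derivative() and
nonNegativeDerivative() do to a counter that misbehaves and decreases (the
real functions live in graphite-web and also handle counter wrapping via
maxValue, which is ignored here):

# Simplified sketch of Graphite's derivative functions on a raw counter.
def derivative(series):
    out, prev = [], None
    for value in series:
        out.append(None if prev is None else value - prev)
        prev = value
    return out

def non_negative_derivative(series):
    # Negative deltas (the counter went "backwards") become None, i.e. gaps.
    return [d if d is not None and d >= 0 else None for d in derivative(series)]

counter = [100, 150, 210, 40, 90, 160]   # bogus drop from 210 to 40
print derivative(counter)               # [None, 50, 60, -170, 50, 70]
print non_negative_derivative(counter)  # [None, 50, 60, None, 50, 70]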
So, due to the fact we had bogus data coming in, our normal method of showing
the rate of change here produced very unusual behavior. This is, however,
not a fault or error in Graphite. The real question I ended up being left
with is how a restart of the load balancer for the Graphite cluster caused
a client or two to misbehave.
One of my favorite services to run is NTP. The math that makes it work is
elegant, graphing NTP’s performance produces beautiful graphs, and NTP is
usually a low maintenance service. Most importantly, accurate time
synchronization is crucial to everyday IT functions. This should be something
in every Operations Engineer’s tool bag.
Yet, it’s always difficult – socially – to make changes to an NTP
infrastructure. Every client I’ve worked with has been hesitant to
allow changes to their NTP configuration. Many assume that NTP is
“simple” and “working,” so why should it be changed?
Don’t assume that NTP is “simple” and ignore it. Like everything else, one
needs a good understanding of how it works to have accurate time
synchronization. Here are some tips for running an NTP infrastructure
that maintains accurate synchronization and works on all of your machines
– including those often drifting virtual machines. At the bottom
you’ll find an example NTP configuration.
UTC Is Your Friend
The first tip for setting up your infrastructure to have reasonably
accurate time is to set your BIOS clock to UTC. UTC doesn’t have
Daylight Saving Time or other weird time changes. Do not keep your BIOS
clock in local time.
Hierarchy of NTP
NTP servers build a hierarchy and the level you are on in that hierarchy is
called your stratum. The smaller the stratum, the more accurate your
synchronization should be. However, there are many thousands (or more) of
these hierarchies. You can use more than one (and you should). They may
overlap upstream from your servers. This is used to create a large amount of
redundancy. But, it can also cause hidden single points of failure. Do some
work to identify your upstream sources.
For one or a handful of servers, VMs, or workstations directly connected to the
Internet, using NTP sources from a pool is best. Your distro most likely
has a good starting point for your NTP configuration in /etc/ntp.conf.
For more machines or machines on a backend network you want to setup your own
bit of NTP hierarchy. Normal machines will synchronize to your internal NTP
servers. You’ll save bandwidth and not abuse upstream resources.
3 to 5 upstream NTP sources provides optimum synchronization and protection
against failure scenarios. If you are using pool servers, be sure to use a pool
whose members are geographically close to your machines. The vendor pool your
Linux distro comes with is not geographically close to you. For example, I
choose the pool zone for the country or region where my machines actually live.
When building your internal infrastructure, your machines should sync with 3
(perhaps more in some situations) of your own internal NTP servers. Your
internal NTP servers should sync with 5 upstream sources. Depending on your
resources, one or more of those may be a GPS or atomic time source that your
place of business can acquire and install. Remember,
you need at least 3 sources and buying and installing 3 time sources in
different data centers (use different brands and sources – not all identical
GPS sources) does get expensive. A really good compromise is to buy one time
source and use 4 other time upstreams on different Internet networks.
Never, ever use only 2 upstream sources. Let’s look at the failure conditions.
Remember, the most common failure condition is that an NTP source is sending
the wrong time. It’s easy to assume that the most common failure situation is
a non-responding upstream – that’s probably the second most common.

1 upstream: A single point of failure. You are guaranteeing that your
NTP infrastructure will just simply have the wrong time.

2 upstreams: Worst case. The NTP algorithm (or humans for that matter) cannot
look at two time sources that differ and reliably choose the correct one.
It’s a 50% guess.

3 upstreams: The minimal reliable configuration.

4 or more upstreams: Can tolerate the loss of an NTP server and still have
sufficient data to detect falsetickers.
Your business’s NTP servers should use stratum 1 or 2 servers from different
networks. You should include your ISP’s NTP server if available to better
withstand network outages. If you have your own reference source then, of
course, you should have that in your NTP server’s configuration. If you
do have your own stratum 1 source you might set your NTP servers to prefer
that source. Do not let ordinary machines sync directly from your stratum 1
source.
Like any service, you need to monitor your NTP servers for health. Pool
members change and NTP servers become overwhelmed. You may need to periodically
re-evaluate your NTP servers’ upstream sources.
Your NTP clients and servers should have the drift file configured. This
records the average drift of your machine’s internal clock compared to
the upstream time sources. It is used if your machine cannot reach any
NTP servers. NTP also uses this when the daemon first starts. This does
help in the failure condition of no reachable NTP servers. Hopefully,
a short lived failure condition.
A common issue I see is that the specified directory is missing or the NTP
daemon does not have permission to write here. The NTP user should own
this directory. Your configuration management system of choice should enforce this.
NTP in VMs, Laptops, and Other Time Stealing Tech
There is a lot of misinformation about how to keep your VMs synchronized,
a lot of confusion, and a lot of drifting VMs. Of course, VMs will never
be a quality time source – that’s not our goal. Our goal is to reduce
the amount that the VM’s internal clock is stepped. (Or completely reset
due to a large time difference.) We want our adjustments to slew the
clock – this makes a specific second take slightly longer or shorter.
Stepping the clock can adversely affect some applications. But in any case
we want our VMs to continuously move toward synchronization and not be outside
500 to 1000 milliseconds of sync.
I use NTP on my VMs. VMWare recommends it. Amazon EC2 (Xen HVM) recommends
it. Vendors that say they can sync your VM for you are fewer and fewer.
Actually, I use the same NTP configuration on all of my servers and keep them
identical in this case. With any time-stealing technology you need to instruct
your NTP daemon not to panic when it discovers large time differences. NTP has
a “safety feature” that is what causes so much pain with keeping VMs in sync
and I turn that off.
So, a basic NTP configuration that I might use on a machine not part of a
larger infrastructure is below. This will work on VMs. This trusts
the time sources (which has its own article’s worth of ramifications).
# General options
tinker panic 0

## Make sure this directory is owned by NTP
driftfile /var/lib/ntp/ntp.drift

# NTP Server Infrastructure
# (3 to 5 sources; public pool servers shown as a typical example)
server 0.pool.ntp.org iburst
server 1.pool.ntp.org iburst
server 2.pool.ntp.org iburst
server 3.pool.ntp.org iburst

# Access restrictions for this machine
restrict -4 default kod notrap nomodify nopeer noquery
restrict -6 default kod notrap nomodify nopeer noquery
A Final Note
If you got here, you should really be reading the NTP documentation:
Guilty as charged. I enjoy changing my websites and playing with different
technologies more than writing actual content. Things have been very
busy, and will be busier yet. In brief, here are some things that need
their very own write-up.
Website, Powered By Hugo
I love Python’s reStructuredText markup language, which is what I used for
my Pelican based website. I, however, was less enthused when none of the
themes had any support for reStructuredText’s more “advanced” features. Or
anything beyond what Markdown can do. Nor did I want to dig into the Sass
to do more in depth work on the theme.
The last 9 months or so I’ve been very enthralled by Go. Simplicity and
efficiency make it a winning choice when working at larger scales. I also
encountered Hugo and was very interested in the power and flexibility it had
for maintaining a website. This led me to re-design the website with Hugo
0.13 and Bootstrap 3.3.2. It’s also completely hosted on AWS S3. The
only negative I have so far is that I’ve lost my IPv6 presence.
Git repositories once hosted at http://linuxczar.net/git/ now live
in my repositories at GitHub. At least, the still relevant ones.
StatsRelay, my first real Go project, has been remarkably stable and
efficient. With it I’m able to handle more than 350,000 packets/metrics
per second to my Statsd service. In testing, I’ve been up toward 800,000
packets per second. I haven’t even rebuilt it with Go 1.4.
How do you back up large Graphite clusters? I know folks that run a secondary
cluster to mirror data to. That would have been incredibly expensive for me.
So why not use OpenStack Swift or Amazon S3? Compression, retention, high
speed, locking, and other fine features. Storage format allows for manual
restores in an emergency. Check out Whisper-Backup.
Carbontools is just an idea and some bad code right now and probably not its
final name either. The biggest problem I have with my Graphite cluster is
manipulating data in a sane amount of time. The Python implementation of
whisper-fill gets really slow when you need to operate on a few million WSP
files. Can I make a whisper-fill that’s an order of magnitude faster?
In a rebalance or expansion routine I want a near-atomic method of
moving a WSP file. Faster, and decrease query-strangeness that happens
in those operations.
Perform basic metric manipulations: tar archives, deletes, restores,
build the WebUI search index, etc…across large consistent hashing
clusters. In Go, of course.
Today I’m doing these with some Fabric tasks. I’ve far exceeded what
Fabric can really do, and the Python/SSH/Python setup at my scale is breaking down.
My wife and I expect a baby girl very soon. Very soon. Surely that will
add exacting blog posts. Surely.
Working with a large consistent hashing Graphite cluster, I came across
corrupt files. Corrupt files prevent carbon-cache.py from storing data
to that specific metric database file. The backlog was starting to tank
the cluster. I whipped out find and removed all zero-length files, as
that is a common corruption case.
find /opt/graphite/storage/whisper -depth -name '*.wsp' -size 0c -type f -delete
However, I had a few more cases that were not zero-length files. A quick
bit of Google’ing did not find much. Usually, reading the header
of the WSP file is enough to have the Whisper code throw an exception, so
using that I wrote Whisper-FSCK.
It will scan your tree of Whisper files and look for corrupted ones. With
the optional -f argument it will move those files out of the way.
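The core of that check is tiny. A minimal sketch of the approach (not the
actual Whisper-FSCK code; find_corrupt is a made-up name): walk the tree, try
to parse each file’s header with the whisper library, and anything that
raises is a candidate for quarantine.

# Minimal sketch of finding corrupt Whisper files by reading their headers.
import os
import whisper

def find_corrupt(root):
    corrupt = []
    for dirpath, dirnames, filenames in os.walk(root):
        for name in filenames:
            if not name.endswith(".wsp"):
                continue
            path = os.path.join(dirpath, name)
            try:
                whisper.info(path)   # parses the header; raises on corruption
            except Exception:
                corrupt.append(path)
    return corrupt

for path in find_corrupt("/opt/graphite/storage/whisper"):
    print path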
StatsRelay is designed to help you scale out your ingestion of Statsd
metrics. It is a simple proxy that you send your Statsd metrics to. It
will then forward your metrics to a list of backend Statsd daemons. A
consistent hashing function is used with each metric name to determine
which of the Statsd backends will receive the metric. This ensures that
only one Statsd backend daemon is responsible for a specific metric.
This prevents Graphite or your upstream time series database from
recording partial results.
Why would you use it?
Do you have an application tier of multiple machines that send updates
for the same metric into Statsd?
When you need to engineer a scalable Statsd ingestion service you need a
way to balance between more than one Statsd daemon. StatsRelay provides
that functionality. You can also use multiple StatsRelay daemons behind
a UDP load balancer like LVS to further scale out your infrastructure.
StatsRelay is designed to be fast, which is the primary reason it is
written in Go. The StatsRelay daemon has been benchmarked at handling
200,000 UDP packets per second. It batches the metrics it receives into
larger UDP packets before sending them off to the Statsd backends. As
the string processing is faster than system calls, this further
increases the number of metrics that each Statsd daemon is able to process.
When shouldn’t you use StatsRelay?
In many cases you might want to run Statsd on each client machine and
let it aggregate and report metrics to Graphite from that point. If each
client only produces unique metrics names this is the approach you
should use. This doesn’t work, however, when you have multiple machines
that need to increment the same counter, for example.
What’s wrong with Statsd?
Etsy’s Statsd tool is really quite excellent. It’s written in NodeJS
which, event driven though it may be, is not what I would call fast. The daemon
is a single process which only scales so far. Testing showed that the
daemon would drop packets as it approached 40,000 packets per second as
it would peg the CPU core it ran on at 100%. I needed a solution for an
order of magnitude more traffic.
But, Hey! Statsd comes with a proxy tool!
New versions of Etsy’s Statsd distribution do come with a NodeJS proxy
implementation that does much the same thing. Similar to the Statsd
daemon the code, in single process mode, would top out around 40,000
packets per second and 100% CPU. Testing showed that the underlying
Statsd daemons were not getting all of that traffic either.
I checked back on this proxy after it had been developed further to find
that it had a forkCount configuration parameter and what looked like a
good start at a multi-process mode. I tested it again with my statsd
load generator which produced about 175,000 packets per second, which
was well inside the packets per second I needed to support in
production. Setting the forkCount to 4 I found 4 processes each
consuming 200% CPU and 2G of memory each. The code was still dropping packets.
At about 175,000 packets per second this Go implementation uses about
10M of memory and about 60% CPU. No packets lost.
Fork the StatsRelay repository
and submit a pull request on GitHub.
Things that need work:
Add health checking of the underlying Statsd daemons