LinuxCzar

Go, Python, Linux, and Operations. The website of Jack Neely.

I’ve been researching quite a few algorithms for my client, Bruce, as I continue to scale out certain systems. I thought that getting them on my blog would be useful for a future version of myself and many others. I suspect, and hope, that most folks who work in Systems Administration / Operations will find these at least familiar.

Flap Detection

Flap detection is the ability to detect rapid state changes so that one can take corrective action. The textbook example is BGP route flapping, where a new route is advertised and then withdrawn (a flap) multiple times in a short period. In general, flap detection “scores” an event, with higher scores the more frequently the event occurs.

Warning: Math

For each event, a “penalty” $P$, a “timestamp” $t$, and a boolean “isFlapping” variable are stored. $P$ is the current “score” for this event and is $0$ for an event that has not happened. $t$ is the timestamp, usually Unix Epoch time, of the last event.

Each time an event occurs we add a fixed penalty value to the current penalty, and the accumulated penalty decays exponentially over time. A suppress limit defines when the event is flapping: its penalty has risen above the suppress limit. A lower reuse limit defines when the event has recovered: it is no longer considered flapping once the penalty has decayed below the reuse limit.

Summary

Penalty ($P$): An event’s current score.
Half-Life ($h$): How much time it takes for half of the penalty to decay.
Timestamp ($t$): Unix Epoch time of the last event.
Suppress Limit: Penalty > Suppress Limit == Flapping
Reuse Limit: Penalty < Reuse Limit && Flapping == Recovery

Calculating $P$

$$P(t_2) = P(t_1) \times e^{-(t_2 - t_1) \times \ln(2) / h}$$
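To make the bookkeeping concrete, here is a minimal sketch in Go. The type and field names are my own illustration, not taken from any particular tool; it simply applies the decay formula above on each event and compares the result to the suppress and reuse limits.

package flap

import (
    "math"
    "time"
)

// FlapDetector tracks one event stream.  The names and constants here
// are illustrative, not from any specific implementation.
type FlapDetector struct {
    Penalty    float64   // current score P
    LastEvent  time.Time // t, the time of the last event
    IsFlapping bool

    HalfLife      time.Duration // h: time for half of the penalty to decay
    AddPenalty    float64       // penalty added for each event
    SuppressLimit float64       // above this the event is flapping
    ReuseLimit    float64       // below this a flapping event recovers
}

// decay applies P(t2) = P(t1) * e^(-(t2-t1) * ln(2) / h).
func (f *FlapDetector) decay(now time.Time) {
    if f.LastEvent.IsZero() {
        return
    }
    dt := now.Sub(f.LastEvent).Seconds()
    f.Penalty *= math.Exp(-dt * math.Ln2 / f.HalfLife.Seconds())
}

// Event records an occurrence at time now and reports whether the
// event is currently considered to be flapping.
func (f *FlapDetector) Event(now time.Time) bool {
    f.decay(now)
    f.Penalty += f.AddPenalty
    f.LastEvent = now

    switch {
    case f.Penalty > f.SuppressLimit:
        f.IsFlapping = true
    case f.Penalty < f.ReuseLimit:
        f.IsFlapping = false
    }
    return f.IsFlapping
}

In the gnuplot example below, the half-life is 60 seconds, the reuse limit is 10, and the suppress limit is 25.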

In Pictures

Notes

f(x) = 2.71828 ** (-x*log(2) / 60)
set ylabel "Penalty"
set xlabel "Time"
set key off
set terminal png size 800,600
set output 'output.png'
set label "Reuse Limit" at 40,11
set label "Suppress Limit" at 100,26
plot [0:240] x < 10 ? f(x) : x < 20 ? f(x-10)*10 : x < 30 ? f(x-20)*20 : f(x-30)*30, 10 with dots, 25 with dots


This is an update of an old post, so it’s back at the top of the blog. The original posting was 2015-08-19.

I’m considering swapping out Statsd with Bitly’s statsdaemon for better performance. But, because Bitly’s version only accepts integer data, I wanted to analyze our Statsd traffic. I figured I’d use my friend tcpdump to capture some traffic samples and replay them through a test box for analysis. Also, figuring out what our hot metrics are is very handy.

# tcpdump -s0 -w /tmp/statsd.pcap udp port 9125

Wireshark confirmed that this was the traffic I was looking for, and a spot check showed good integer data. But how do I dump out the traffic so I can at least run grep and other common Unix tools on the text data?

The Tcpreplay tools look very powerful. However, they can’t replay TCP traffic at a server daemon because they cannot synchronize the SYN/ACK numbers with the real client. But this is UDP traffic! UDP does provide checksums for data integrity, though, so after changing the IP and MAC addresses via tcprewrite I had packets that my Linux box dropped because the checksums didn’t match.

Back to my friend Wireshark:

$ tshark -r /tmp/statsd.pcap -T fields -e data > data

This dumps a newline-separated list of each packet’s data field, which is exactly what I need. It’s just hex-encoded binary data rather than text.

import binascii
import sys

# Decode tshark's hex-encoded data fields back into text, one Statsd
# packet payload per line.
for s in open(sys.argv[1], "r").readlines():
    print binascii.unhexlify(s.strip())

Now I have a newline-separated list of the Statsd metrics from the pcap data and can finally run grep!

$ python unhex.py data | gawk -F: '/.+/ { print $1 }' | sort | uniq -c | sort -n

Now I also have a frequency distribution chart of the packet capture showing me what the most common metrics are.


Life is busy around the holidays with much family and new family additions. If there is one thing I wish I could remind the world it is that Christmas does last 12 days from Christmas Day until January 6th. So, Merry Christmas and a Happy 2016 to all! Here are some updates in no particular order.

Google Jump Hash Support in Buckytools

It looks like my Graphite cluster will be used for long term storage as we migrate toward Prometheus – which means I get no relief in my ever-scaling cluster. So, I’m adding Google’s Jump Hash to Buckytools. With this hashing algorithm you cannot remove a node in the middle of the hash ring without affecting all nodes after it in the ring, so full support of Graphite’s replication factor will find its way into Buckytools as well. If I’ve not merged yet, take a look at the jump branch. The plan here is to be directly compatible with the FNV1a-Jump hashing now implemented in carbon-c-relay.
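For reference, the jump hash itself is tiny. Below is a sketch of the algorithm from Lamping and Veach’s paper, keyed with FNV-1a in the spirit of carbon-c-relay; treat the keying details as an approximation rather than the exact Buckytools or carbon-c-relay code.

package hashing

import "hash/fnv"

// JumpHash maps a 64-bit key onto one of numBuckets buckets using the
// algorithm from Lamping and Veach's "A Fast, Minimal Memory,
// Consistent Hash Algorithm".
func JumpHash(key uint64, numBuckets int) int32 {
    var b int64 = -1
    var j int64
    for j < int64(numBuckets) {
        b = j
        key = key*2862933555777941757 + 1
        j = int64(float64(b+1) * (float64(int64(1)<<31) / float64((key>>33)+1)))
    }
    return int32(b)
}

// BucketFor hashes a metric name with FNV-1a and runs the result
// through JumpHash.  The exact keying carbon-c-relay uses may differ;
// this only shows the shape of it.
func BucketFor(metric string, numBuckets int) int32 {
    h := fnv.New64a()
    h.Write([]byte(metric))
    return JumpHash(h.Sum64(), numBuckets)
}

The trade-off is exactly the one mentioned above: adding a bucket moves only a small fraction of keys, but buckets are numbered 0..N-1, so the ring can only shrink from the end.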

New Graphite Storage Backend

To support my Graphite work I’m moving toward a new storage backend that uses a Ceres-like approach: one or more files per metric in a columnar data format, so the files grow with the number of data points stored. Implementing this in Go will give incredible performance improvements, and the new file format will give a marked reduction in storage space used. Some code for this is in my new Journal project.
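As a rough picture of what “grows with the data” means (a sketch of the idea only, not code from the Journal project), each metric file can be little more than an append-only series of fixed-width timestamp/value records:

package journal

import (
    "encoding/binary"
    "os"
)

// Point is one timestamp/value pair.  A real format would also carry a
// header with interval, aggregation, and compression details.
type Point struct {
    Timestamp int64
    Value     float64
}

// Append adds one point to the end of a per-metric file.  The file
// grows by only the 16 bytes each record needs, unlike Whisper's fully
// pre-allocated archives.
func Append(path string, p Point) error {
    fh, err := os.OpenFile(path, os.O_CREATE|os.O_WRONLY|os.O_APPEND, 0644)
    if err != nil {
        return err
    }
    defer fh.Close()
    return binary.Write(fh, binary.BigEndian, &p)
}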

Also key to this project is being able to regularly compress older data. There are some interesting choices here that ought to help a lot with my storage requirements, but they make the archives not randomly accessible. Doing this only for data past a certain age probably makes sense.

This is probably the most challenging aspect of my plans for better Graphite scaling. Will I get the time? Will a better solution present itself? Will an entirely different method of handling telemetry come along and get in the way of my plans to use Prometheus for our operational telemetry?

Graphite Metric Generator

This is a quick and dirty scaling test tool for Graphite. It will generate random metrics with random data, or just random data for a known list of metric names. The gentestmetrics command is part of Buckytools.

Usage of ./gentestmetrics:
Copyright 2015 42 Lines, Inc.
Original Author: Jack Neely

  -c int
        Number of random metrics to generate. (default 1000)
  -f string
        Use metric names found in this file.  Overrides -c.
  -i int
        Seconds between each timestamp/value pair per metric. (default 60)
  -l int
        Upper limit to randomly generated integer values. (default 100)
  -p string
        Prefix prepended to each metric name. (default "test.")
  -s int
        Time stamp to start -- iterations go BACKWARDS from here (default now)
  -v int
        Use the given value rather than a random integer.
  -x int
        How many timestamp/value pairs for each generated metric name. (default 1)

Metrics are printed to STDOUT. The most interesting usage is generating data for a list of known metrics delimited by newlines in a text file. Say with a 1 minute interval and 5 data points each:

./gentestmetrics -f metrics.txt -i 60 -x 5

The generated metrics are output in chronological order. However, the timestamp given via -s dictates the timestamp of the last data point rather than the first.

Is There Anyone Not Podcasting?

A few friends and I started the Practical Operations Podcast. We talk about the practical side of operations work in small to large scale environments and the not so obvious implications of various tools and how you might use them in real world conditions. Check us out!


I’ve been thinking about the “future” and how I can move my metrics system from what I have now to what I’d like to have. I run a large Graphite cluster. In the 26 million metrics per minute range with a quarter petabyte of provisioned storage. I integrate with a Naemon (Nagios fork) and Merlin setup for alerting.

I’ve been following Prometheus for a year and wondering about what the future might be like. Turns out, my fellow Operations team members and the Developers are also highly interested in Prometheus or a tool that offers Prometheus-like features. Specifically:

  • Support ephemeral hosts: Be smarter about how metrics are managed so that each host adds metric data without polluting the namespace with thousands of host entries.
  • Scale storage: No more Whisper files, storage needs to scale based on the timestamp/value pairs we store rather than a pre-allocated chunk of disk space.
  • Scale to a multi-data center environment: Graphite isn’t designed to make multiple clusters in different data centers or regions work well together, although modern versions of Grafana can really help there. Prometheus handles this style of sharding natively.
  • Ability to tag or label metrics: This makes ephemeral hosts work well combined with storage allocated as needed (rather than allocating all possible storage at once).
  • Support advanced metric based alerting: A strength of Prometheus and we can funnel through our Nagios-based monitoring to deal with pager groups etc.

So, how does one get from a monolithic Graphite setup to something like the above? A good question that I’m still trying to work out. Here’s what I’m thinking.

Global Systems:

  • Keep our Nagios based alerting system. It routes alerts, handles paging time periods, and, most importantly, handles alerts from many different sources via checks. Uses PagerDuty, email, etc. as appropriate.
  • Keep the current check_graphite code we are using to do metric based alerting. It enables us to transition when we can and roll back if needed.
  • Setup a Prometheus / AlertManager instance for any global aggregation and handle routing of alerts from Prometheus metric based checks to Nagios.
  • Upgrade Grafana to 2.5 (or better) to be the global user interface to metrics and be able to pull data from many different sources: Graphite, Prometheus, and Elasticsearch.
  • Scale Graphite storage with some form of black magic.

Sharded Systems: These systems are the infrastructure setup as part of each data center or region.

  • A Prometheus server to scrape metrics from local systems and services. Each Prometheus server maps and forwards data points to Graphite. Perhaps an identical second server for redundancy. Alerts and aggregate metrics flow upward toward the global Prometheus service.
  • A local Graphite/Statsd ingestion service found by service discovery to handle and route old school metrics.

The design of this gives me a Prometheus system we can use for advanced alerting and short term monitoring of metrics with great support for ephemeral hosts and labeling. Graphite still collects unconverted metrics and holds our historical or long term data. Graphite also serves as a long term storage option for Prometheus. (See this patch.)

What’s left unsolved? The hard parts:

  • Long term metric storage must scale, and Whisper files aren’t cutting it. I need to spend some time with alternate Graphite backends or write my own. Many of the existing options bring along their own challenges. I am required to keep full resolution data for years.

I have some ideas here. I had hopes for InfluxDB but it does not appear stable. But, I’m thinking something far simpler. More to come here.

Will this work? Will this scale to 20 million metrics or more? Perhaps it’s worth finding out.


If you run Graphite at scale you are interested in applying this patch.

I was tracing performance issues in my Graphite cluster and saw that for some queries the backend storage nodes were sending abnormally large pickle objects back to the Graphite web frontends. Python’s httplib was taking several minutes to download the pickle objects causing query times to skyrocket.

Testing against my backend storage nodes, I found that with carefully crafted time ranges the whisper.py code would adjust the from and until times so that they were equal. This case was not detected and resulted in a read of the entire Whisper database. Only one valid point was returned, along with a list of many, many None values. Example:

curl -v -o /tmp/out.pkl 'http://storage-backendXXX/render/?local=1&format=pickle&from=1444249200&until=1444249440&target=<simple metric glob target>'

The query I was testing with (identifying bits removed) was returning pickle objects that were just shy of 50MiB, with an M. With the above patch those pickle objects shrink down to about 40KiB. This matched the size of the pickle objects generated with time ranges that included only one data point and did not trigger the bug.
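Conceptually, the missing piece is a guard for the degenerate range. This is only my illustration of the idea in Go, not the actual whisper.py patch, and the rounding shown here is an assumption about how the adjustment happens:

package whisperfix

// clampRange rounds a query's from/until down to the archive's step
// and reports whether any points remain to fetch.  The essence of the
// bug was the missing until > from check: a degenerate range fell
// through to a read of the entire database.  (Illustrative only; the
// real fix lives in whisper.py.)
func clampRange(from, until, step int64) (int64, int64, bool) {
    from -= from % step
    until -= until % step
    return from, until, until > from
}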

These long queries were affecting response times for other queries as well. The following graph shows the difference in performance the patch achieved. The scale on the left is time to retrieve pickle objects from the backend storage nodes in seconds. The scale on the right is the number of retrievals per second.

Graphite Retrieval Times


The most difficult bit about running a Graphite cluster is handling queries and graph rendering during a cluster rebalance, or after a partitioning event when you use replication in your consistent hashing cluster. Suddenly, graphs under-report, have partial data, or might even be completely different when you reload them. Generally, your Graphite cluster becomes useless until sanity is restored.

I upgraded my Graphite setup in May to Graphite 0.9.13-ish. It’s very close to the top of the 0.9.x branch of the Git repo. This has a bulk-fetch patch that drastically speeds up queries and rendering. It also changes how the webapp decides which metric TimeSeries to use if it gets more than one.

Getting more than one answer for a specific metric is what causes all the pain. This is caused by duplicate Whisper files for the same metric that do not have identical data in them. Exactly what happens during a rebalance. It also happens with replication set higher than 1, but without an outage the Whisper DBs are identical.

In these cases, instead of choosing the “most complete” TimeSeries to use (which causes partial results or under reported results) why not merge them together? Why hasn’t this been done before?

I patched the bulk-fetch code to do just this. Now I wonder if I can continue to scale Graphite into the petabytes without having to replace the backend with a Cassandra or Riak database?
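The merge itself is simple. Here is the idea sketched in Go (the webapp change is Python, so this is only an illustration of the logic): prefer a real value from either series and keep a gap only when both series are missing the point.

package merge

// MergeSeries merges two views of the same metric over the same time
// range and step.  Nil entries represent missing data points (None in
// the Graphite webapp).  A real point from either side beats a gap;
// when both sides have data, the first series wins.
func MergeSeries(a, b []*float64) []*float64 {
    out := make([]*float64, len(a))
    for i := range a {
        switch {
        case a[i] != nil:
            out[i] = a[i]
        case i < len(b):
            out[i] = b[i]
        }
    }
    return out
}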


This was written at the request of one of my clients in June.

We use Graphite, Grafana, and Statsd extensively. We monitor everything from basic server health and detailed MySQL debugging information to web application transactions and statistics about the communication of users’ devices. The number of metrics we generate has grown exponentially – and sometimes beyond our capacity to plan for.

Exponential Growth of Graphite Metrics

In The Beginning

Early on we had a Graphite cluster that grew from an initial machine to a handful. Many had different amounts of attached storage, and we used Graphite’s relay-rules.conf to set up regular expressions matching specific groups of metrics to a storage location. We grew the cluster by adding “sub-clusters” of 2 or more machines that would use Graphite’s consistent hashing to essentially build a storage pool. With all the differences, even though we built the entire system with Puppet, it was difficult to get right.

Our initial problems included bad entries in the CLUSTER_SERVERS setting in the Graphite web application. Graphs took forever to render as the webapp had to time out on non-existent Graphite servers. We constantly did battle with which server would fill up next and how to further break down the metrics into more groups and add storage and hardware.

Let’s not forget our UDP friend, Statsd. We were receiving upwards of 700,000 UDP packets per second at the one Statsd daemon on the main Graphite relay, while the Node.js daemon only handled about 40,000 packets per second. The other 660,000 metrics per second were being dropped. Testing further showed that the Node.js code would drop a significant number of metrics as the rate neared 40,000 packets per second. We knew that Statsd was under-reporting, but we didn’t know how badly until we broke out our network best friends: wireshark and tcpdump.

Load Balancing

Other components of our systems work best on bare metal, and our hosting provider excels at bare metal as a service. The load balancers available were hardware solutions that charged by the connection, which just wasn’t an option for Graphite and Statsd TCP and UDP streams. I also had to be able to load balance Statsd UDP traffic, which ruled out HAProxy, one of our primary load balancing tools.

However, I have worked with LVS/IPVS load balancing setups for years and I knew it had the horsepower for the job, even on reasonably priced hardware. Of course, if I send the same Statsd metric to multiple daemons via a simple round-robin load balancing algorithm, all the daemons will in turn report that metric to Graphite. With Graphite, when you send duplicate timestamps for the same metric, the last write wins – which makes your Statsd data useless. Like carbon-relay.py, I needed a Statsd proxy that could run a consistent hash algorithm on the metric name and always direct it to the right Statsd daemon.
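The property that matters is deterministic routing: the same metric name must always land on the same Statsd daemon so that exactly one daemon aggregates and reports it. A drastically simplified sketch of that property (a real proxy uses a proper consistent hash ring so that backends can come and go; this is not the StatsRelay code):

package statsdproxy

import "hash/fnv"

// PickBackend deterministically chooses a Statsd backend for a metric
// name.  A bare hash-mod like this only works for a fixed, non-empty
// set of daemons; a consistent hash ring handles membership changes.
func PickBackend(metric string, backends []string) string {
    h := fnv.New32a()
    h.Write([]byte(metric))
    return backends[h.Sum32()%uint32(len(backends))]
}

Every sample of a given metric, say a timer for one web transaction, is always forwarded to the same daemon, so its aggregated rates and percentiles stay correct.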

I used a tiered load balancing approach to solve these problems. I load balanced across proxies which, in turn, knew the right host to send data to through application-specific inspection. This worked out well because the proxies couldn’t handle the full traffic load on their own, and the endpoints that stored the data turned out to have other issues limiting their rate of ingestion.

Load Balancing Diagram

Project: StatsRelay

Uh, what Statsd proxy? The Statsd team had begun to work on a proxy.js to do just this. Great! However, testing showed the same Node.js limitations with a maximum of 40,000 packets per second. Scaling to 200,000 packets per second consumed 8 cores and 8G of RAM and continued to drop packets. A lot of packets. This was not going to scale.

I have found Go to be a really fantastic language for dealing with operations tasks. Type safety does wonders to keep bugs out of the code, it’s nearly as fast as C, and it produces very efficient programs. Scaling isn’t just about the horizontal or the vertical – you also need efficiency.

I created the StatsRelay project to fill this need. Written in Go, it is benchmarked at consuming 250,000 UDP packets per second with no packet loss, using 10MiB of RAM and about half a CPU core. To help the underlying Statsd daemons handle more metrics with less packet loss, StatsRelay sends multiple metrics per packet. String operations are always cheaper than making system calls. Now this we can work with.
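The batching trick is worth a sketch of its own. The idea (illustrative only, not the StatsRelay source) is to pack as many newline-delimited Statsd lines as will fit under the MTU into a single datagram, trading a little string handling for far fewer system calls:

package statsdproxy

import "net"

const maxDatagram = 1400 // stay under a typical Ethernet MTU

// Batcher accumulates Statsd lines and flushes them as one datagram.
type Batcher struct {
    conn net.Conn
    buf  []byte
}

func NewBatcher(conn net.Conn) *Batcher {
    return &Batcher{conn: conn, buf: make([]byte, 0, maxDatagram)}
}

// Add queues one metric line, flushing first if it would not fit.
func (b *Batcher) Add(line string) error {
    if len(b.buf)+len(line)+1 > maxDatagram {
        if err := b.Flush(); err != nil {
            return err
        }
    }
    b.buf = append(b.buf, line...)
    b.buf = append(b.buf, '\n')
    return nil
}

// Flush sends everything queued so far in a single write.
func (b *Batcher) Flush() error {
    if len(b.buf) == 0 {
        return nil
    }
    _, err := b.conn.Write(b.buf)
    b.buf = b.buf[:0]
    return err
}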

Processed Statsd Metrics Before and After StatsRelay

This graph shows the processed Statsd metrics per second flat lined at the beginning. The green is the rate handled by StatsRelay and shows that we’ve scaled quite a bit from the capabilities of a single Statsd daemon. It also helped that we shut down an unintentionally metric heavy source that wasn’t providing a lot of value.

Graphite Migration #1

When looking at our historical storage consumption, I calculated that we would continue to grow at about 1.5x that rate. With this information I built a cluster using these techniques that I estimated would last 2 years. This was October and November of 2014. Find that on the metric ingestion graph above for the full humor value. I built some Fabric tasks to use a set of tools called Carbonate to migrate the data over. Things went smoothly and successfully.

By the end of January the new “two-year” cluster was full. I had roughly some 30 TiB of Whisper files.

Graphite Migration #2

There’s no problem that scaling horizontally won’t solve! (You know, the complete anti-theme of this article.) We ordered more servers, got them configured, and prepared for the cluster expansion. I tuned my Fabric tasks to run the Carbonate tools with as much parallelism as I could. (There was one lunch I remember coming back from to find one of our admin hosts nearly dead from a test run of the Fabric task.) Test runs completed in about 2 days. Finally, I ran the full rebalancing.

The cluster rebalance took more than 7 days during which query results were mostly useless. I knew query results would be “weird” until the rebalance completed – but 7 days was much longer than my worst case time expectations.

My tests had purposely left the new nodes populated. This had the permissions set correctly and was an attempt to lessen the amount of query weirdness. The Python implementation of whisper-fill.py took much longer than simply rsync’ing the data over. 2 to 3 seconds per metric times roughly 2 million metrics in flight with 4 machines running in parallel is 11 to 17 days. Oops!

Available Whisper storage: 60TiB.

Project: Buckytools

I was having other issues with my Fabric tasks. Wrapping everything in shell and piping through SSH was starting to break down. The Python implementation of the fill algorithm was way too slow. The local disk space requirements to move around metrics were becoming painful. I needed an efficient client / server API with a high degree of concurrency. Sounds like Go.

I began work on a new project called Buckytools. The server end is a REST API that can lay down new Whisper files, retrieve Whisper files, run the fill operation on an existing Whisper file with new data, list metrics on disk, and be knowledgeable of the consistent hash ring. The client compares hash rings to make sure they are identical, creates tar archives, restores tar archives, finds metrics in the wrong locations, and rebalances the cluster. Also, lots of other administrative tasks for working with a large consistent hashing Graphite cluster.

There are two key pieces to this project. First, it implements the Graphite consistent hashing algorithm in Go. Second, it implements the fill algorithm in Go. With good use of goroutines and Go’s amazing net/http library this turned out to be significantly faster than the previous tools.
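The fill operation is the interesting piece: copy data points from a source archive into the destination only where the destination has gaps, never overwriting data it already has. Sketched in Go over in-memory points (the real code works against the Whisper file format on disk, archive by archive):

package fill

// Fill copies points from src into dst wherever dst is missing data
// (nil), never overwriting a point dst already has.  Both slices are
// assumed to cover the same time range at the same step.
func Fill(dst, src []*float64) {
    for i := range dst {
        if dst[i] == nil && i < len(src) {
            dst[i] = src[i]
        }
    }
}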

Graphite Migration #3

It’s May, 2015. A new tier of application servers has pushed the cluster into the red for storage space. Instead of being able to wait for a Summer overhaul, I needed more space now. This was the moment I had been prepping for with Buckytools. Similar to migration #2, I added another 4 storage nodes and needed to rebalance about 2.3 million metrics. Also, learning from that migration, I made sure to clear off any test data left on the new nodes.

GOMAXPROCS=4 bucky rebalance -w 100

Speed. 2.3 million metrics in 21 hours. The rebalance process removed old metrics after a successful heal, resulting in much less query weirdness from partial data. The migration was going wonderfully well…

Until carbon-cache.py daemons started crashing due to ever-increasing memory usage. Once I restarted them all, I realized that much of the data received during the 21 hour window had vanished. I examined metric after metric to see large gaps with the occasional string of data points for this time window. I was…let’s say I was upset.

Example of Lost Data Points

Available Whisper storage: 88TiB.

You Wouldn’t Like Me When I’m Angry

This smelled familiar. We’ve encountered this bug before on the Graphite cluster and chalked it up to a fluke. This time it manifested in every carbon-cache.py daemon on the new data nodes. Future migrations would come, so this bug had to go.

It turns out that whisper.py, which manages the Whisper files on disk, doesn’t ensure that files will be closed. (Versions 0.9.12 and 0.9.13.) We also have Graphite configured to lock files via a flock() call. When an exception occurred in whisper.update_many() (called from carbon-cache.py’s writer thread) the Whisper file would not be closed, and in most cases the Python garbage collector would come along, close the out-of-scope file descriptor, and unlock the file. However, with enough pressure on the carbon-cache.py daemon I hit the condition where an exception occurred but the file descriptor had not been garbage collected before carbon-cache.py needed to write to that metric again. At this point the daemon attempted to obtain an exclusive lock on the file and the writer thread deadlocked. The thread had two open file descriptors to the same Whisper file.

This is fixed in the master branch of the Whisper project and I’m working on a backport to 0.9.x. After we’ve stress tested a new Graphite cluster setup I’ll get this in a pull request.

The Future

Our Graphite cluster is growing very quickly and we know that adding 4 nodes at a time every couple of months won’t scale for long. I’m planning to rebuild the cluster to scale past the holiday shopping season, which is our busiest time of the year. So, besides just adding more machines, what are we looking at doing?

  • Statsd: Pull this off the Graphite storage nodes so it can be scaled independently. We’re running almost 30 daemons to keep up with load, and due to consistent hash ring unevenness and Node.js being what it is, there is still packet loss. Use a more efficient implementation of Statsd; Bitly’s Statsdaemon looks very promising.
  • Dedicated Query/Render machines: The user experience has different scaling requirements than the rest of the cluster. Especially caching.
  • Evaluate carbon-c-relay: A much more efficient consistent hashing metric router. Pull all metric routing off of storage nodes. This allows me to not have storage nodes on the LVS/IPVS subnets since I’m going to have more nodes than IPs there.
  • Alternate consistent hashing algorithms: A feature of carbon-c-relay as I’ve noticed we have quite a few collisions in the hash ring. Graphite’s consistent hash ring implementation only has 64k slots.

That’s some of our thoughts for the immediate future. There is lots of room to grow and scale and more challenges and bugs to come. At some point we will need to consider alternate technologies for storing time series data and how to port all of our existing line and pickle protocol based tools. Currently, the Graphite ecosystem is very popular here and gives us much value. We also think that, with our data patterns, Whisper is a fairly efficient data store for us. We expect to be scaling Graphite into the future, and writing a few more articles about it.


I’m just going to call my current client Bruce.

The Graphite cluster I’m working on with Bruce grows almost faster than I can keep up. One of the tools I’ve been transitioning to is carbon-c-relay to do consistent hashing and full metric routing in C. It works very, very well.

However, most of the data Bruce tosses at me utilizes Graphite’s Pickle protocol which carbon-c-relay doesn’t support. I needed an efficient and very fast daemon that could decode Graphite’s Pickle protocol and pass it to carbon-c-relay.

A new tool in the Buckytools project has been born! bucky-pickle-relay listens for Graphite’s Pickle protocol, decodes it using the og-rek Go library and retransmits it to a target relay.

An Upstart configuration for it:

description "bucky-picle-relay accepts Graphite pickle data and outputs text data."
author      "Jack Neely"

start on started carbon-c-relay
stop on stopping carbon-c-relay

setuid graphite
setgid graphite

limit nofile 32768 32768

exec /usr/bin/bucky-pickle-relay -b 0.0.0.0:2004 localhost:2003

Comments and patches (especially patches) welcome.


I found a StackOverflow post that helped save my day. I’ve been dealing with a bug in Graphite’s carbon-cache.py daemon (version 0.9.12). It hit me badly during a cluster expansion and I know I have hit it before. I had to find it.

This StackOverflow post gave me the magic I needed to signal the process to dump out a Python traceback of all running threads. That was the catch, dumping all threads.

import threading, sys, traceback

def dumpstacks(signal, frame):
    id2name = dict([(th.ident, th.name) for th in threading.enumerate()])
    code = []
    for threadId, stack in sys._current_frames().items():
        code.append("\n# Thread: %s(%d)" % (id2name.get(threadId,""), threadId))
        for filename, lineno, name, line in traceback.extract_stack(stack):
            code.append('File: "%s", line %d, in %s' % (filename, lineno, name))
            if line:
                code.append("  %s" % (line.strip()))
    print "\n".join(code)

import signal
signal.signal(signal.SIGQUIT, dumpstacks)

I changed the print statement to output to Graphite’s log files.

What I knew and what I had:

  • My bug happened after the writer thread hit an unexpected exception in the whisper.update_many() function.
  • I had figured it crashed or somehow hung up the entire writer thread.
  • The number of data points in the cache skyrocketed when the bug manifested and all writes to WSP files stopped.
  • Tracing through the code I knew that unexpected exceptions in whisper.file_update_many() bubbled up past whisper.update_many() which skipped closing of that file.
  • I saw via lsof that the process would hold open the bad Whisper file.
  • This happened under pretty heavy load. About 250,000 metrics per minute.
  • I had a really bad load tester.
  • I had beefy test machines.

It took a while, but I figured out how to reproduce the conditions and hit the bug. I had some extra logging in writer.py to help figure out what was going on. But I knew I needed a stack trace of that particular thread when this happened to nail down exactly what was really going on.

So, I reset my test and added the above code to writer.py. A few minutes later I had triggered the bug again.

# kill -sigquit 3460

Suddenly, I had a traceback in carbon-cache-a/console.log.

# Thread: PoolThread-twisted.internet.reactor-1(140363856660224)
File: "/usr/lib/python2.7/threading.py", line 524, in __bootstrap
  self.__bootstrap_inner()
File: "/usr/lib/python2.7/threading.py", line 551, in __bootstrap_inner
  self.run()
File: "/usr/lib/python2.7/threading.py", line 504, in run
  self.__target(*self.__args, **self.__kwargs)
File: "/usr/local/lib/python2.7/dist-packages/twisted/python/threadpool.py", line 191, in _worker
  result = context.call(ctx, function, *args, **kwargs)
File: "/usr/local/lib/python2.7/dist-packages/twisted/python/context.py", line 118, in callWithContext
  return self.currentContext().callWithContext(ctx, func, *args, **kw)
File: "/usr/local/lib/python2.7/dist-packages/twisted/python/context.py", line 81, in callWithContext
  return func(*args,**kw)
File: "/opt/graphite/lib/carbon/writer.py", line 168, in writeForever
  writeCachedDataPoints()
File: "/opt/graphite/lib/carbon/writer.py", line 145, in writeCachedDataPoints
  whisper.update_many(dbFilePath, datapoints)
File: "/opt/graphite/lib/whisper.py", line 577, in update_many
  return file_update_many(fh, points)
File: "/opt/graphite/lib/whisper.py", line 582, in file_update_many
  fcntl.flock( fh.fileno(), fcntl.LOCK_EX )

Also, a graph to further show that I had hit the bug I was looking for.

Finally, I knew what was happening:

  1. I have locking enabled for safety.
  2. On error, the Whisper file wasn’t closed.
  3. The Python GC never got around to closing the file – which means it was never unlocked.
  4. The metric came around in the cache again to update.
  5. Boom! The writer thread was deadlocked.

The moral of the story: Always close your files.
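In Go terms, the safe pattern is the one the Whisper code was missing: take the lock, and guarantee the file is closed (which also releases the lock) on every path out of the function. A small illustrative sketch, not tied to any particular project:

package locking

import (
    "os"
    "syscall"
)

// updateLocked opens a file, takes an exclusive flock, and guarantees
// the descriptor is closed no matter how the update ends.  Closing the
// file releases the lock, so an error can never leave the file locked
// the way the leaked Python file handles did.
func updateLocked(path string, update func(*os.File) error) error {
    fh, err := os.OpenFile(path, os.O_RDWR, 0644)
    if err != nil {
        return err
    }
    defer fh.Close() // always runs, even on error

    if err := syscall.Flock(int(fh.Fd()), syscall.LOCK_EX); err != nil {
        return err
    }
    return update(fh)
}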


After a bit of Graphite maintenance I had several complaints about graphs being “weird.” Upon inspection they had unusual spikes, data gaps, and negative results where negative numbers made no sense. “L”-shaped spikes describe them best, I think.

This turned out to be a counter metric, like the number of packets your NIC has received. So, a look at the raw data showed the following:

This is the same data, but I’ve removed the nonNegativeDerivative() function. To everyone’s surprise, this counter occasionally decreases! This led us to a problem with the client reporting the metrics.

Let’s review what nonNegativeDerivative() does:

Same as the derivative function above, but ignores datapoints that trend down. Useful for counters that increase for a long time, then wrap or reset. (Such as if a network interface is destroyed and recreated by unloading and re-loading a kernel module, common with USB / WiFi cards.)

So data that trended down was removed from the top graph, making it look like it had missing data points. Of course, the bogus data was only amplified by the derivative function.
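To make that concrete, here is roughly what nonNegativeDerivative() does to a series, sketched in Go (Graphite’s implementation is Python; nil stands in for None): any point whose delta from the previous point is negative becomes a gap rather than a value.

package render

// nonNegativeDerivative returns the per-step change of a counter
// series, dropping (nil) any step where the counter went down.  Those
// dropped steps are the apparent gaps in the graphs above.
func nonNegativeDerivative(series []*float64) []*float64 {
    out := make([]*float64, len(series))
    var prev *float64
    for i, v := range series {
        if v != nil && prev != nil && *v >= *prev {
            delta := *v - *prev
            out[i] = &delta
        }
        prev = v
    }
    return out
}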

So, due to the fact we had bogus data coming in, our normal method of showing the rate of change here produced very unusual behavior. This is, however, not a fault or error in Graphite. The real question I ended up being left with is how a restart of the load balancer for the Graphite cluster caused a client or two to misbehave.
