I found a StackOverflow post that helped save my day. I’ve been dealing with a bug in Graphite’s carbon-cache.py daemon (version 0.9.12). It hit me badly during a cluster expansion and I know I have hit it before. I had to find it.

This StackOverflow post gave me the magic I needed to signal the process to dump out a Python traceback of all running threads. That was the catch, dumping all threads.

import threading, sys, traceback

def dumpstacks(signal, frame):
id2name = dict([(th.ident, th.name) for th in threading.enumerate()])
code = []
for filename, lineno, name, line in traceback.extract_stack(stack):
code.append('File: "%s", line %d, in %s' % (filename, lineno, name))
if line:
code.append("  %s" % (line.strip()))
print "\n".join(code)

import signal
signal.signal(signal.SIGQUIT, dumpstacks)


I changed the print statement to output to Graphite’s log files.

What I knew and what I had:

• My bug happened after the writer thread hit a unexpected exception in the whisper.update_many() function.
• I had figured it crashed or somehow hung up the entire writer thread
• The number of data points in the cache sky rocketed when the bug manifested and all writes to WSP files stopped.
• Tracing through the code I knew that unexpected exceptions in whisper.file_update_many() bubbled up past whisper.update_many() which skipped closing of that file.
• I saw via lsof that the process would hold open the bad Whisper file.
• This happened under pretty heavy load. About 250,000 metrics per minute.
• I had beefy test machines.

It took a while bug I figured out how to reproduce the conditions and hit the bug. I had some extra logging in writer.py to help figure out what was going on. But I knew I needed a stack trace of that particular thread when this happened to nail down exactly what was really going on.

So, I reset my test and added the above code to writer.py. A few minutes later I had triggered the bug again.

# kill -sigquit 3460


Suddenly, I had a traceback in carbon-cache-a/console.log.

# Thread: PoolThread-twisted.internet.reactor-1(140363856660224)
File: "/usr/lib/python2.7/threading.py", line 524, in __bootstrap
self.__bootstrap_inner()
File: "/usr/lib/python2.7/threading.py", line 551, in __bootstrap_inner
self.run()
File: "/usr/lib/python2.7/threading.py", line 504, in run
self.__target(*self.__args, **self.__kwargs)
File: "/usr/local/lib/python2.7/dist-packages/twisted/python/threadpool.py", line 191, in _worker
result = context.call(ctx, function, *args, **kwargs)
File: "/usr/local/lib/python2.7/dist-packages/twisted/python/context.py", line 118, in callWithContext
return self.currentContext().callWithContext(ctx, func, *args, **kw)
File: "/usr/local/lib/python2.7/dist-packages/twisted/python/context.py", line 81, in callWithContext
return func(*args,**kw)
File: "/opt/graphite/lib/carbon/writer.py", line 168, in writeForever
writeCachedDataPoints()
File: "/opt/graphite/lib/carbon/writer.py", line 145, in writeCachedDataPoints
whisper.update_many(dbFilePath, datapoints)
File: "/opt/graphite/lib/whisper.py", line 577, in update_many
return file_update_many(fh, points)
File: "/opt/graphite/lib/whisper.py", line 582, in file_update_many
fcntl.flock( fh.fileno(), fcntl.LOCK_EX )


Also, a graph to further show that I had hit the bug I was looking for.

Finally, I knew what was happening:

1. I have locking enabled for safety.
2. On error, the Whisper file wasn’t closed
3. The Python GC never got around to closing the file – which means it was never unlocked.
4. The metric came around in the cache again to update.

The moral of the story: Always close your files.