I found a StackOverflow post that helped save my day. I’ve been dealing with a bug in Graphite’s carbon-cache.py daemon (version 0.9.12). It hit me badly during a cluster expansion and I know I have hit it before. I had to find it.

This StackOverflow post gave me the magic I needed to signal the process to dump out a Python traceback of all running threads. That was the catch, dumping all threads.

import threading, sys, traceback

def dumpstacks(signal, frame):
    id2name = dict([(th.ident, th.name) for th in threading.enumerate()])
    code = []
    for threadId, stack in sys._current_frames().items():
        code.append("\n# Thread: %s(%d)" % (id2name.get(threadId,""), threadId))
        for filename, lineno, name, line in traceback.extract_stack(stack):
            code.append('File: "%s", line %d, in %s' % (filename, lineno, name))
            if line:
                code.append("  %s" % (line.strip()))
    print "\n".join(code)

import signal
signal.signal(signal.SIGQUIT, dumpstacks)

I changed the print statement to output to Graphite’s log files.

What I knew and what I had:

  • My bug happened after the writer thread hit a unexpected exception in the whisper.update_many() function.
  • I had figured it crashed or somehow hung up the entire writer thread
  • The number of data points in the cache sky rocketed when the bug manifested and all writes to WSP files stopped.
  • Tracing through the code I knew that unexpected exceptions in whisper.file_update_many() bubbled up past whisper.update_many() which skipped closing of that file.
  • I saw via lsof that the process would hold open the bad Whisper file.
  • This happened under pretty heavy load. About 250,000 metrics per minute.
  • I had a really bad load tester.
  • I had beefy test machines.

It took a while bug I figured out how to reproduce the conditions and hit the bug. I had some extra logging in writer.py to help figure out what was going on. But I knew I needed a stack trace of that particular thread when this happened to nail down exactly what was really going on.

So, I reset my test and added the above code to writer.py. A few minutes later I had triggered the bug again.

# kill -sigquit 3460

Suddenly, I had a traceback in carbon-cache-a/console.log.

# Thread: PoolThread-twisted.internet.reactor-1(140363856660224)
File: "/usr/lib/python2.7/threading.py", line 524, in __bootstrap
  self.__bootstrap_inner()
File: "/usr/lib/python2.7/threading.py", line 551, in __bootstrap_inner
  self.run()
File: "/usr/lib/python2.7/threading.py", line 504, in run
  self.__target(*self.__args, **self.__kwargs)
File: "/usr/local/lib/python2.7/dist-packages/twisted/python/threadpool.py", line 191, in _worker
  result = context.call(ctx, function, *args, **kwargs)
File: "/usr/local/lib/python2.7/dist-packages/twisted/python/context.py", line 118, in callWithContext
  return self.currentContext().callWithContext(ctx, func, *args, **kw)
File: "/usr/local/lib/python2.7/dist-packages/twisted/python/context.py", line 81, in callWithContext
  return func(*args,**kw)
File: "/opt/graphite/lib/carbon/writer.py", line 168, in writeForever
  writeCachedDataPoints()
File: "/opt/graphite/lib/carbon/writer.py", line 145, in writeCachedDataPoints
  whisper.update_many(dbFilePath, datapoints)
File: "/opt/graphite/lib/whisper.py", line 577, in update_many
  return file_update_many(fh, points)
File: "/opt/graphite/lib/whisper.py", line 582, in file_update_many
  fcntl.flock( fh.fileno(), fcntl.LOCK_EX )

Also, a graph to further show that I had hit the bug I was looking for.

Finally, I knew what was happening:

  1. I have locking enabled for safety.
  2. On error, the Whisper file wasn’t closed
  3. The Python GC never got around to closing the file – which means it was never unlocked.
  4. The metric came around in the cache again to update.
  5. Boom! The writer thread was deadlocked.

The moral of the story: Always close your files.

Previous | Back | Next


comments powered by Disqus