Debugging Python or Deadlocks in Graphite

I found a StackOverflow post that helped save my day. I’ve been dealing with a bug in Graphite’s carbon-cache.py daemon (version 0.9.12). It hit me badly during a cluster expansion and I know I have hit it before. I had to find it.

This StackOverflow post gave me the magic I needed to signal the process to dump out a Python traceback of all running threads. That was the catch, dumping all threads.

import threading, sys, traceback

def dumpstacks(signal, frame):
    id2name = dict([(th.ident, th.name) for th in threading.enumerate()])
    code = []
    for threadId, stack in sys._current_frames().items():
        code.append("\n# Thread: %s(%d)" % (id2name.get(threadId,""), threadId))
        for filename, lineno, name, line in traceback.extract_stack(stack):
            code.append('File: "%s", line %d, in %s' % (filename, lineno, name))
            if line:
                code.append("  %s" % (line.strip()))
    print "\n".join(code)

import signal
signal.signal(signal.SIGQUIT, dumpstacks)

I changed the print statement to output to Graphite’s log files.

What I knew and what I had:

My bug happened after the writer thread hit a unexpected exception in the whisper.update_many() function.
I had figured it crashed or somehow hung up the entire writer thread
The number of data points in the cache sky rocketed when the bug manifested and all writes to WSP files stopped.
Tracing through the code I knew that unexpected exceptions in whisper.file_update_many() bubbled up past whisper.update_many() which skipped closing of that file.
I saw via lsof that the process would hold open the bad Whisper file.
This happened under pretty heavy load. About 250,000 metrics per minute.
I had a really bad load tester.
I had beefy test machines.

It took a while bug I figured out how to reproduce the conditions and hit the bug. I had some extra logging in writer.py to help figure out what was going on. But I knew I needed a stack trace of that particular thread when this happened to nail down exactly what was really going on.

So, I reset my test and added the above code to writer.py. A few minutes later I had triggered the bug again.

# kill -sigquit 3460

Suddenly, I had a traceback in carbon-cache-a/console.log.

# Thread: PoolThread-twisted.internet.reactor-1(140363856660224)
File: "/usr/lib/python2.7/threading.py", line 524, in __bootstrap
  self.__bootstrap_inner()
File: "/usr/lib/python2.7/threading.py", line 551, in __bootstrap_inner
  self.run()
File: "/usr/lib/python2.7/threading.py", line 504, in run
  self.__target(*self.__args, **self.__kwargs)
File: "/usr/local/lib/python2.7/dist-packages/twisted/python/threadpool.py", line 191, in _worker
  result = context.call(ctx, function, *args, **kwargs)
File: "/usr/local/lib/python2.7/dist-packages/twisted/python/context.py", line 118, in callWithContext
  return self.currentContext().callWithContext(ctx, func, *args, **kw)
File: "/usr/local/lib/python2.7/dist-packages/twisted/python/context.py", line 81, in callWithContext
  return func(*args,**kw)
File: "/opt/graphite/lib/carbon/writer.py", line 168, in writeForever
  writeCachedDataPoints()
File: "/opt/graphite/lib/carbon/writer.py", line 145, in writeCachedDataPoints
  whisper.update_many(dbFilePath, datapoints)
File: "/opt/graphite/lib/whisper.py", line 577, in update_many
  return file_update_many(fh, points)
File: "/opt/graphite/lib/whisper.py", line 582, in file_update_many
  fcntl.flock( fh.fileno(), fcntl.LOCK_EX )

Also, a graph to further show that I had hit the bug I was looking for.

Finally, I knew what was happening:

I have locking enabled for safety.
On error, the Whisper file wasn’t closed
The Python GC never got around to closing the file – which means it was never unlocked.
The metric came around in the cache again to update.
Boom! The writer thread was deadlocked.

The moral of the story: Always close your files.

LinuxCzar

Debugging Python or Deadlocks in Graphite