Scaling Graphite 3: Whisper Bugs

If you run Graphite at scale, you will want to apply this patch.

While tracing performance issues in my Graphite cluster, I saw that for some queries the backend storage nodes were sending abnormally large pickle objects back to the Graphite web frontends. Python’s httplib was taking several minutes to download these objects, causing query times to skyrocket.

Testing against my backend storage nodes, I found that with carefully crafted time ranges the whisper.py code would adjust the from and until times so that they were equal. This case was not detected and resulted in a read of the entire Whisper database. The response contained only one valid point, followed by a very long list of None values. Example:

curl -v -o /tmp/out.pkl 'http://storage-backendXXX/render/?local=1&format=pickle&from=1444249200&until=1444249440&target=<simple metric glob target>'
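To make the failure mode concrete, here is a minimal sketch of how two different timestamps can collapse onto the same interval and then turn into a full read of a ring-buffer archive. This is a simplified model, not the actual whisper.py code: the function names, the 300-second resolution, and the archive size of 105120 points are all assumptions chosen for illustration. The `guard` flag stands in for the kind of check a fix would add.

```python
def align_interval(timestamp, seconds_per_point):
    """Round a timestamp up to the next interval boundary (simplified model)."""
    return int(timestamp - (timestamp % seconds_per_point)) + seconds_per_point


def points_read(from_time, until_time, seconds_per_point, archive_points, guard=False):
    """Count how many slots a ring-buffer read would cover in this model."""
    from_interval = align_interval(from_time, seconds_per_point)
    until_interval = align_interval(until_time, seconds_per_point)
    if guard and from_interval == until_interval:
        # Zero-length range: widen it so exactly one point is included.
        until_interval += seconds_per_point
    # Map each interval onto a slot in the circular archive.
    from_slot = (from_interval // seconds_per_point) % archive_points
    until_slot = (until_interval // seconds_per_point) % archive_points
    if from_slot < until_slot:
        return until_slot - from_slot
    # Wrap-around branch: equal slots look like a full trip around the ring.
    return archive_points - from_slot + until_slot


# Hypothetical archive: 300 seconds per point, 105120 points (one year).
unguarded = points_read(1444249200, 1444249440, 300, 105120)
guarded = points_read(1444249200, 1444249440, 300, 105120, guard=True)
print(unguarded, guarded)
```

Under these assumed parameters, both timestamps from the query above align to the same interval, so the unguarded read covers every slot in the archive while the guarded read covers one.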

The query I was testing with (identifying bits removed) was returning pickle objects that were just shy of 50MiB, with an M. With the above patch, those pickle objects shrink to about 40KiB. This matches the size of the pickle objects generated with time ranges that included only 1 data point and did not trigger the bug.
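One way to check whether a saved response shows this symptom is to count the None values in each series. The sketch below assumes the pickle decodes to a list of dicts that each carry a "name" and a "values" list. That layout is an assumption about the render response, and the helper names are hypothetical. Only unpickle data from hosts you trust.

```python
import pickle


def summarize_series(series_list):
    """Return (name, total_points, non_null_points) for each series dict."""
    summary = []
    for series in series_list:
        values = series["values"]
        non_null = sum(1 for value in values if value is not None)
        summary.append((series["name"], len(values), non_null))
    return summary


def summarize_pickle_file(path):
    """Load a saved render response from disk and summarize it."""
    with open(path, "rb") as handle:
        return summarize_series(pickle.load(handle))
```

Calling `summarize_pickle_file("/tmp/out.pkl")` on an affected response should show series whose total point count is far larger than their non-null count.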

These long queries were affecting response times for other queries as well. The following graph shows the difference in performance the patch achieved. The scale on the left is time to retrieve pickle objects from the backend storage nodes in seconds. The scale on the right is the number of retrievals per second.

LinuxCzar
