Scaling Graphite 3: Whisper Bugs
If you run Graphite at scale, you will want to apply this patch.
I was tracing performance issues in my Graphite cluster and saw that for some
queries the backend storage nodes were sending abnormally large pickle
objects back to the Graphite web frontends. Python’s httplib was taking
several minutes to download the pickle objects, causing query times to
skyrocket.
Testing against my backend storage nodes, I found that with carefully crafted
time ranges the whisper.py code would adjust the from and until times so
that they were equal. This case was not detected and resulted in a
read of the entire Whisper database. The response contained only one valid
point, padded out with a list of many, many None values. Example:
curl -v -o /tmp/out.pkl 'http://storage-backendXXX/render/?local=1&format=pickle&from=1444249200&until=1444249440&target=<simple metric glob target>'
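To make the failure mode concrete, here is a simplified sketch of how rounding both timestamps to the archive's step can collapse them onto the same interval, and how a ring-buffer read then treats "from equals until" as a full wrap-around. This is loosely modeled on the behavior described above and is not the actual whisper.py code. The function names, the 300-second step, and the archive size of 105120 points are all assumptions picked for illustration.

```python
def aligned_interval(timestamp, step):
    """Round a timestamp up to the next step boundary (assumed alignment rule)."""
    return int(timestamp - (timestamp % step)) + step


def points_read(from_time, until_time, step, archive_points, patched):
    """Return how many points a ring-buffer style read would fetch.

    step and archive_points are hypothetical archive parameters.
    patched=True applies a guard that widens a zero-length range by one step.
    """
    from_interval = aligned_interval(from_time, step)
    until_interval = aligned_interval(until_time, step)

    if patched and from_interval == until_interval:
        # Zero-length range after alignment: include exactly one point.
        until_interval += step

    # Positions of both intervals inside the circular archive, in points.
    from_offset = (from_interval // step) % archive_points
    until_offset = (until_interval // step) % archive_points

    if from_offset < until_offset:
        # Contiguous read.
        return until_offset - from_offset
    # Wrap-around read: from_offset to the end, then the start to until_offset.
    # When the offsets are equal this covers the whole archive.
    return (archive_points - from_offset) + until_offset


if __name__ == "__main__":
    # Time range from the curl example; step and archive size are assumed.
    args = (1444249200, 1444249440, 300, 105120)
    print("unpatched:", points_read(*args, patched=False))
    print("patched:  ", points_read(*args, patched=True))
```

With the assumed 300-second step, both timestamps from the curl example align to the same interval, so the unguarded path reads every point in the archive while the guarded path reads one.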
The query (identifying bits removed) I was testing with was returning pickle objects that were just shy of 50MiB with an M. With the above patch those pickle objects shrink down to about 40KiB. This matched the size of the pickle objects generated with time ranges that included only 1 data point and did not cause the above bug.
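One way to check whether a backend is affected is to count how many values in a downloaded pickle are actually populated. The sketch below assumes the pickle holds a list of dicts, each with a "name" key and a "values" list; that layout is an assumption about the render endpoint's pickle output, and the helper names are hypothetical. Only unpickle files from hosts you trust, since loading a pickle can execute arbitrary code.

```python
import pickle


def summarize_series(series_list):
    """Return (name, total_points, non_null_points) for each series."""
    summary = []
    for series in series_list:
        values = series["values"]
        valid = sum(1 for value in values if value is not None)
        summary.append((series["name"], len(values), valid))
    return summary


def summarize_pickle_file(path):
    """Load a saved pickle response (e.g. /tmp/out.pkl) and summarize it."""
    with open(path, "rb") as handle:
        return summarize_series(pickle.load(handle))


if __name__ == "__main__":
    # Synthetic data standing in for a real response.
    sample = [{"name": "example.metric", "values": [None, None, 1.0, None]}]
    for name, total, valid in summarize_series(sample):
        print(f"{name}: {valid} valid of {total} points")
```

A series reporting a single valid point out of a very large total is the signature of the bug described here.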
These long queries were affecting response times for other queries as well. The following graph shows the difference in performance the patch achieved. The scale on the left is time to retrieve pickle objects from the backend storage nodes in seconds. The scale on the right is the number of retrievals per second.