The most difficult bit about running a Graphite cluster is handling queries or graph rendering during a cluster rebalance. Or after a partitioning event when you use replication in your consistent hashing cluster. Suddenly, graphs under report, have partial data, or might even be completely different when you reload the graph. Generally, your Graphite cluster becomes useless until sanity is restored.
I upgraded my Graphite setup in May to Graphite 0.9.13-ish. Its very close to the top of the 0.9.x branch of the Git repo. This has a bulk-fetch patch that drastically speeds up queries and rendering. It also changes how the webapp decides which metric TimeSeries to use if it gets more than one.
Getting more than one answer for a specific metric is what causes all the pain. This is caused by duplicate Whisper files for the same metric that do not have identical data in them. Exactly what happens during a rebalance. It also happens with replication set higher than 1, but without an outage the Whisper DBs are identical.
In these cases, instead of choosing the “most complete” TimeSeries to use (which causes partial results or under reported results) why not merge them together? Why hasn’t this been done before?
I patched the bulk-fetch CRDT query resolver to do just this. Now I wonder if I can continue to scale Graphite into the petabytes without having to replace the backend with a Cassandra or Riak database?