What does it do?
StatsRelay is designed to help you scale out your ingestion of Statsd metrics. It is a simple proxy that you send your Statsd metrics to. It will then forward your metrics to a list of backend Statsd daemons. A consistent hashing function is used with each metric name to determine which of the Statsd backends will receive the metric. This ensures that only one Statsd backend daemon is responsible for a specific metric. This prevents Graphite or your upstream time series database from recording partial results.
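As a rough illustration of the routing idea above, here is a minimal consistent-hash ring sketch. It is not StatsRelay's actual code: the `ring` type, the `newRing` and `pick` helpers, the FNV-1a hash, the replica count, and the backend addresses are all assumptions chosen for the example. It only shows how hashing the metric name can always send a given metric to the same backend.

```go
package main

import (
	"fmt"
	"hash/fnv"
	"sort"
	"strings"
)

// ring is a hypothetical consistent-hash ring (not StatsRelay's real type).
type ring struct {
	points []uint32          // sorted hash points on the ring
	owner  map[uint32]string // hash point -> backend address
}

// hashKey hashes a string with FNV-1a (an assumed choice of hash function).
func hashKey(s string) uint32 {
	h := fnv.New32a()
	h.Write([]byte(s))
	return h.Sum32()
}

// newRing places `replicas` virtual points per backend on the ring.
func newRing(backends []string, replicas int) *ring {
	r := &ring{owner: make(map[uint32]string)}
	for _, b := range backends {
		for i := 0; i < replicas; i++ {
			p := hashKey(fmt.Sprintf("%s#%d", b, i))
			if _, taken := r.owner[p]; taken {
				continue // on a hash collision the first backend keeps the point
			}
			r.owner[p] = b
			r.points = append(r.points, p)
		}
	}
	sort.Slice(r.points, func(i, j int) bool { return r.points[i] < r.points[j] })
	return r
}

// pick returns the backend that owns the first point at or after the
// metric name's hash, wrapping around to the start of the ring.
func (r *ring) pick(metric string) string {
	if len(r.points) == 0 {
		return ""
	}
	h := hashKey(metric)
	i := sort.Search(len(r.points), func(i int) bool { return r.points[i] >= h })
	if i == len(r.points) {
		i = 0
	}
	return r.owner[r.points[i]]
}

func main() {
	// Example backend addresses; purely illustrative.
	r := newRing([]string{"10.0.0.1:8125", "10.0.0.2:8125", "10.0.0.3:8125"}, 100)
	for _, line := range []string{"web.requests:1|c", "web.requests:1|c", "db.latency:12|ms"} {
		// A Statsd line has the form name:value|type; route on the name only.
		name := strings.SplitN(line, ":", 2)[0]
		fmt.Printf("%s -> %s\n", name, r.pick(name))
	}
}
```

Because the choice depends only on the metric name, every update for `web.requests` goes to the same backend regardless of which client sent it.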
Why would you use it?
Do you have an application tier of multiple machines that send updates for the same metric into Statsd?
When you need to engineer a scalable Statsd ingestion service you need a way to balance between more than one Statsd daemon. StatsRelay provides that functionality. You can also use multiple StatsRelay daemons behind a UDP load balancer like LVS to further scale out your infrastructure.
StatsRelay is designed to be fast, which is the primary reason it is written in Go. The StatsRelay daemon has been benchmarked at handling 200,000 UDP packets per second. It batches the metrics it receives into larger UDP packets before sending them off to the Statsd backends. Because string processing is cheaper than system calls, this further increases the number of metrics that each Statsd daemon is able to handle.
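The batching step described above might look something like the following sketch. It is an assumption-laden illustration rather than StatsRelay's implementation: the `batch` function and the `maxPayload` value are hypothetical, and the only protocol detail relied on is that Statsd accepts several metrics in one packet when they are separated by newlines.

```go
package main

import (
	"bytes"
	"fmt"
)

// maxPayload is an assumed payload limit picked to stay under a typical
// Ethernet MTU; tune it for your own network.
const maxPayload = 1432

// batch joins metric lines with newlines into payloads no larger than limit.
// A single line that is longer than limit is emitted as its own payload.
func batch(lines []string, limit int) [][]byte {
	var out [][]byte
	var buf bytes.Buffer
	for _, l := range lines {
		// Flush first if appending this line (plus a separator) would overflow.
		if buf.Len() > 0 && buf.Len()+1+len(l) > limit {
			out = append(out, append([]byte(nil), buf.Bytes()...))
			buf.Reset()
		}
		if buf.Len() > 0 {
			buf.WriteByte('\n')
		}
		buf.WriteString(l)
	}
	if buf.Len() > 0 {
		out = append(out, append([]byte(nil), buf.Bytes()...))
	}
	return out
}

func main() {
	lines := []string{"web.requests:1|c", "web.errors:1|c", "db.latency:12|ms"}
	for _, payload := range batch(lines, maxPayload) {
		// Each payload would be sent with a single UDP write to one backend.
		fmt.Printf("%d bytes: %q\n", len(payload), payload)
	}
}
```

Each payload costs one send system call instead of one per metric, which is where the saving comes from.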
When shouldn't you use StatsRelay?
In many cases you might want to run Statsd on each client machine and let it aggregate and report metrics to Graphite from that point. If each client only produces unique metric names, this is the approach you should use. This doesn't work, however, when you have multiple machines that need to increment the same counter, for example.
What's wrong with Statsd?
Etsy's Statsd tool is really quite excellent. It's written in NodeJS which, event-driven though it may be, is not what I would call fast. The daemon is a single process, which only scales so far. Testing showed that the daemon would drop packets as it approached 40,000 packets per second, because it would peg the CPU core it ran on at 100%. I needed a solution for an order of magnitude more traffic.
But, Hey! Statsd comes with a proxy tool!
New versions of Etsy's Statsd distribution do come with a NodeJS proxy implementation that does much the same thing. Similar to the Statsd daemon, the code in single-process mode would top out at around 40,000 packets per second and 100% CPU. Testing showed that the underlying Statsd daemons were not getting all of that traffic either.
I checked back on this proxy after it had been developed further and found that it had a forkCount configuration parameter and what looked like a good start at a multi-process mode. I tested it again with my Statsd load generator, which produced about 175,000 packets per second, well within the packet rate I needed to support in production. With forkCount set to 4, I found 4 processes, each consuming 200% CPU and 2G of memory. The code was still dropping packets.
At about 175,000 packets per second this Go implementation uses about 10M of memory and about 60% CPU. No packets lost.
Fork the StatsRelay repository and submit a pull request on GitHub.
Things that need work:
- Add health checking of the underlying Statsd daemons
- Profile and tune for speed and packet throughput