Our adventure in metrics gathering is a bit long-winded for a story. I've divided it into sections so you can jump to the details you might want to see: Software, Glue, Hardware, Details, Headaches, Numbers, and Final Thoughts.
Please drop comments noting your performance numbers or to ask questions. We're happy to share what we can.
We started off with a single node, our new stats collection node, running collectd with the write_graphite plugin sending data to a carbon cache via loopback. We used the debian graphite-carbon package, and manually set up a graphite-web installation. The first data was localhost only metrics, and this kept us entertained for a few minutes.
I was afraid to install Collectd on all hosts and just walk away. So to quickly get host metrics I found the ganglia plugin for Collectd, which allows it to receive metrics from hosts running gmond and output them to Graphite. Using this plugin wasn't the easiest thing in the world, but I'll detail that seperately.
We moved on to Collectd installations on all of our hosts after a week or so of playing with Graphite and other bits. By default Collectd is much more discrete in its measurements of the system than Ganglia. For some examples: Ganglia has cpu aggregate metrics, but Collectd has per-cpu metrics, Ganglia has ethernet metrics, but Collectd has per-interface and per-protocol metrics.
Our Collectd setup is using the Collectd 'network' [sic] plugin for transmission of metrics to the central stats host, which has a Collectd instance running to write them to graphite-carbon. This is a simplification of our setup and more details are below.
At roughly the same time we started toying with Statsd, which handles aggregation of various data types. The first use of Statsd was web hit stats from Typepad. This included counts, byte sizes, and rendering time sliced by request method and response code.
One terminology change we made from the start was to call the 'ms' data type in statsd an 'aggregate'. It comes from being measurement of time in statsd, but the aggregation functions are useful for more data types. Someday a release of statsd will include this change as well.
We've moved on to collecting event counts, body sizes, execution times, and many other things that I can't even think of right now. Almost all of our products both internal and external have at least one metric being collected and analyzed right now, and the majority of these are handled through statsd for simplicity.
We added simple proxies at two layers in our stack for multiple reasons. Franck has shared these via our devops-tools repo.
First, there is a proxy running on the stats host between collectd and carbon. The purpose of this proxy is to adjust the keys of metrics to a more organized tree for us.
Second, there is a proxy running on every host. This proxy receives lines destined for statsd via loopback and does a few things. The base of the key is generated and automatically prepended to all metrics pushed through this proxy, which keeps that format in a common place. Counters and aggregates are multiplexed to both a host and an app level prefix, which allows us to analyze the metric per host or for the whole cluster easily and accurately (averaging averages is inaccurate). Gauges are not multiplexed due to their nature.
Examples of the keys are in the details section below.
Our stats machine is a basic server using no virtualization, 8 cores, 24GB ram I think. This is going to get upgraded, but not because of CPU or RAM saturation.
For storage, initially we had a 15k RPM SAS drive as our backing store for carbon, but this was quickly outgrown and replaced with an SSD. We ran out of space on the filesystem and started a RAID-0 of SSD for storage, which is still in use today.
Metric Tree Layout
Our graphite metric tree is layed out as a planned heirarchy to keep application and host metrics of related hosts and applications closer together in the tree, but still sorted to keep the number of nodes in one folder managable.
Some key examples (pun intended):
The first two are examples of collectd metric keys, and are formatted by the carbon proxy in the Glue section above.
The last two are collected by statsd and are formatted by both the proxies mentioned in the Glue section above.
We ran out of IO bandwidth first, very early on. Switching to an SSD was the fix for that.
Then we ran out of disk space near 70k metrics. Creating a RAID-0 of multiple SSD was our fix.
Finally, we ran out of CPU time above 200k metrics. We now run multiple copies of carbon-cache, load balanced by carbon-relay, which are in turn load balanced by the carbon proxy.
We've tested up to 300k metrics and our IO bandwidth starts to top out. In this scenario the system continues to work, but starts consuming memory to handle it. This is noted as a headache below.
Collectd's Ganglia Plugin
First off, the debian version of collectd has a ganglia plugin included that isn’t linked against libganglia properly. After a short period of knocking around I decided to force the dynamic linker to preload libganglia, which let me get this running quickly. This was done by adding the following to /etc/default/collectd:
export LD_PRELOAD # kick it harder till it starts up :D
Then I discovered that ganglia’s use of multicast means that in an unprepared network I can’t subscribe to multicast groups across routers without some serious config.
Long story short, I enabled IGMP and multicast routing on all our network gear for a few weeks while the ganglia shim was in use. A side effect was that our Mac Mini desktop in the datacenter was storming the network because of some broken multicast bonjour thing. This has been fixed in newer versions of OSX apparently, but I solved that problem by turning off the Mac.
The Case of the Disappearing Metrics
When you run out of IO bandwidth with carbon-cache, the daemon will start using an in-memory cache to help keep things going. I believe the intended scenario would have IO pressure holding writes back, which carbon would then try to sort so it can make fewer seeks happen by optimizing in user space.
Sadly, this wasn't working for us. The cache max size was probably set too high, which we adjusted down. This caused carbon to use its flow control feature to slow down the writes from collectd, which then started consuming memory instead.
At this point, but without deep testing. I belive the flow control option can become a dangerous thing in scenarios where you might prefer to drop metrics rather than exploding. We still have it turned on, but we haven't hit our limit again.
Our single stats machine is handling about 250k carbon metrics. The vast majority of which are updated every 10 seconds, which is our flush interval for both collectd and statsd.
The stats host is averaging about 15k IOPS/sec on our SSD RAID, with occasional spikes up to 25k and beyond.
The combined technology of Collectd, Statsd, and Graphite all come together in our environment to create a very nice toolset. We were able to tweak the way these systems interact by writing some simple data proxies. The fact that statsd and carbon both speak line protocol made this very simple.
Planning out the metric tree has helped to keep things organized, and queries smaller/simpler than they might otherwise be. We lessened the problem of adjusting the metric tree layout by hiding some of the path layout in common proxies, keeping it out of application code.
It took a little time to get our system to perform up to a level we were comfortable with, but it has been rock solid once we got there. Memory consumption on the stats host has been pretty much constant for weeks now.
Our proxies, apache log scraper, and a few other tools are already up in the saymedia/devops-tools.git repo. Go take a peek, we'll share more soon.
Next I'll cover tell a shorter story about the Riemann network monitoring system.