Lies, Damned Lies, and Statistics
The Value of Numbers
I’m pretty sure whoever coined the phrase Lies, Damned Lies, and Statistics wasn’t referring to “stats” as used in running production websites and web services.
When it comes to running large-scale, highly available systems, stats are your best friend. When things are going well, collecting and graphing performance data helps you monitor and plan for growth. Once problems arise, it can mean the difference between finding and immediately addressing the underlying issue and randomly poking and praying.
Push is Your Friend
Push notifications drive engagement. We’re early into things here at Madefire, but we’re already sold on the value of push notifications. Our primary use is announcing new titles, and the nearly immediate bumps we get are exciting to watch and make us glad that we’re built to scale.
Increasing Users Increases Requests
As would be expected, adding a substantial bump in concurrent users causes a corresponding rise in the number of requests we see in the API. We always make sure someone is around to watch things for a while after we send a notification, to ensure everything runs smoothly. What they watch is the subject of this post.
We make extensive use of Graphite and Statsd for collecting and viewing stats. We feel it’s essential for all of the data to live in a single place. You can never know ahead of time when you’ll need to see two pieces of data on the same graph in order to quickly (remember: the system is down or under-performing) find a non-obvious root cause. Graphite and Statsd do exactly this. The time elapsed between deciding to track something and having data on a graph is measured in minutes.
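To give a sense of why that turnaround is so short, here is a minimal sketch of emitting a metric using the plain-text Statsd line format over UDP. The host, port default, and metric names are assumptions for illustration, not our actual configuration, and in practice a Statsd client library wraps all of this for you.

```python
import socket
import time


def format_metric(name, value, kind):
    """Build one Statsd line, e.g. 'api.requests:1|c' for a counter."""
    return f"{name}:{value}|{kind}"


def send_metric(line, host="127.0.0.1", port=8125):
    """Fire-and-forget a single metric over UDP (host and port are assumed)."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    try:
        sock.sendto(line.encode("ascii"), (host, port))
    finally:
        sock.close()


def timed_call(name, func, *args, **kwargs):
    """Run func, then report its duration in milliseconds as a Statsd timer."""
    start = time.monotonic()
    try:
        return func(*args, **kwargs)
    finally:
        elapsed_ms = int((time.monotonic() - start) * 1000)
        send_metric(format_metric(name, elapsed_ms, "ms"))


if __name__ == "__main__":
    # Hypothetical metric names: count a request and time a unit of work.
    send_metric(format_metric("api.requests", 1, "c"))
    timed_call("api.lookup_time", sum, range(1000))
```

Because the transport is UDP, a missing or overloaded stats server does not slow down the code being measured, which is part of what makes it cheap to instrument nearly everything.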
Tracking Numbers Across the Stack
We track things at several levels of the stack so that we can find problems that span them, or at least problems whose cause sits at one level and whose effect shows up at another.
A hypothetical example:
- An alarm is raised: the load-balancer just started taking twice as long to respond to requests…
- Check the requests per second. Nothing out of the ordinary there. There’s a small bump, but that may be a result of clients timing out and retrying.
- Look at the distribution of requests. Nothing out of the ordinary there either.
- What about the database? Oh, there’s a spike in load at exactly the same time the response time jumped up. That’s looking promising.
- Since we know the API is serving a normal amount and distribution of requests, we’ll assume the DB is too for now. It may not be, and we’ll need graphs for that too, but it’s less likely.
- Let’s look at the DB host’s detailed numbers. IO_milliseconds is way up, so maybe a drive is acting up.
- Look closer, log on to the box, and promote a slave to be the new master.
In this made-up scenario we started at the load-balancer and worked our way down to slow IO on a DB’s drive. We began with service-wide stats from the load-balancer, ELB in our case, which we monitor with a script that polls CloudWatch data and passes it along to Statsd using the Statsd client library. The next step took us to the distribution of requests, for which we use the Django-Statsd app. After that we looked at system-level metrics, which Diamond does a great job of collecting for us. In our case, though, we actually use RDS, so we wouldn’t be seeing disk IO stats for the DB.
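The CloudWatch polling step might look roughly like the sketch below. This is not our actual script: it assumes boto3, a classic ELB whose Latency metric is reported in seconds, and a hypothetical metric name `elb.latency_ms`. The helpers that turn datapoints into a Statsd gauge line are kept separate from the AWS call so they can be exercised without credentials.

```python
from datetime import datetime, timedelta, timezone


def latest_average(datapoints):
    """Return the Average of the newest datapoint, or None if there are none."""
    if not datapoints:
        return None
    newest = max(datapoints, key=lambda point: point["Timestamp"])
    return newest["Average"]


def elb_latency_gauge(datapoints, metric_name="elb.latency_ms"):
    """Turn ELB Latency datapoints (seconds) into a Statsd gauge line in ms."""
    seconds = latest_average(datapoints)
    if seconds is None:
        return None
    return f"{metric_name}:{round(seconds * 1000)}|g"


def fetch_elb_latency(load_balancer_name, minutes=5):
    """Fetch recent ELB latency datapoints. Needs boto3 and AWS credentials."""
    import boto3  # imported here so the helpers above work without it

    now = datetime.now(timezone.utc)
    response = boto3.client("cloudwatch").get_metric_statistics(
        Namespace="AWS/ELB",
        MetricName="Latency",
        Dimensions=[{"Name": "LoadBalancerName", "Value": load_balancer_name}],
        StartTime=now - timedelta(minutes=minutes),
        EndTime=now,
        Period=60,
        Statistics=["Average"],
    )
    return response["Datapoints"]
```

Run from cron or a small loop, the resulting line would be handed to whatever Statsd client is already in use, which puts load-balancer latency on the same graphs as everything else.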
That’s just one pass through our stats. We collect a decent set of information in our backend code that isn’t directly related to requests, and we’ve written and contributed back a Diamond plugin to monitor Memcached instances. Our stats DB records each time code is deployed, a box is added to or removed from our systems, or any number of other operational actions. It’s important to be able to trace problems back to human actions. We even have a monitoring script that checks for AWS outage events via the regions API and throws that data into the mix.
All of this is to say that you never know what is going to be useful to track until you need it. Your best bet is to have a system where it’s so easy to track things that you just do it with most everything you can think of. That way it’ll be waiting for you when you need it.
All of the information in the world isn’t going to help you track down a problem at four in the morning if it’s not organized. In fact, being bombarded with numbers or having to dig through things then is only going to hurt.
That’s where dashboards come in. As you get to know your setup, you should build out collections of graphs that give you an overview of the system as a whole and let you quickly drill down to an area where you need more information. We start with a main dashboard with a set of graphs that let you look at the overall health of things. From there we have a set of focused dashboards that show you more detailed information about pieces of the system or host-level metrics for groups of boxes.
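As an illustration of what a single dashboard graph boils down to, here is a small sketch that builds a Graphite render API URL plotting several series together. The Graphite host and every metric path are made up for the example. `drawAsInfinite` is a Graphite function that draws non-zero values as vertical lines, which is one way deploy events could be overlaid on a latency graph.

```python
from urllib.parse import urlencode


def render_url(base_url, targets, window="-1h", title=None, fmt="png"):
    """Build a Graphite render URL that plots several targets on one graph."""
    params = [("target", target) for target in targets]
    params.append(("from", window))
    params.append(("format", fmt))
    if title:
        params.append(("title", title))
    return f"{base_url.rstrip('/')}/render?{urlencode(params)}"


if __name__ == "__main__":
    # Hypothetical host and metric paths: API latency, DB load, deploy markers.
    print(render_url(
        "https://graphite.example.com",
        [
            "stats.timers.api.response_time.mean",
            "servers.db1.loadavg.01",
            "drawAsInfinite(stats.events.deploy)",
        ],
        title="API latency vs DB load",
    ))
```

A dashboard is then little more than a saved list of such URLs, grouped by the part of the system they describe.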
That’s Not Normal
The final piece of the process is knowing what things should look like. Once you’ve built your dashboards, this part couldn’t be easier. You just need to look at them all the time. Then you’ll know what to expect things to look like and be able to quickly spot when something is out of the ordinary. It may not always be THE problem, but it’ll almost always point you in the right direction.