Madefire Press


Lies, Damned Lies, and Statistics

Posted by Ross in operations

The Value of Numbers

I’m pretty sure whoever coined the phrase Lies, Damned Lies, and Statistics wasn’t referring to “stats” as used in running production websites and web services.

When it comes to running large scale and highly available systems, stats are your best friend. When things are going well, collecting and graphing performance data can help you to monitor and plan for growth. Once problems arise, it can mean the difference between finding and immediately addressing the underlying issue, or randomly poking and praying.

Push is Your Friend

Push notifications drive engagement. We’re early into things here at Madefire, but we’ve already been sold on the value of push notifications. Our primary use is the announcement of new titles, and the nearly immediate bumps we get are exciting to watch and make us glad that we’re built to scale.

A push notification causes a large spike in hourly visits and continues on for over 24 hours.

Increasing Users Increases Requests

As would be expected, adding a substantial bump in concurrent users causes a corresponding rise in the number of requests we see in the API. We always make sure someone is around to watch things for a while after we send a notification, to ensure everything runs smoothly. What they watch is the subject of this post.

The large spike in the number of API requests following a push notification.

We make extensive use of Graphite and Statsd for collecting and viewing stats. We feel it’s essential for all of the data to live in a single place. You can never know ahead of time when you’ll need to see two pieces of data on the same graph in order to quickly (remember: the system is down or under-performing) find a non-obvious root cause. Graphite and Statsd do exactly this. The time elapsed between deciding to track something and having data on a graph is measured in minutes.
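
To make that concrete, here’s roughly what instrumenting a code path looks like with the Python Statsd client. This is just an illustrative sketch; the host, prefix, and metric names are made up, not our actual ones.

```python
# A minimal sketch using the Python "statsd" client (pip install statsd).
# The host, port, prefix, and metric names are hypothetical examples.
import time

from statsd import StatsClient

statsd = StatsClient(host='localhost', port=8125, prefix='madefire')

def handle_request():
    statsd.incr('api.requests')               # count every request
    with statsd.timer('api.response_time'):   # record how long the work took
        time.sleep(0.05)                      # stand-in for the real work

if __name__ == '__main__':
    handle_request()
```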

Tracking Numbers Across the Stack

We track things at several levels in order to find problems that span them, or at least problems where the cause and the effect live at different levels.

A hypothetical example:

  • An alarm is raised: the load-balancer just started taking twice as long to respond to requests…
  • Check the requests per second. Nothing out of the ordinary there; there’s a small bump, but that may be a result of clients timing out and retrying.
  • Look at the distribution of requests. Nothing out of the ordinary there either.
  • What about the database? Oh, there’s a spike in load at exactly the same time the response time jumped up. That’s looking promising.
  • Since we know the API is serving a normal amount and distribution of requests, we’ll assume the DB is too for now. It may not be, and we’ll need graphs for that too, but it’s less likely.
  • Let’s look at the DB host’s detailed numbers. IO_milliseconds is way up; maybe a drive is acting up.
  • Look closer, log on to the box, promote a slave to be the new master.

In this made-up scenario we started at the load-balancer and worked our way down to slow IO on a DB’s drive. We began with service-wide stats from the load-balancer (ELB in our case), which we monitor with a script that polls CloudWatch data and passes it along to Statsd using the Statsd client library. The next step took us to the distribution of requests; for that we make use of the Django-Statsd app. After that we looked at system-level metrics, which Diamond has done a great job of collecting for us, though in our case we actually make use of RDS, so we wouldn’t be seeing disk/IO stats for the DB.
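
For the curious, the CloudWatch-to-Statsd polling script might look something like the sketch below. This isn’t our production script; it uses boto3 rather than the client we actually run, and the ELB name and metric names are hypothetical.

```python
# Rough sketch: poll ELB latency from CloudWatch and forward it to Statsd.
# Uses boto3 and the Python statsd client; the ELB name, prefix, and metric
# names are hypothetical, and our real script differs.
from datetime import datetime, timedelta

import boto3
from statsd import StatsClient

cloudwatch = boto3.client('cloudwatch', region_name='us-east-1')
statsd = StatsClient('localhost', 8125, prefix='aws.elb')

def poll_latency(elb_name='my-elb'):
    end = datetime.utcnow()
    points = cloudwatch.get_metric_statistics(
        Namespace='AWS/ELB',
        MetricName='Latency',
        Dimensions=[{'Name': 'LoadBalancerName', 'Value': elb_name}],
        StartTime=end - timedelta(minutes=5),
        EndTime=end,
        Period=60,
        Statistics=['Average'],
    )['Datapoints']
    for point in sorted(points, key=lambda p: p['Timestamp']):
        # CloudWatch reports ELB latency in seconds; Statsd timers expect ms.
        statsd.timing('latency', point['Average'] * 1000)

if __name__ == '__main__':
    poll_latency()
```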

That’s just one pass through our stats. We collect a decent set of information in our backend code that isn’t directly related to requests, and we’ve written and contributed back a Diamond plugin to monitor Memcached instances. Our stats DB records each time code is deployed, a box is added to or removed from our systems, or any number of other operational actions. It’s important to be able to trace problems back to human actions. We even have a monitoring script that checks for AWS outage events via the regions API and throws that data in the mix.
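
To give a sense of why adding collectors is cheap, this is the rough shape of a Diamond collector. It’s a simplified sketch, not our actual Memcached plugin, and the stat values are placeholders.

```python
# Simplified sketch of a Diamond collector; not our actual Memcached plugin.
# Diamond calls collect() on an interval, and publish() ships each value off
# to the configured handler (Graphite/Statsd in our case).
import diamond.collector


class ExampleMemcachedCollector(diamond.collector.Collector):

    def collect(self):
        for name, value in self.fetch_stats().items():
            self.publish('memcached.%s' % name, value)

    def fetch_stats(self):
        # Placeholder: a real collector would issue memcached's "stats"
        # command over its TCP port and parse the response.
        return {'curr_connections': 10, 'get_hits': 12345}
```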

All of this is to say that you never know what’s going to be useful to track until you need it. Your best bet is to have a system where it’s so easy to track things that you just do it for most everything you can think of; that way the data will be waiting for you when you need it.

Information Overload

All of the information in the world isn’t going to help you track down a problem at four in the morning if it’s not organized. In fact, being bombarded with numbers or having to dig through things then is only going to hurt.

That’s where dashboards come in. As you get to know your setup, you should build out collections of graphs that give you an overview of the system as a whole and let you quickly drill down into an area where you need more information. We start with a main dashboard with a set of graphs that let you look at the overall health of things. From there we have a set of focused dashboards that show more detailed information about pieces of the system, or host-level metrics for groups of boxes.

A view of several graphs on one of our dashboards.

A one-hour window of one of our host-level data dashboards. These are our miscellaneous hosts. They do odd jobs: gateway server; log collection; background workers; and the stats host itself.

That’s Not Normal

The final piece of the process is knowing what things should look like. Once you’ve built your dashboards, this part couldn’t be easier. You just need to look at them all the time. Then you’ll know what to expect things to look like and be able to quickly spot when something is out of the ordinary. It may not always be THE problem, but it’ll almost always point you in the right direction.

iPad Video Wall for Comic-Con

Posted by Ross in hacking

Being an effective engineering team is often about making the right trade-offs. When our CEO proposed a 21 iPad video wall for Comic-Con San Diego, everyone was excited at the ridiculousness of it. The initial excitement was tempered a bit by the timeline we were working with: with less than two weeks before the convention and plenty of “normal” work to do, we’d have to go the quick-and-dirty route. Once the wall was at the convention, the result was going to be on display for thousands of people, ten hours a day, for three or more days. Flaky and unreliable was not an option.


18 iPads fresh from the Apple Store. 3 more would join the next day. We’re setting up the iPads, assembly line style. We configured the first, backed it up to iCloud, and restored that to the rest. There was still a bit of manual configuration to do, but it saved a lot of work.

“Real” solutions would involve tightly synchronized clocks or timecode. NTP is made for that purpose, but in the walled garden of iOS apps it wouldn’t be that simple. We’d have to maintain a clock/time in our app that synchronized to a central server frequently and then instruct the app to play a particular video at a specific time. Another option was to provide a syncing signal via the headphone jacks of the devices, but that involved hardware, something we didn’t really have the time to try. In the end, we decided on a simpler approach.

Our initial implementation used a multi-threaded server which connected to each client. The manager had a list of movies to play and the duration of each. It would loop through the videos, quickly send out commands telling the clients to start playing, wait for the video to finish, and then tell the clients what to play next. If the connections were quick and reliable enough, it should work. Thankfully the human vision system is rather forgiving.

Once the iPads were configured we pushed our proof of concept app to all of them and started it up.

The truth is that it worked pretty well. The one change we ended up needing to make was to allow for the time it takes an iPad to load the video before playing. That was addressed by making it a two-step process: we first tell the clients to prepare and then, after waiting a sufficient amount of time, tell them to play the video. It’s not perfect. We’ll see a bit of skew on an iPad every once in a while, but you generally have to be looking for it and even then it’s not objectionable.
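
Stripped of error handling, the heart of the manager boils down to a loop like the sketch below. The command strings, playlist, and timings here are made-up examples rather than the app’s real protocol.

```python
# Stripped-down sketch of the manager's play loop; the command strings,
# timings, and playlist are made-up examples, not the real app's protocol.
import socket
import time

PLAYLIST = [('intro.mp4', 30.0), ('feature.mp4', 120.0)]  # (name, seconds)
PREPARE_WAIT = 2.0  # give every iPad time to load the video before "play"

def send_to_all(clients, message):
    for client in clients:
        client.sendall(message.encode('utf-8'))

def run(clients):
    while True:
        for name, duration in PLAYLIST:
            send_to_all(clients, 'prepare %s\n' % name)  # step 1: load it
            time.sleep(PREPARE_WAIT)
            send_to_all(clients, 'play\n')               # step 2: all together
            time.sleep(duration)

if __name__ == '__main__':
    # Hypothetical setup: wait for all 21 iPads to connect, then start looping.
    server = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    server.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
    server.bind(('', 9999))
    server.listen(21)
    clients = [server.accept()[0] for _ in range(21)]
    run(clients)
```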

Our first content was a set of individually crafted videos with a counter on each screen so that we could detect drift. That’s not very interesting, so next we needed to cut up some video content to fit the wall. To do this we coded up a Python script that uses avconv/ffmpeg to crop out a version of the source video for each iPad to play.
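
A cut-down sketch of that script is below. The grid size, tile dimensions, and file names are illustrative, and the real script also has to account for the bezels described next.

```python
# Cut-down sketch: crop one tile of the source video per iPad using ffmpeg's
# crop filter (avconv takes the same arguments). The 7x3 grid and tile size
# are illustrative; the real script also accounts for the physical bezels.
import subprocess

COLS, ROWS = 7, 3
TILE_W, TILE_H = 1024, 768  # pixels of source video given to each iPad

def slice_video(source):
    for row in range(ROWS):
        for col in range(COLS):
            x, y = col * TILE_W, row * TILE_H
            out = 'tile_%d_%d.mp4' % (row, col)
            subprocess.check_call([
                'ffmpeg', '-i', source,
                '-filter:v', 'crop=%d:%d:%d:%d' % (TILE_W, TILE_H, x, y),
                out,
            ])

if __name__ == '__main__':
    slice_video('source.mp4')
```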

There was a lot of drawing and annotating pictures involved with getting the slicing right. We went back and forth quite a bit trying to decide whether we should fit all of the source pixels in to the output videos or allow for the borders by dropping the pixels that fall behind them. The iPad has a huge surround, over 20% of the size of the screen itself. In the end we decided to go with dropping the pixels as there was too much distortion of the content when there’s a physical jump of 1¾” between rows.

At this point we had a working iPad video wall, but it wasn’t particularly fault tolerant, at least not enough to stand up to 10 hour days in a crowded (in terms of RF traffic) convention hall. The lack of time took away a lot of our options here. We thought about trying to connect the iPads over USB, maybe even using MIDI, but that route was a non-starter: USB hubs that can power multiple iPads are hard to come by, so we’d lose the ability to charge the iPads while playing.

Once wired connections were ruled out, we were left with the option of making the wireless connections as robust as possible. Stable is one thing, but they will never be fool-proof. To handle hiccups we spent some time making all of the code more robust. If an iPad drops out, it will immediately start trying to reconnect so that it’ll be ready for the next round. There were several potential failure cases and we had to look closely at each of them to ensure the iPad would retry as expected without beating the server to death. It’s not perfect, but good enough for a couple days of hack-a-thon style work.
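
The real clients are the iPads themselves, but the reconnect logic boils down to something like this sketch (in Python for brevity, with made-up delays and limits).

```python
# Sketch of the client-side reconnect behavior (the real client is the iOS
# app; this is Python for brevity). The delays and cap are made-up values,
# chosen so a dropped iPad retries quickly without hammering the server.
import socket
import time

def connect_with_backoff(host='manager.local', port=9999, max_delay=30.0):
    delay = 0.5
    while True:
        try:
            conn = socket.create_connection((host, port), timeout=5)
            return conn                      # ready for the next round
        except OSError:
            time.sleep(delay)                # back off so we don't pile on
            delay = min(delay * 2, max_delay)
```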

A test run of the 7×3 setup on a table in our office.

We’re pretty proud of the results. If you’re at Comic-Con, stop by our booth, #4902, and check it out. We have some great signings scheduled and you’ll be able to check out our app on our demo iPads. Otherwise, we plan to open source the results once we’ve recovered from the event and had a little bit of time to clean up the code. Once we do, the iPad video wall code will live on GitHub.

Footage of the wall in our booth at Comic-Con.
