On Our Recent Downtime
In my recent post about our current job openings, I linked to our status page, which showed our near-100% uptime for the past 6 months. We had just weathered a bit of downtime due to a big sale.
The universe sent us another tricky situation this week, causing about an hour of spotty performance during a surge sale that, as it turned out, we should have been able to handle.
I’d like to talk a little bit about these recent downtime events, what caused them, and what we’re doing to prevent this kind of thing in the future.
A Bit of Background
The vast majority of the time, Tito isn’t an inordinately busy application. We serve several thousand organisers who sell a few thousand tickets every day. The problems occur when an event goes on sale at a given time, and lots of people hit the site at once.
This makes it hard to manage: unless we’re in direct contact with the organiser, we can’t anticipate one of these “spikes”, so we have to assume they could happen at any given time.
Way back in the early days, we hadn’t done any load-testing, and there were a number of significant failures to handle what we would now consider moderate load. We knew we needed to do something, so we took the following steps:
1) We created a number of automated load-testing scripts simulating actual sales (there’s a simplified sketch of this kind of script just after this list).
2) We identified and rectified a number of obvious places our code was putting undue pressure on our database.
3) We migrated to an infrastructure setup that was simple, robust and distributed, allowing us to scale out most of the pieces on demand.
4) We run at a capacity that far, far exceeds our day-to-day needs. This gives us both redundancy in the event of failure and the ability to handle most spikes as part of day-to-day operation, without knowing about them in advance.
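To give a flavour of what a load-testing script like this can look like, here’s a simplified sketch in Python. The URL, concurrency numbers and structure are illustrative placeholders, not our actual tooling:

```python
# A minimal sketch of a load test simulating a ticket-sale spike.
# The URL and concurrency numbers are placeholders, not real settings.
import time
import urllib.request
from concurrent.futures import ThreadPoolExecutor

EVENT_URL = "https://example.com/events/big-conference"  # hypothetical event page
CONCURRENT_BUYERS = 200   # simulated simultaneous visitors
REQUESTS_PER_BUYER = 5    # page loads per visitor during the sale

def simulate_buyer(buyer_id: int) -> float:
    """Hit the event page repeatedly and return the total time spent waiting."""
    waited = 0.0
    for _ in range(REQUESTS_PER_BUYER):
        start = time.monotonic()
        try:
            with urllib.request.urlopen(EVENT_URL, timeout=10) as resp:
                resp.read()
        except Exception:
            pass  # errors would be counted in a real report
        waited += time.monotonic() - start
    return waited

if __name__ == "__main__":
    with ThreadPoolExecutor(max_workers=CONCURRENT_BUYERS) as pool:
        timings = list(pool.map(simulate_buyer, range(CONCURRENT_BUYERS)))
    print(f"mean wait per simulated buyer: {sum(timings) / len(timings):.2f}s")
```

Running something like this against a staging environment, with the buyer count dialled up until things break, is what lets you find the weak points before a real on-sale does.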
Since we implemented the above, incidents have been few and far between, and before this month, we’d almost forgotten what it was like.
This Month’s Incidents
The incident earlier this month, on the 2nd of November, was a simple case of not being prepared. We had scheduled additional servers to come online to handle the load, but unfortunately there was a miscommunication about when that load was expected. During the peak, we identified a few places where our caching was missing.
We have since made a change that extends the reach of our caching in the places we identified. This will take some pressure off if another incident like this comes along.
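To illustrate the general idea (not our exact implementation), this is the shape of a read-through cache that shields the database during a spike. The function names and TTL are made up for the example:

```python
# A minimal read-through cache sketch. Names (render_event_page, the key
# format, the TTL) are illustrative, not actual application code.
import time

_cache = {}        # key -> (stored_at, value)
TTL_SECONDS = 30   # short TTL: slightly stale beats unavailable during a spike

def cached(key, ttl, compute):
    """Return a cached value for `key`, recomputing it at most once per TTL."""
    now = time.monotonic()
    hit = _cache.get(key)
    if hit and now - hit[0] < ttl:
        return hit[1]
    value = compute()
    _cache[key] = (now, value)
    return value

def render_event_page(event_slug):
    # The expensive work (ticket list, availability counts) only touches the
    # database once per TTL window, no matter how many visitors arrive.
    return cached(f"event-page:{event_slug}", TTL_SECONDS,
                  lambda: f"<html>… tickets for {event_slug} …</html>")
```

The point of “extending the reach” of caching is simply to move more of the work that happens on every page view behind a layer like this, so a surge of visitors doesn’t translate directly into a surge of database queries.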
This week, on the 13th of November, we had another, similar incident. A surge of expected traffic hit the site, but the servers responded as though the load were far greater. It turned out that the customer had been trialling one of our new, in-development features. Even though they didn’t end up using the feature, it was still enabled for their event, and it caused a knock-on effect.
We hadn’t done any performance testing of that feature, and it turned out that it led to a non-optimised query in our database. The particular event also had quite a number of tickets on its event homepage, which shouldn’t have been a problem, but it triggered the slow query a lot. As we’ve learned, a slow query combined with even a modest load is death to a database.
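As a simplified illustration of that failure mode, using SQLite and a made-up schema rather than our actual database, here’s how one missing index turns a per-ticket check into a repeated full table scan. Repeat that scan once per ticket, multiplied by every visitor in a spike, and the database falls over:

```python
# Hypothetical schema for illustration only: one availability check per
# ticket, run on every page view. Without an index it's a table scan each time.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE tickets (id INTEGER PRIMARY KEY, event_id INTEGER, title TEXT);
    CREATE TABLE registrations (id INTEGER PRIMARY KEY, ticket_id INTEGER);
""")
conn.executemany("INSERT INTO tickets (event_id, title) VALUES (1, ?)",
                 [(f"Ticket {i}",) for i in range(30)])  # a ticket-heavy event

# Unindexed: every availability check scans the whole registrations table.
plan = conn.execute(
    "EXPLAIN QUERY PLAN SELECT COUNT(*) FROM registrations WHERE ticket_id = ?",
    (1,)).fetchall()
print("before index:", plan)   # plan shows a SCAN of registrations

# One index turns the per-ticket check into a cheap lookup.
conn.execute("CREATE INDEX idx_registrations_ticket ON registrations (ticket_id)")
plan = conn.execute(
    "EXPLAIN QUERY PLAN SELECT COUNT(*) FROM registrations WHERE ticket_id = ?",
    (1,)).fetchall()
print("after index:", plan)    # plan shows a SEARCH using the index
```

The fix for a query like this is usually small once you’ve found it; the hard part is that an untested feature can introduce one quietly, and it only shows its teeth under spike conditions.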
The Future
Building Tito has always been a frustratingly tricky balancing act between keeping existing customers happy, building new features, and making sure that new features don’t impact our ability to handle a spike in traffic. Or survival, as some might call it.
As we grow, load-testing will become more and more important for us. Our goal will be to get to a point where, on any given day, our default infrastructure will handle any of the biggest spikes we’ve seen.
In the near term, we’ve put in place the fix to extend the caching mentioned above, and we’re working on an iterative release of our checkout that contains a number of optimisations over our current site.
We take these incidents really, really seriously, and we are confident that we’ll get to a position where we can handle the vast majority of potential busy events that our customers can throw at us.