In my recent post about our current job openings, I linked to our status page, which showed near-100% uptime for the past 6 months. We had just weathered a bit of downtime due to a big sale.
The universe sent us another tricky situation this week, causing about an hour of spotty performance during a surge sale that, as it turned out, we should have been able to handle.
I’d like to talk a little bit about these recent downtime events, what caused them, and what we’re doing to prevent this kind of thing in the future.
A Bit of Background
The vast majority of the time, Tito isn’t an inordinately busy application. We serve several thousand organisers who sell a few thousand tickets every day. The problems occur when an event goes on sale at a given time, and lots of people hit the site at once.
This makes load hard to manage: unless we’re in direct contact with the organiser, we can’t anticipate one of these “spikes”, so we have to plan as though one could happen at any time.
Way back in the early days, we hadn’t done any load-testing, and we had a number of significant failures under what we would now consider moderate load. We knew we needed to do something, so we took the following steps: