On the morning of Sunday 19th January, several of our customers reported that events had disappeared from their Tito accounts. Independently, one of our engineers reported an issue with a routine script used to gather customer usage metrics.
The incident affected just over 7% of Tito accounts. Ultimately there was no loss of data, but during the incident rows were removed (and later restored) from a table containing information about those accounts, which caused critical downtime for the affected accounts.
The incident began at 10:25 GMT and was fully resolved by 14:32, just over four hours of downtime.
How did this happen?
We are going through a transition period here at Tito, moving from a startup mentality to a more process-driven organisation. In addition to business processes, we’ve been working directly with AWS and two third-party firms to get up to speed with the AWS Well-Architected Framework.
Until the above is complete, we have put in place a hiring freeze. Four engineers on our team have access to our production database, all of whom have a lot of experience both as software developers and with the Tito infrastructure and codebase. Sunday’s incident was the result of a low-risk operation that ought to have been read-only but, because of a bug, deleted rows from the database rather than from an in-memory array.
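To illustrate the failure mode (this is a hypothetical sketch, not Tito's actual script; the table, column, and values are invented, with SQLite standing in for our production database), the danger is that the same "delete this record" intent is harmless on an in-memory list but destructive once it reaches the database:

```python
import sqlite3

# Hypothetical reconstruction of the failure mode; schema and data are
# invented for illustration only.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE accounts (id INTEGER PRIMARY KEY, name TEXT)")
conn.executemany("INSERT INTO accounts (name) VALUES (?)",
                 [("alpha",), ("beta",), ("gamma",)])

# Intended behaviour: drop a record from a local, in-memory list only.
names = [row[0] for row in conn.execute("SELECT name FROM accounts ORDER BY id")]
names.remove("beta")  # safe: the table still holds all three rows

# The buggy path: the delete is issued against the table itself,
# so a row really disappears from the database.
conn.execute("DELETE FROM accounts WHERE name = ?", ("beta",))
remaining = conn.execute("SELECT COUNT(*) FROM accounts").fetchone()[0]
print(remaining)  # 2: a database row is gone, not just a list entry
```

The two operations read almost identically in the code, which is exactly why a low-risk, "read-only" script can do real damage.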
As a company, we’re aware of our need to operationalise much of how we go about our day to day work, and as outlined above, we’ve taken the steps towards doing that. Incidents like the one on Sunday are the unfortunate result of those processes not yet being in place.
What have we done?
As a company, we operate a no-blame culture, and we look to leadership to take full responsibility for the actions of our teams. In this case, I put too much faith in our ad-hoc processes, and I did not press hard enough for putting operational processes in place. Running the script against a database copy, or against a read-only replica of our database, for example, would have mitigated the risk.
Internally, we have drafted a full incident report of exactly what happened. We have a timeline of events, and have documented the lessons learned. We have identified the local cause and the root cause of the incident.
As part of our work on operationalising as a company, we have already identified strategies to mitigate the above kinds of risks.
Thankfully, the systems that we do already have in place, such as regular, automated backups and a simple restore process, meant that we were able to resolve this incident with no data loss.
What will we be doing in future?
We will continue our work towards operationalising our approach to building and operating our systems. From the lessons learned, as a team, we will identify the steps required to meet established standards such as the AWS Well-Architected Framework, to increase confidence both internally and for our customers. We will ensure that routine scripts like this are not run against live data, even when the risk is low, and we will apply policies similar to the ones we already have for deploying code to routine metrics tasks like this.
I’m deeply sorry that we didn’t have processes in place to prevent an incident like this. We got lucky that it was on a Sunday morning, but that’s no excuse. We are committed to operational excellence. This wasn’t good enough, and we’ll be putting systems in place to ensure it doesn’t happen again.