This morning at 5am, Tito became unavailable for all customers. We were made aware of this at around 8am, investigated, pushed an update and resolved the issue by 8:45am.
For a service that has enjoyed 100% uptime over the last six months (and 99.99% over the last year), this was very frustrating. We have had outages in the past that were outside our control, where our downtime was caused by a third-party service (which took many others down along with us), but this time the fault was a configuration issue on our end.
In 2014, one of our caching (Redis) servers went offline for maintenance and we didn’t have auto-failover. Since then, we have run a master-slave Redis setup with auto-failover via Amazon’s ElastiCache service.
For both their Redis and memcached offerings on ElastiCache, Amazon offer a stable DNS endpoint that is updated automatically if there is a failover event. Tito is set up to use that “Configuration Endpoint” for memcached, but the Redis setup looks slightly different. When it was set up originally, we assumed that the write master node’s own endpoint would stay the same in the event of a failover, so we pointed Tito’s Redis configuration directly at that node. Unfortunately, we failed to realise at the time that ElastiCache provides a “Primary Endpoint” for exactly this purpose: a DNS name that always resolves to the current write master.
At 5am this morning, ElastiCache initiated a failover event: the read slave was promoted to write master, and the old master was demoted and replaced with a new slave. Since Tito was still pointing at the old master, now a read-only slave, every write attempt failed, causing the downtime.
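The failover mechanics can be sketched with a toy model. To be clear, this is an illustration of the DNS behaviour only, not AWS’s actual API, and all node names are made up:

```python
class ToyReplicationGroup:
    """Toy model of an ElastiCache Redis replication group.

    The "Primary Endpoint" is a stable DNS name that always resolves to
    whichever node is currently the write master. Each node also has its
    own fixed endpoint, which keeps pointing at that node even after its
    role changes.
    """

    def __init__(self):
        self.roles = {"node-001": "master", "node-002": "slave"}

    def resolve_primary(self):
        # The Primary Endpoint's DNS record tracks the current master.
        return next(n for n, role in self.roles.items() if role == "master")

    def failover(self):
        # The slave is promoted to master; the old master becomes a slave.
        self.roles = {"node-001": "slave", "node-002": "master"}

    def write(self, node):
        # Redis rejects writes sent to a read-only slave.
        if self.roles[node] != "master":
            raise RuntimeError("READONLY You can't write against a read only replica.")
        return "OK"


group = ToyReplicationGroup()
pinned_node = group.resolve_primary()  # the old config: pinned to node-001

group.failover()

group.write(group.resolve_primary())   # via the Primary Endpoint: still works
# group.write(pinned_node)             # via the pinned node: raises the READONLY error
```

A client pinned to `node-001` keeps resolving to the same box after the failover, which is now read-only; a client using the Primary Endpoint follows the promotion automatically.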
Updating Tito to use the “Primary Endpoint” for Redis solved the issue.
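In client terms, the change is a one-line swap of the hostname. A minimal sketch using the redis-py client (the hostnames are illustrative, not Tito’s actual endpoints):

```python
import redis

# Before: pinned to an individual node's endpoint. After a failover this
# node can be demoted to a read-only slave, and every write then fails.
# r = redis.Redis(host="tito-cache-001.abc123.0001.use1.cache.amazonaws.com", port=6379)

# After: the replication group's Primary Endpoint. ElastiCache updates the
# DNS record behind this name during a failover, so the client always
# resolves to the current write master.
r = redis.Redis(host="tito-cache.abc123.ng.0001.use1.cache.amazonaws.com", port=6379)
```

No application code changes beyond the hostname; the client resolves the name on connect, so new connections after a failover land on the promoted master.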
What steps are we taking?
There were two main issues here.
1) The configuration issue itself
2) The three-hour response time
For the issue itself, Tito is now pointing to the correct endpoint, and should a failover event occur in the future, there would be minimal, if any, downtime.
For the second issue, we were in a way lucky that this happened at 5am, which is the quietest time of day for us and usually when we run any database-related patches or restarts.
As Tito grows, we will be better equipped to detect and address issues like these quickly enough that they go mostly unnoticed. In this case, three hours is disappointing, but hopefully acceptable given our six-month track record.
We’ve worked hard to ensure that Tito is a solid platform, and that when things do go wrong, we have planned and configured for it. Unfortunately, this time our configuration let us down.
If you have any questions about any of this, please feel free to send them to firstname.lastname@example.org.