Hey everyone,
First, we'd like to apologize for the service outage you may have experienced yesterday. The issue has since been fixed, but we'd like to give you some context and share the somewhat unexpected twists and turns we hit while investigating it.
Discovery
We began receiving alerts and customer reports in the early morning hours, which triggered our escalation procedure and woke up our colleagues lucky enough to be in the on-call rotation. At first glance, the web application was not allowing users to sign in. We rely on a small but distributed disk-backed cache to store and look up authentication information, including the data needed for signing in. It turned out that none of the Redis servers used by this cache were even running. They had been OOM killed, leaving pid files behind, which prevented automatic restarts.
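To illustrate why a leftover pid file can block a restart: a supervisor that trusts the file sees it and assumes the process is still alive. The sketch below shows one way to tell a stale pid file from a live one on a POSIX system. The function name and the idea of running this check before a restart are assumptions made for illustration, not a description of our tooling.

```python
import os


def pid_file_is_stale(path):
    """Return True if the pid file exists but does not name a live process."""
    try:
        with open(path) as f:
            pid = int(f.read().strip())
    except FileNotFoundError:
        return False  # no pid file at all, so nothing is stale
    except ValueError:
        return True  # empty or garbled contents: treat as stale

    if pid <= 0:
        return True  # not a valid process id

    try:
        os.kill(pid, 0)  # signal 0 only checks that the process exists
    except ProcessLookupError:
        return True  # no such process: the file was left behind
    except PermissionError:
        return False  # process exists but is owned by another user
    return False
```

A restart wrapper could call this check and remove the file when it returns True, rather than refusing to start.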
Investigation
After digging in, we realized that the Redis servers we use for managing authentication had died, leaving clients unable to access their account data. A tool used to track batches of logs was not appropriately sized to handle a traffic spike, and it overwhelmed Redis to the point of failure. Redis does try to recover from this, but bad files were left behind, causing the service to enter a reboot loop. These events caused logs and their metadata to back up in the processing queue to the point where they began to overwhelm the log processing nodes. When Redis fails, our applications fail over to MongoDB; however, the sheer number of authentication requests led to a cascading failure that took down MongoDB as well. The end result was that the application could not be accessed at all to retrieve logs.
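The failover step is where a single failure can turn into a cascade: every request that used to hit the cache lands on the fallback store at once. One generic way to limit that is to cap how many fallback calls may be in flight and reject the rest. The sketch below is only an illustration of that idea; the class name, the `ConnectionError` convention, and the limit are hypothetical and do not describe our implementation.

```python
import threading


class BoundedFallback:
    """Try the primary lookup first; on failure, use the fallback only
    while fewer than max_concurrent fallback calls are in flight."""

    def __init__(self, primary, fallback, max_concurrent):
        self._primary = primary
        self._fallback = fallback
        self._slots = threading.BoundedSemaphore(max_concurrent)

    def lookup(self, key):
        try:
            return self._primary(key)
        except ConnectionError:
            pass  # primary is unavailable; consider the fallback

        # Do not wait for a slot: shed the request instead of queueing
        # more work onto the fallback store.
        if not self._slots.acquire(blocking=False):
            raise RuntimeError("fallback at capacity, request rejected")
        try:
            return self._fallback(key)
        finally:
            self._slots.release()
```

Rejected requests fail fast and can be retried later, which keeps the fallback store serving the requests it has already accepted.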
Analysis
The analysis of the root cause showed that our batch tracking instrumentation was working correctly and as intended. At times our system does experience indexing delays, which are usually taken care of quickly, but in this unique edge case the increased traffic was just enough to push it over the edge. As a result of the analysis, we have identified the bottlenecks in the pipeline that were involved in this particular incident and have begun work on removing them.
Solution
We have begun work on removing the bottlenecks that caused this cascading failure, and we want to make sure that this never happens again. The initial fix was to resize specific portions of the cluster to handle a much higher load of traffic. Secondly, we are retooling the batch tracking system to reduce its direct impact on our Redis clusters and increasing the observability instrumentation around this feature set. We will also be beefing up our monitoring for Redis.
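As a rough idea of what a Redis memory check can look like, the sketch below parses the text returned by the Redis `INFO memory` command and compares `used_memory_rss` against `total_system_memory`, both of which are fields in that output. The function names and the 0.8 threshold are assumptions chosen for the example, not our actual alerting rules.

```python
def parse_info(text):
    """Parse 'key:value' lines from Redis INFO output into a dict."""
    fields = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and section headers
        key, _, value = line.partition(":")
        fields[key] = value
    return fields


def memory_pressure(info_text):
    """Fraction of system memory held by the Redis process (RSS)."""
    fields = parse_info(info_text)
    rss = int(fields["used_memory_rss"])
    total = int(fields["total_system_memory"])
    return rss / total


def should_alert(info_text, threshold=0.8):
    """True when Redis memory use reaches the (hypothetical) threshold."""
    return memory_pressure(info_text) >= threshold
```

Alerting on a ratio well below 1.0 leaves time to react before the kernel's OOM killer steps in.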
Reflection
More than anything else, we value making sure the customer experience is optimal. Although we work to keep outages rare, they do happen, and we have learned a valuable lesson from this one. As we scale up features, this reinforces our goal of deep instrumentation, tracing, and observability throughout our platform. We will continue to improve our product and process, and we are grateful for all the crucial feedback from our customers, which has helped make us who we are today.
Regards,
The LogDNA team