Dates:
Start Time: Thursday, December 12, 2024 at 19:24 UTC
End Time: Thursday, December 12, 2024 at 20:54 UTC
Duration: 1 hour and 30 minutes
What happened:
Log lines submitted to Mezmo for ingestion into Log Analysis were never made available in our WebUI for Searching, Graphing, and Timelines, neither during the incident nor afterwards. This affected a significant number of accounts. Log lines were still passed through Pipeline and remained available at all times within Live Tail. The log data still triggered both Telemetry Pipeline and Log Analysis-based alerts, and was still archived in both places.
Why it happened:
A single pod within our indexing service ran out of disk space. This was due to a sudden increase in the volume of log lines sent to Mezmo by accounts that happened to be assigned to this pod.
Our pods are configured to limit how much disk space is available for writing new data; these limits should have prevented any pod from running out of disk space in any scenario. After the incident, we discovered that the limits had been configured incorrectly, which is why this pod was able to run out of disk space.
Our service is designed to tolerate the loss of a single pod within its indexing service without any widespread impact on customers. In this incident, however, for reasons still under investigation, the impact spread to many other pods.
Our service is also designed to retain submitted log lines, even if the indexing portion of our service cannot process them immediately; these log lines can be indexed later, when the service is functional again. In this incident, however, the log lines were never indexed. The reason for this failure is also under investigation.
How we fixed it:
We restarted the pod that had run out of disk space. Once restarted, it immediately had enough free disk space to accept new log lines for indexing, and all other indexing pods returned to a normal operational state.
What we are doing to prevent it from happening again:
We have corrected the disk space limits on our pods so that they can no longer run out of disk space. This step alone should prevent a recurrence of the same problem.
We have updated our monitoring to send high-priority alerts when any indexing pod is in danger of running out of disk space. These alerts existed before the incident, but they were set to low priority and did not come to our attention in time to prevent it.
We will continue to actively investigate why the impact spread to other indexing pods and why log lines were not retained for later indexing.