Start Time: Friday, February 10, 2023 at 16:45 UTC
End Time: Friday, February 10, 2023 at 18:14 UTC
Duration: 89 minutes
Searches returned results slowly or not at all. No data was lost and ingestion was not halted.
Why it happened:
In a previous incident on February 6, 2023 (more details at https://status.mezmo.com/incidents/3yl9x1t7qcw5), two pods storing logs were temporarily removed from the pool of pods available for receiving and inserting new batches of logs into our data store. The pods continued to return results for previously processed logs. We took this step because the pods had fallen behind on their tasks, which we believe was a consequence of an ungraceful shutdown during the incident. We gave the pods several days to catch up on tasks and then made them available for insertion of new logs into the data store again, a change we expected to have no impact.
One of the pods immediately began integrity checks to confirm the same data existed on its local disk and on our S3 storage. As a side effect of the previous incident, the pod incorrectly determined that data was missing from the local disk and began sending HTTP requests to our S3 storage to locate the missing data. In fact, the data in question is designed to reside only on local disk and is never stored on S3.
As expected, the requests failed with 404 errors because the data was not on S3. Every new attempt to retrieve search results generated another request. The rate of requests was high enough to slow down all requests related to search results within the pod’s zone (one of three zones in total). This led to search results being returned slowly or not at all.
How we fixed it:
We removed the pod from the pool available for receiving and inserting new batches of logs into our data store. The pod continued to return results for previously processed logs.
What we are doing to prevent it from happening again:
We marked this pod to remain unavailable for new logs until all previously processed logs on the pod have passed their retention period, whose maximum is 30 days. At that time, the pod will be rebuilt and begin accepting newly submitted logs again.
We’ll fix the logic of our search engine so it doesn’t request data from S3 that is intentionally not stored there. This will prevent the widespread 404 errors that slowed down all searching, should a pod again incorrectly determine it is missing data from its local disk.
We have added monitoring and alerting to detect high search latency and increases in the average time to compact newly inserted logs.
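
To illustrate the search engine fix described above, here is a minimal sketch in Python of a lookup that never asks S3 for data that is intentionally kept only on local disk. It is not our actual implementation: the names Segment, local_only, locate_segment, on_local_disk, and fetch_from_s3 are hypothetical and exist only for this example.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass(frozen=True)
class Segment:
    """Hypothetical unit of stored log data."""
    segment_id: str
    local_only: bool  # True if the data is designed to live only on local disk


def locate_segment(
    segment: Segment,
    on_local_disk: Callable[[str], bool],
    fetch_from_s3: Callable[[str], Optional[bytes]],
) -> str:
    """Return where the segment was found: 'local', 's3', or 'missing'."""
    if on_local_disk(segment.segment_id):
        return "local"
    if segment.local_only:
        # This data is never stored remotely, so a remote lookup could only
        # fail with a 404. Report it as missing without sending a request.
        return "missing"
    # Only data that is expected to exist remotely triggers a request.
    if fetch_from_s3(segment.segment_id) is not None:
        return "s3"
    return "missing"
```

With a guard like this, a pod that wrongly believes local-only data is missing would report the problem without generating a remote request for every search.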