Start Time: Thursday, June 2, 2022, at 20:25 UTC
End Time: Thursday, June 2, 2022, at 20:50 UTC
Log ingestion was halted for about 25 minutes. Logs submitted during that time were never ingested and were therefore unavailable for Alerting, Searching, Live Tail, Graphing, Timelines, and Archiving.
Why it happened:
We manually reverted our ingester service to an older version (to solve a minor problem unrelated to this incident). During the procedure, the version of the container was reverted, but not the container’s configuration. Because of this version mismatch, a downstream service (the “buzzsaw broker”) stopped accepting logs from the ingester. The ingester is currently not designed to confirm that logs are accepted by downstream services, so it continued to return HTTP 200 responses to our customers’ agents, indicating that logs had been successfully received. On receiving a 200 response, an agent discards its locally cached log files. Consequently, all log lines sent during the incident (about 25 minutes) were never ingested.
How we fixed it:
We reverted the container’s configuration correctly, so it matched the version of the container itself. Ingestion began working normally again.
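The mismatch described above could be caught by a check that runs before a rollback is allowed to go live. The following is only a minimal sketch: the `Deployment` type, its fields, and `check_rollback` are hypothetical names for illustration, and it assumes the container and its configuration each carry a comparable version string. It does not describe our actual deployment tooling.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Deployment:
    """Hypothetical description of a service rollout."""
    image_version: str   # version of the container image
    config_version: str  # version the configuration was written for


def check_rollback(deployment: Deployment) -> None:
    """Raise ValueError if the container and its configuration disagree."""
    if deployment.image_version != deployment.config_version:
        raise ValueError(
            f"container is at {deployment.image_version} but configuration "
            f"is at {deployment.config_version}; revert both together"
        )


# A rollback that reverts both parts passes silently.
check_rollback(Deployment(image_version="1.4.2", config_version="1.4.2"))
```

A rollback that reverts only the container, as happened in this incident, would raise an error instead of being deployed.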
What we are doing to prevent it from happening again:
We will review and update our runbooks for reverting services to earlier versions to prevent similar mistakes. We also plan to automate the reversion process.
We will add internal confirmations to the ingester so that it reports success only after downstream services have confirmed receipt of the log lines. This will prevent the ingester from sending erroneous 200 responses back to the agent when it is unable to pass log lines downstream.
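The planned behavior can be sketched as follows. This is an illustration under stated assumptions, not the ingester's real code: `handle_ingest` and the `forward` callback are hypothetical names, `forward` is assumed to return `True` only when the downstream service has accepted the batch, and the agent is assumed to keep its local cache on any non-2xx response.

```python
from typing import Callable, Sequence


def handle_ingest(
    lines: Sequence[str],
    forward: Callable[[Sequence[str]], bool],
) -> int:
    """Return the HTTP status to send to the agent for one batch of log lines.

    `forward` is a hypothetical callback that hands the batch to the
    downstream service and returns True only once it has been accepted.
    """
    try:
        accepted = forward(lines)
    except Exception:
        # Any failure talking to the downstream service counts as "not accepted".
        accepted = False
    # 200 only after downstream confirmation; otherwise 503 so the agent
    # keeps its cached lines and can retry later.
    return 200 if accepted else 503


def broker_rejects(batch: Sequence[str]) -> bool:
    """Stand-in for a downstream service that is refusing logs."""
    return False


status = handle_ingest(["line 1", "line 2"], broker_rejects)
```

With a downstream service that refuses the batch, `status` is 503 rather than 200, so the agent would not discard the lines.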