Start Time: April 28, 2021 at 12:37 UTC
End Time: May 3, 2021 at 00:54 UTC
What happened:
Newly submitted log lines from all customers were significantly delayed before becoming available in our WebUI for searching, graphing, and timelines. Alerting, Live Tail, and the uploading of archives to their destinations were significantly delayed as well. The incident was opened on April 28 at 12:37 UTC.
Typical mitigation steps were taken but were unsuccessful. Live Tail and alerting, which were also significantly degraded, were halted about 14 hours after the start of the incident. This step was taken to keep other services, such as Search, functioning and to free up resources for processing log lines. Logs submitted before the incident continued to be searchable.
By May 1 at 19:17 UTC, about 99% of newly submitted logs were again available in our WebUI at normal rates. Other essential services needed more time and manual intervention to recover. The incident was closed on May 3 at 00:54 UTC.
Ingestion of new log lines from clients continued normally throughout the incident.
Why it happened:
We deployed an updated version of our proprietary messaging bus / parsing pipeline. This version had been tested in staging and in multiple production regions beforehand and worked as expected. It ran normally in production for four days. The cumulative traffic to our service over those four days revealed a performance issue that affected the processing of new log lines: logs were processed, but at a very slow rate. We identified the cause of the slow performance as an upgrade to Node.js (version 14) that was part of the new version of our messaging bus.
How we fixed it:
Once the source of the failure had been identified, we reverted our messaging bus to its last stable version, which kept the processing delays from growing further. Our services still needed to work through the backlog of logs ingested up to that point, which required time, manual intervention, and additional resources. We temporarily increased the number of servers dedicated to processing logs by about 60%. We also halted Live Tail and alerting, which were degraded almost to the point of being non-functional.
Through the combination of these efforts, all logs were eventually processed and our service was again fully operational.
What we are doing to prevent it from happening again:
During the incident, we reverted our messaging bus to its previous version. The version in production today does not contain the upgrade to Node.js 14, which caused the performance degradation. We have removed Node.js 14 from any future upgrades until we have had time to carefully examine its performance issues.
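As an illustration only, the sketch below shows one way a service can pin its runtime so an unintended Node.js upgrade fails loudly at startup instead of silently degrading throughput. It is a hypothetical example, not our actual deployment tooling, and the pinned major version (12) is an assumption made for the sketch.

```typescript
// version-guard.ts -- hypothetical startup check (illustrative sketch only).
// Refuses to start the service if the Node.js runtime's major version
// differs from the pinned one, so a runtime upgrade that slipped into a
// build is caught at deploy time rather than discovered as slow processing.

// Assumed pin for this example; not the actual version used in production.
const EXPECTED_NODE_MAJOR = 12;

function assertPinnedNodeVersion(): void {
  // process.version looks like "v12.22.1"; extract the major component.
  const actualMajor = Number(process.version.slice(1).split(".")[0]);
  if (actualMajor !== EXPECTED_NODE_MAJOR) {
    console.error(
      `Refusing to start: Node.js ${process.version} detected, ` +
        `but major version ${EXPECTED_NODE_MAJOR} is pinned.`
    );
    process.exit(1);
  }
}

assertPinnedNodeVersion();
```

In practice, a pin like this would typically also be enforced in the build image or deployment configuration, with the startup guard acting only as a last line of defense.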