Start Time: Monday, November 22, 2021, at 19:01 UTC
End Time: Tuesday, November 23, 2021, at 02:04 UTC
Newly submitted logs were not immediately available for Alerting, Searching, Live Tail, Graphing, and Timelines. About 25% of accounts were affected more than the rest. For all accounts, the ingestion of logs was not interrupted and no data was lost.
Why it happened:
Upon investigation, we discovered that the service which parses all incoming log lines was working very slowly. This service is upstream of all our other services, such as alerting, live tail, archiving, and searching; consequently, all those services were also delayed.
We isolated the slow parsing to the specific content of certain log lines. These log lines exposed an inefficiency in our line parsing service which resulted in exponential growth in the time needed to parse those lines; this in turn created a bottleneck that delayed the parsing of other log lines. The inefficiency had been present for some time, but went undetected until one account started sending a large volume of these problematic lines.
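To make "exponential growth" concrete, here is a minimal sketch in Python. It is an illustration only: this report does not describe the actual parser or the specific inefficiency, so the function name and the toy problem (a backtracking search that tries every way to split a run of n characters into groups before giving up) are assumptions chosen to show the general effect.

```python
def count_backtracking_steps(n: int) -> int:
    """Count the steps a naive backtracking search takes to try every way
    of splitting n characters into non-empty groups when no split succeeds.

    Hypothetical toy model; not the real line parser.
    """
    steps = 0

    def try_from(pos: int) -> bool:
        nonlocal steps
        steps += 1
        if pos == n:
            # The whole run was consumed, but in this toy model the overall
            # match still fails, which forces the caller to backtrack.
            return False
        # Try every possible end position for the next group.
        for end in range(pos + 1, n + 1):
            if try_from(end):
                return True
        return False

    try_from(0)
    return steps


if __name__ == "__main__":
    for n in (5, 10, 15, 20):
        print(n, count_backtracking_steps(n))
```

In this toy model the step count doubles with every additional character (2**n steps for n characters), so a modest increase in line length, multiplied by a large volume of such lines, can be enough to stall a shared processing stage.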
How we fixed it:
The line parsing service was updated to use a new algorithm that avoids the worst-case behaviors of the original and also improves line parsing performance in general.
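The new algorithm itself is not described in this report. As a hedged illustration of how worst-case behavior can be avoided in general, the Python sketch below uses memoization, one common technique and not necessarily the one we adopted, on a hypothetical backtracking search that splits a run of n characters into groups. Remembering which positions have already failed means each position is explored at most once.

```python
def count_memoized_steps(n: int) -> int:
    """Count the positions explored by a backtracking search over n
    characters when positions known to fail are remembered.

    Hypothetical toy model; not the real line parser or its new algorithm.
    """
    steps = 0
    failed = set()  # positions already proven to lead to failure

    def try_from(pos: int) -> bool:
        nonlocal steps
        if pos in failed:
            # Already explored: skip instead of repeating the work.
            return False
        steps += 1
        if pos == n:
            # In this toy model the overall match always fails.
            failed.add(pos)
            return False
        for end in range(pos + 1, n + 1):
            if try_from(end):
                return True
        failed.add(pos)
        return False

    try_from(0)
    return steps


if __name__ == "__main__":
    for n in (5, 10, 20, 40):
        print(n, count_memoized_steps(n))
```

With the memo in place, the number of explored positions grows linearly (n + 1 for n characters), and the total work stays polynomial, whereas the same search without the memo repeats work and grows exponentially.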
From then on, the parsing service just needed time to process the backlog of logs sent to us by customers. Likewise, the downstream services – alerting, live tail, archiving, searching – needed time to process the logs now being sent to them by the parsing service. The recovery was quicker for about 75% of our customers and slower for the other 25%.
What we are doing to prevent it from happening again:
The new parsing methodology has improved our overall performance significantly. We are also actively pursuing further optimizations.