Indexing, Livetail Performance, Search, and Alerting Delays
Incident Report for Mezmo Status Page
Postmortem

Dates:

Start Time: Monday, November 22, 2021, at 19:01 UTC

End Time: Tuesday, November 23, 2021, at 02:04 UTC

Duration: 7:03:00

What happened:

Newly submitted logs were not immediately available for Alerting, Searching, Live Tail, Graphing, and Timelines. About 25% of accounts were affected more than the rest. Log ingestion was not interrupted for any account, and no data was lost.

Why it happened:

Upon investigation, we discovered that the service which parses all incoming log lines was running very slowly. This service is upstream of all our other services, such as alerting, live tail, archiving, and searching; consequently, all those services were also delayed.

We isolated the slow parsing to the specific content of certain log lines. These log lines exposed an inefficiency in our line parsing service that caused the time needed to parse them to grow exponentially; this in turn created a bottleneck that delayed the parsing of other log lines. The inefficiency had been present for some time, but went undetected until one account started sending a large volume of these problematic lines.
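
To illustrate how specific line content can trigger exponential parse time, here is a minimal sketch. It is not the actual parser, and this report does not say which construct was at fault. The toy grammar ("one or more groups of one or more `a` characters, then a final `!`") and the function name `naive_parse` are assumptions chosen for illustration. The matcher tries every way of splitting the run of `a` characters into groups, which is the same shape of problem as catastrophic backtracking in some regular expression engines.

```python
from typing import Tuple


def naive_parse(line: str) -> Tuple[bool, int]:
    """Match the toy grammar (a+)+! by trying every grouping of the 'a' run.

    Returns (matched, number_of_recursive_calls). Hypothetical example only.
    """
    calls = 0

    def try_groups(pos: int) -> bool:
        nonlocal calls
        calls += 1
        end = pos
        # Try every possible length for the group that starts at pos.
        while end < len(line) and line[end] == "a":
            end += 1
            if line[end:] == "!":
                return True   # the remainder is exactly the terminator
            if try_groups(end):
                return True   # some later grouping succeeded
        return False          # every grouping from pos failed: backtrack

    return try_groups(0), calls


# A well-formed line matches after a handful of calls.
print(naive_parse("aaa!"))
# A line without the terminator forces every grouping to be tried,
# so each extra character doubles the work.
for n in (10, 11, 12):
    print(n, naive_parse("a" * n)[1])
```

In this sketch a line of `n` characters with no terminator costs 2^n calls, so a modest number of such lines can occupy a parser long enough to delay everything queued behind them.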

How we fixed it:

The line parsing service was updated to use a new algorithm that avoids the worst-case behavior of the original and also improves line parsing performance in general.
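
The new algorithm itself is not described in this report. As a hedged illustration of what avoiding worst-case behavior can look like, the sketch below parses a toy grammar (one or more `a` characters followed by a final `!`) in a single left-to-right pass with no backtracking. The grammar and the name `linear_parse` are assumptions made for this example.

```python
from typing import Tuple


def linear_parse(line: str) -> Tuple[bool, int]:
    """Match one or more 'a' characters followed by '!' in a single pass.

    Returns (matched, characters_scanned). Hypothetical example only.
    """
    pos = 0
    # Consume the whole run of 'a' characters once; never revisit a character.
    while pos < len(line) and line[pos] == "a":
        pos += 1
    matched = pos >= 1 and line[pos:] == "!"
    return matched, pos


print(linear_parse("aaa!"))      # well-formed line
print(linear_parse("a" * 40))    # malformed line still costs only one pass
```

Because each character is examined once, the cost grows in proportion to line length regardless of content, so no single line can monopolize the parser.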

From then on, the parsing service just needed time to process the backlog of logs sent to us by customers.  Likewise, the downstream services – alerting, live tail, archiving, searching – needed time to process the logs now being sent to them by the parsing service.  The recovery was quicker for about 75% of our customers and slower for the other 25%.
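
As a rough way to reason about recovery time, a backlog drains at the difference between the rate at which a service processes lines and the rate at which new lines arrive. The sketch below applies that arithmetic. The function name and all numbers are hypothetical and are not measurements from this incident.

```python
def backlog_drain_seconds(backlog_lines: float,
                          service_rate: float,
                          arrival_rate: float) -> float:
    """Estimate seconds needed to clear a backlog.

    backlog_lines: lines waiting to be processed (hypothetical)
    service_rate:  lines processed per second (hypothetical)
    arrival_rate:  new lines arriving per second (hypothetical)
    """
    if service_rate <= arrival_rate:
        # The backlog never shrinks if processing cannot outpace arrivals.
        raise ValueError("service_rate must exceed arrival_rate")
    return backlog_lines / (service_rate - arrival_rate)


# Made-up example: 3.6 million queued lines, 2000 lines/s processed,
# 1000 lines/s still arriving.
print(backlog_drain_seconds(3_600_000, 2000, 1000) / 60, "minutes")
```

Under this model, accounts with a larger backlog or a smaller gap between processing and arrival rates take longer to catch up, which is one possible reading of why recovery times differed between accounts.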

What we are doing to prevent it from happening again:

The new parsing methodology has improved our overall performance significantly.  We are also actively pursuing further optimizations.

Posted Nov 30, 2021 - 20:43 UTC

Resolved
This incident has been resolved. All services are fully operational.
Posted Nov 23, 2021 - 02:04 UTC
Monitoring
Our services are mostly back to normal; we are monitoring.
Posted Nov 23, 2021 - 00:51 UTC
Update
We are still actively investigating and working on a fix for the issue.
Posted Nov 22, 2021 - 20:54 UTC
Update
There continue to be delays with processing newly sent log lines. Additionally, some alerts are not triggering.
Posted Nov 22, 2021 - 19:53 UTC
Investigating
We are currently experiencing delays in searching for newly ingested log data, livetail, and alerts at this time. We are investigating and working quickly to mitigate the issue.
Posted Nov 22, 2021 - 19:01 UTC
This incident affected: Log Analysis (Search, Alerting, Livetail).