Ingestion, Searching, Live Tail, Alerting, Graphing, and Timelines Delays
Incident Report for Mezmo Status Page
Postmortem

Dates:
Start Time: Wednesday, January 26, 2022, at 15:45:00 UTC
End Time: Wednesday, January 26, 2022, at 16:30:00 UTC
Duration: 00:45:00

What happened:

Ingestion was halted, and newly submitted logs were not immediately available for Alerting, Live Tail, Searching, Graphing, and Timelines. Some alerts were never triggered.

Once ingestion resumed, LogDNA agents running in customer environments resent all locally cached logs to our service for ingestion. No data was lost.
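The resend behavior described above can be illustrated with a minimal sketch of an agent-side buffer. This is not the LogDNA agent's actual implementation; the class name BufferedSender, the send callback, and the batch size are all hypothetical. The idea is only that lines stay in a local cache until the service accepts them, so an ingestion outage delays delivery rather than dropping data.

```python
from collections import deque
from itertools import islice
from typing import Callable, Deque, List


class BufferedSender:
    """Hypothetical agent-side buffer: a line is removed from the local
    cache only after the service has accepted the batch containing it."""

    def __init__(self, send: Callable[[List[str]], bool], batch_size: int = 100) -> None:
        self._send = send              # returns True when the service accepted the batch
        self._batch_size = batch_size
        self._pending: Deque[str] = deque()

    def submit(self, line: str) -> None:
        """Cache the line locally, then try to deliver whatever is pending."""
        self._pending.append(line)
        self.flush()

    def flush(self) -> int:
        """Send pending lines in order; stop at the first rejected batch.

        Returns the number of lines delivered by this call.
        """
        delivered = 0
        while self._pending:
            batch = list(islice(self._pending, self._batch_size))
            if not self._send(batch):
                break                  # service unavailable: keep everything cached
            for _ in batch:
                self._pending.popleft()
            delivered += len(batch)
        return delivered

    @property
    def pending(self) -> int:
        return len(self._pending)
```

In this sketch, calling flush() again once the service is reachable delivers the cached lines in their original order, which mirrors the "resent all locally cached logs" behavior at a very small scale.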

Why it happened:

Our Redis database failed over, and the services that depend on it were unable to recover automatically. Normally, the pods running our ingestion service deliberately crash until they are able to access Redis again. However, these pods were in a bad state and unable to reconnect when Redis returned.
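The "deliberately crash until Redis is reachable" pattern can be sketched as a fail-fast dependency check. This is an illustration under stated assumptions, not our ingestion service's code: the function names are hypothetical, and a plain TCP connection attempt stands in for a real Redis health check. The process exits with a non-zero status when the dependency is unreachable, leaving it to the orchestrator to restart the pod and try again.

```python
import socket
import sys


def dependency_reachable(host: str, port: int, timeout: float = 1.0) -> bool:
    """Return True if a TCP connection to host:port can be opened.

    Assumption: TCP reachability is used here as a stand-in for a real
    Redis health check.
    """
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


def ensure_dependency_or_exit(host: str, port: int) -> None:
    """Fail fast: exit non-zero so the orchestrator restarts the process."""
    if not dependency_reachable(host, port):
        print(f"dependency {host}:{port} unreachable, exiting", file=sys.stderr)
        sys.exit(1)
```

The design relies on each restart producing a fresh process that picks up current configuration and connection details. As this incident shows, it only works if the restarted process is actually able to reconnect.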

Since ingestion was halted, newly submitted logs were not passed on to many downstream services, such as Alerting, Live Tail, Searching, Graphing, and Timelines.

How we fixed it:

We manually restarted all the pods of our ingestion service, then restarted all the Redis sentinel pods. The ingestion service became operational again and logs were passed on to all downstream services. Within a short period, these services worked through the backlog, and newly submitted logs were again available without delay.

What we are doing to prevent it from happening again:

The ingestion pods were in a bad state because they had not been restarted after a configuration change made several days earlier, for reasons unrelated to this incident. The runbook for making such configuration changes has been updated to prevent this procedural failure in the future.
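One way a runbook step like this could be verified is to compare each pod's start time with the time of the configuration change. The sketch below is hypothetical and does not reproduce the actual runbook: the function name stale_pods, the pod names, and the timestamps are invented for illustration, and in practice the start times would come from the orchestrator.

```python
from datetime import datetime, timezone
from typing import Dict, List


def stale_pods(pod_start_times: Dict[str, datetime], config_changed_at: datetime) -> List[str]:
    """Return the names of pods that started before the configuration change
    and therefore have not been restarted since it was made."""
    return sorted(
        name
        for name, started_at in pod_start_times.items()
        if started_at < config_changed_at
    )


# Illustrative values only; none of these names or times come from the incident.
config_changed_at = datetime(2022, 1, 20, 12, 0, tzinfo=timezone.utc)
pods = {
    "ingester-0": datetime(2022, 1, 18, 9, 30, tzinfo=timezone.utc),
    "ingester-1": datetime(2022, 1, 20, 12, 5, tzinfo=timezone.utc),
    "ingester-2": datetime(2022, 1, 19, 23, 0, tzinfo=timezone.utc),
}
print(stale_pods(pods, config_changed_at))
```

A non-empty result would mean the change has not been fully rolled out, and the listed pods still need a restart.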

We’re also in the middle of a project to make all services more tolerant of Redis failovers.

Posted Jan 31, 2022 - 20:16 UTC

Resolved
This incident has been resolved. All services are operational.
Posted Jan 26, 2022 - 17:15 UTC
Monitoring
We have implemented a fix and are monitoring the results. Logs are being ingested again and all services are operational.
Posted Jan 26, 2022 - 16:58 UTC
Update
We are continuing to investigate this issue.
Posted Jan 26, 2022 - 16:23 UTC
Investigating
Ingestion services are currently halted. Customers will also experience delays with Searching, Live Tail, Alerting, Graphing, and Timelines.
Posted Jan 26, 2022 - 16:10 UTC
This incident affected: Log Analysis (Log Ingestion (Agent/REST API/Code Libraries), Log Ingestion (Heroku), Log Ingestion (Syslog), Web App, Search, Alerting, Livetail).