Web UI and Ingestion are intermittently unavailable. Alerting is halted and new Live Tail sessions can't be started.
Incident Report for Mezmo Status Page
Postmortem

Dates:
Start Time: Tuesday, January 18, 2022, at 21:00:00 UTC
End Time: Wednesday, January 19, 2022, at 05:30:00 UTC
Duration: 8 hours, 30 minutes

What happened:

Our Web UI returned an error when customers tried to log in or load pages. The errors occurred in short intervals of about 1-2 minutes each, after which the Web UI returned to normal operation. There were about 20 such intervals over the course of more than 4 hours.

The ingestion of logs was also halted during these 1-2 minute intervals. LogDNA agents running in customer environments quickly resent the logs once ingestion resumed.

Alerting was halted for the duration of the incident, and new Live Tail sessions could not be started.

Why it happened:

We updated our parser service, which required scaling down all of its pods and restarting them. A new feature of the parser flushes its in-memory state to our Redis database upon restart. The flushing worked as intended, but it also overwhelmed the database and made it unavailable to other services. This caused the pods running our Web UI and ingestion service to go into a “Not Ready” state, and our API gateway then stopped sending traffic to those pods. When customers tried to load pages in the Web UI, the API gateway returned an error.
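To illustrate the failure chain, here is a minimal sketch of a readiness check that depends on Redis; the endpoint path, port, and REDIS_HOST variable are illustrative and are not taken from our actual services. When Redis stops responding, the probe fails, Kubernetes marks the pod “Not Ready”, and the API gateway stops routing traffic to it.

```python
# Minimal sketch of a readiness endpoint that depends on Redis.
# The path, port, and REDIS_HOST variable are hypothetical.
import os
import redis
from http.server import BaseHTTPRequestHandler, HTTPServer

r = redis.Redis(host=os.environ.get("REDIS_HOST", "localhost"), socket_timeout=1)

class ReadinessHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        if self.path != "/readyz":
            self.send_response(404)
            self.end_headers()
            return
        try:
            r.ping()                 # Redis reachable: report Ready
            self.send_response(200)
        except redis.RedisError:     # Redis overwhelmed or unreachable
            self.send_response(503)  # probe fails; Kubernetes marks the pod "Not Ready"
        self.end_headers()

if __name__ == "__main__":
    HTTPServer(("", 8080), ReadinessHandler).serve_forever()
```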

When the Redis database became unresponsive, our alerting service stopped working and new Live Tail sessions could not be started. Our monitoring of these services was inadequate, and we were not alerted to the failures.

How we fixed it:

Restarting the parser service was unavoidable. We split the restart process into small segments to keep the intervals of unavailability as short as possible. In practice, there were 20 small restarts over more than 4 hours, each causing 1-2 minutes of unavailability. The Web UI and the ingestion service were fully operational by January 19 at 01:21 UTC.
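For illustration, a rough sketch of that segmented restart, assuming the parser runs as a Kubernetes Deployment selected by a hypothetical app=parser label; the batch size and pauses shown are illustrative, not the values we actually used.

```python
# Illustrative sketch of restarting parser pods in small batches,
# pausing between batches so the Redis flush traffic is spread out.
# The label selector, batch size, and pauses are hypothetical.
import subprocess
import time

BATCH_SIZE = 2        # pods restarted per batch (illustrative)
PAUSE_SECONDS = 600   # pause between batches (illustrative)

def kubectl(*args: str) -> str:
    """Run a kubectl command and return its stdout."""
    return subprocess.run(["kubectl", *args], check=True,
                          capture_output=True, text=True).stdout

pods = kubectl("get", "pods", "-l", "app=parser", "-o", "name").split()

for i in range(0, len(pods), BATCH_SIZE):
    batch = pods[i:i + BATCH_SIZE]
    kubectl("delete", *batch)   # the Deployment recreates the deleted pods
    time.sleep(10)              # give the Deployment a moment to create replacements
    kubectl("wait", "--for=condition=Ready", "--timeout=300s",
            "pod", "-l", "app=parser")
    time.sleep(PAUSE_SECONDS)   # let Redis absorb the flush before the next batch
```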

On January 19 at 05:30 UTC, we manually restarted the Alerting and Live Tail services, which then returned to normal operation.

What we are doing to prevent it from happening again:

We’ve added code to slow down the shutdown process for the parser service so that the impact on our Redis database is staggered over time. Restarting the parser is uncommon; before any future parser update in production, we intend to run load tests of the restart process to confirm that Redis is no longer affected by the new flushing behavior.
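As a sketch of the idea, the example below staggers a flush to Redis by writing in small pipelined batches with a pause between them; the batch size, delay, and key names are hypothetical and not our production values.

```python
# Illustrative sketch of a throttled shutdown flush to Redis,
# writing in-memory state in small pipelined batches with pauses in between.
# Batch size, delay, and key naming are hypothetical.
import time
import redis

BATCH_SIZE = 500        # keys written per pipeline
DELAY_SECONDS = 0.25    # pause between batches to avoid overwhelming Redis

def flush_state(r: redis.Redis, state: dict) -> None:
    items = list(state.items())
    for i in range(0, len(items), BATCH_SIZE):
        pipe = r.pipeline(transaction=False)
        for key, value in items[i:i + BATCH_SIZE]:
            pipe.set(key, value)
        pipe.execute()              # one round trip per batch
        time.sleep(DELAY_SECONDS)   # spread the load over the shutdown window

if __name__ == "__main__":
    client = redis.Redis(host="localhost", port=6379)
    flush_state(client, {f"parser:buffer:{i}": "..." for i in range(5000)})
```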

We will improve our monitoring so that we are alerted when services such as Live Tail and Alerting stop functioning.

Posted Jan 19, 2022 - 19:57 UTC

Resolved
This incident has been resolved. All services are fully operational.
Posted Jan 19, 2022 - 01:21 UTC
Identified
We've identified the source of the failure and are taking action to correct it. The Web UI continues to be unavailable at times, for intervals of 1-2 minutes each.
Posted Jan 18, 2022 - 23:45 UTC
Investigating
Our Web UI is under maintenance and may not load pages consistently. We are working to recover as soon as possible.
Posted Jan 18, 2022 - 22:59 UTC
This incident affected: Log Analysis (Web App).