Delays in Alerting, Searching, Live Tail, Graphing, and Timelines. WebUI intermittently unavailable.

Incident Report for Mezmo Status Page

Postmortem

Dates:
Start Time: Tuesday, February 8, 2022, at 13:17 UTC
End Time: Tuesday, February 8, 2022, at 14:21 UTC
Duration: 1:04:00

What happened:

Our Web UI was unresponsive for about 10 minutes. Newly submitted logs were not immediately available for Alerting, Searching, Live Tail, Graphing, and Timelines. No data was lost and ingestion was not halted.

Why it happened:

Our Redis database had a failover, and the services that depend on it, including the Parser, were unable to reconnect after it recovered. The Parser is upstream of many other services. Consequently, newly submitted logs were not passed on to downstream services such as Alerting, Live Tail, Searching, Graphing, and Timelines. The Web UI was also intermittently unavailable because it requires a connection to Redis.

How we fixed it:

We manually restarted the Redis service, which allowed a new master to be elected. After Redis recovered, we restarted the Parser, Web UI, and other services, which were then able to reestablish their connections to Redis. This restored the Web UI and allowed newly submitted logs to pass from our Parser service to all downstream services. Over a short period of time, these services processed the backlog of logs, and newly submitted logs were again available without delays.

What we are doing to prevent it from happening again:

We recently added functionality to track the flow rate of newly submitted logs. During a Redis failover, this new feature requires more memory than we expected, which is why services could not reconnect to Redis. We’ve increased the memory buffer limits for the relevant portions of our service.

We will also add more Redis monitoring to detect unhealthy sentinels more quickly, and we will continue our ongoing project to make all services more tolerant of Redis failovers.
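
A minimal sketch of what client-side failover tolerance can look like: retrying the connection with capped exponential backoff and jitter instead of giving up after the first failure. Everything here is an illustrative assumption and not taken from the services described above: `connect_with_backoff`, its parameters and defaults, and the hypothetical `connect` callable that opens a Redis connection and raises Python's built-in `ConnectionError` on failure (a real client library may raise its own exception type).

```python
import random
import time
from typing import Callable, TypeVar

T = TypeVar("T")


def connect_with_backoff(
    connect: Callable[[], T],
    max_attempts: int = 8,
    base_delay: float = 0.1,
    max_delay: float = 5.0,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Call `connect` until it succeeds, waiting longer after each failure.

    `connect` is a hypothetical callable supplied by the caller; it should
    return a connection object or raise ConnectionError.
    """
    for attempt in range(max_attempts):
        try:
            return connect()
        except ConnectionError:
            if attempt == max_attempts - 1:
                raise  # out of attempts: surface the last error
            # Exponential backoff capped at max_delay, with full jitter so
            # many services do not all retry at the same instant.
            cap = min(max_delay, base_delay * (2 ** attempt))
            sleep(random.uniform(0, cap))
    # Only reached when max_attempts is zero or negative.
    raise ValueError("max_attempts must be at least 1")
```

The jitter matters after a failover because many services lose their connections at the same moment; randomizing the wait spreads their reconnect attempts out rather than sending them all to the newly elected master at once.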

Posted Feb 15, 2022 - 19:25 UTC

Resolved

This incident has been resolved.

Posted Feb 08, 2022 - 14:40 UTC

Monitoring

We have implemented a fix and are monitoring the results. Newly sent logs are being processed again with minimal delays and all services are operational.

Posted Feb 08, 2022 - 14:25 UTC

Update

We are continuing to investigate the issue. The web app may be only intermittently accessible. Logs may arrive with a delay, which will impact searching and alerting.

Posted Feb 08, 2022 - 13:55 UTC

Investigating

We are currently investigating the issue.

Posted Feb 08, 2022 - 13:38 UTC

This incident affected: Log Analysis (Web App, Search, Alerting, Livetail).