Search and Live Tail Intermittent Failures

Incident Report for Mezmo Status Page

Postmortem

Start Time: Thursday, August 19, 2021, at 13:56 UTC

End Time: Thursday, August 19, 2021, at 20:48 UTC

Duration: 6:52:00

What happened:

Searches in our Web UI and Live Tail were intermittently failing. Additionally, for a small set of customers (about 12%), there were delays in newly submitted logs being available for searching, graphing, and timelines.

Why it happened:

Our service uses the Calico networking solution to ensure network-level connectivity between all nodes and the pods running on them. On several nodes, Calico stopped running. This put one of our Elasticsearch clusters into an unhealthy red state. For customers using this cluster (about 12% of all our customers), there were delays in newly submitted logs being available for searching, graphing, and timelines. The loss of Calico on those nodes also led to failures with tribe nodes, which make searching across multiple clusters possible. This caused intermittent failures in searching and live tail for all customers.
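For readers unfamiliar with the "red state" mentioned above: Elasticsearch reports cluster health as green, yellow, or red through its `_cluster/health` API. The following is a minimal sketch of how a monitoring check might interpret such a response. The function name and the sample payload are hypothetical and are not taken from our tooling or from this incident.

```python
import json

# Hypothetical sample of a `GET _cluster/health` response body.
# The values are illustrative only and are not data from the incident.
SAMPLE_RESPONSE = json.dumps({
    "cluster_name": "example-cluster",
    "status": "red",
    "number_of_nodes": 5,
    "unassigned_shards": 12,
})


def summarize_health(body: str) -> str:
    """Map an Elasticsearch cluster health response to a short summary."""
    health = json.loads(body)
    status = health.get("status", "unknown")
    if status == "green":
        return "healthy: all primary and replica shards are assigned"
    if status == "yellow":
        return "degraded: all primaries are assigned, some replicas are not"
    if status == "red":
        return "unhealthy: at least one primary shard is unassigned"
    return f"unknown status: {status}"


print(summarize_health(SAMPLE_RESPONSE))
```

A red status means some data in the cluster is unavailable, which is consistent with the delays in newly submitted logs described above.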

How we fixed it:

We took remedial action by restarting Calico and restoring network connections between all nodes. We also restarted our tribe nodes, repaired the Elasticsearch cluster that was in a red state, and then added temporary resources so our service could more quickly process the backlog of logs that customers had sent since the beginning of the incident.

What we are doing to prevent it from happening again:

We are investigating why Calico stopped working on several nodes. We’re also updating our runbooks to recover more quickly in similar situations and limit any customer impact.

Posted Aug 20, 2021 - 17:08 UTC

Resolved

Searching and live tail are working as expected. All services are fully operational.

Posted Aug 19, 2021 - 20:48 UTC

Identified

We’ve identified the immediate cause and are taking remedial action. Searches in our Web UI and Live Tail are working better now, but are still failing at times.

Posted Aug 19, 2021 - 17:30 UTC

Update

Our engineers are taking steps to mitigate the impact as we identify the root cause.

Posted Aug 19, 2021 - 15:31 UTC

Investigating

Searches in our Web UI and Live Tail are failing intermittently for some customers. We are investigating.

Posted Aug 19, 2021 - 13:56 UTC

This incident affected: Log Analysis (Search, Livetail).