Start Time: Thursday, August 19, 2021, at, 13:56 UTC
End Time: Thursday, August 19, 2021, at, 20:48 UTC
Duration: 6:52:00
What happened:
Searches in our Web UI and Live Tail were intermittently failing. Additionally, for a small set of customers (about 12%), there were delays in newly submitted logs being available for searching, graphing, and timelines.
Why it happened:
Our service uses the Calico networking solution to ensure network level connectivity between all nodes and the pods running on them. On several nodes, Calico stopped running. This put one of our ElasticSearch clusters into an unhealthy red state. For customers using this cluster (about 12% of all our customers), there were delays in newly submitted logs being available for searching, graphing, and timelines. When Calico stopped running on some nodes, it also led to failures with tribe nodes, which make searching across multiple clusters possible. This caused intermittent failures in searching and live tail, for all customers.
How we fixed it:
We took remedial action by restarting Calico and restoring networking connections between all nodes. We also restarted our tribe nodes, repaired the ElasticSearch cluster that was in a red state, and then provided temporary resources so our service could more quickly process the backlog of logs sent by customers since the beginning of the incident.
What we are doing to prevent it from happening again:
We are investigating why Calico stopped working on several nodes. We’re also updating our runbooks to recover more quickly in similar situations and limit any customer impact.