Dates:
Start Time: Thursday, January 20, 2022, at 19:13:00 UTC
End Time: Thursday, January 20, 2022, at 21:24:00 UTC
Duration: 02:11:00
What happened:
Ingestion was halted and our Web UI was unresponsive for about 5-10 minutes. Newly submitted logs were not immediately available for Alerting, Searching, Live Tail, Graphing, and Timelines.
Why it happened:
Our service hosting provider, Equinix Metal, had an outage caused by the failure of one of their main switches (more details at https://status.equinixmetal.com/incidents/gjmh37y6rkjp). The outage impacted traffic and global network connectivity to the LogDNA service.
During the Equinix Metal incident, Ingestion, Alerting, and Live Tail were halted and our Web UI was unresponsive for a period of 5-10 minutes. Multiple Elasticsearch (ES) clusters also went into an unhealthy state. As a result, for about one hour, newly submitted logs were not immediately available for Searching, Graphing, and Timelines.
How we fixed it:
LogDNA could take no remedial action while the outage was in progress. We waited until Equinix Metal, our service hosting provider, resolved its incident. The ES clusters were then repaired, and the backlog of newly submitted logs was processed in about one hour.
What we are doing to prevent it from happening again:
For this type of incident, LogDNA cannot take proactive preventive measures.