Many services briefly halted due to Cloud Provider incident

Incident Report for Mezmo Status Page

Postmortem

Dates:
Start Time: Thursday, January 20, 2022, at 19:13:00 UTC
End Time: Thursday, January 20, 2022, at 21:24:00 UTC
Duration: 02:11:00

What happened:
Ingestion was halted and our Web UI was unresponsive for about 5-10 minutes. Newly submitted logs were not immediately available for Alerting, Searching, Live Tail, Graphing, and Timelines.

Why it happened:
Our service hosting provider, Equinix Metal, had an outage caused by the failure of one of their main switches (more details at https://status.equinixmetal.com/incidents/gjmh37y6rkjp). The outage disrupted traffic and global network connectivity to the LogDNA service.

During the Equinix Metal incident, Ingestion, Alerting, and Live Tail were halted and our Web UI was unresponsive for 5-10 minutes. Multiple Elasticsearch (ES) clusters also became unhealthy. As a result, for about one hour, newly submitted logs were not immediately available for Searching, Graphing, and Timelines.

How we fixed it:
LogDNA could take no remedial action of its own. We waited until Equinix Metal, our service hosting provider, resolved its incident. The ES clusters were repaired, and the backlog of newly submitted logs was processed in about one hour.

What we are doing to prevent it from happening again:
LogDNA cannot take proactive preventive measures against this type of incident.

Posted Jan 21, 2022 - 19:51 UTC

Resolved

This incident has been resolved. All services are operational.

Posted Jan 20, 2022 - 21:38 UTC

Monitoring

Logs are being ingested again without delays. All services are working normally. We will monitor until our Cloud Provider closes their incident.

Posted Jan 20, 2022 - 21:07 UTC

Investigating

Our Cloud Provider, Equinix, is having an incident (see https://status.equinixmetal.com/incidents/gjmh37y6rkjp). For about 5-10 minutes, ingestion was halted and the Web UI was not responsive. Some alerts may not have been triggered. All services are currently working, but there are some delays in processing recently sent logs. We are monitoring Equinix’s incident closely.

Posted Jan 20, 2022 - 19:57 UTC

This incident affected: Log Analysis (Log Ingestion (Agent/REST API/Code Libraries), Log Ingestion (Heroku), Log Ingestion (Syslog), Web App, Search, Alerting, Livetail).