Dates:
The incident was opened on December 17, 2020 - 23:29 UTC.
Our service was fully operational by December 18, 2020 - 12:30 UTC.
The incident was officially closed on December 20, 2020 - 03:49 UTC.
What happened:
All services were unavailable for about eight hours. For an additional four hours, services were available but there were significant delays in searching, graphing, and timelines for newly submitted logs.
Additionally, all logs submitted during the first six hours of the incident were never processed by our service and were unavailable in the UI, even after our service was fully operational.
Why it happened:
Our hosting provider had a major power failure that lasted almost five hours. The hardware that our service runs on was unavailable and none of our services could operate.
More details: https://status.equinixmetal.com/incidents/pfgmgy1fnjcp
How we fixed it:
Once our provider was back online, we gradually restarted all our services. This took time and manual intervention because all our services had been taken down ungracefully by the outage.
Around December 18, 2020 - 07:54 UTC, services became operational and logs began to be ingested again. Since no logs had been ingested for about eight hours, our service had a large backlog to process. As it caught up, users experienced delays in searching, graphing, and timelines for newly submitted logs. The backlog was fully processed around December 18, 2020 - 12:30 UTC and services were once again fully operational.
Logs submitted during the first six hours of the incident (around December 17, 2020 - 23:00 UTC to December 18, 2020 - 05:00 UTC) remained unavailable in the UI. Normally, if our service is temporarily unavailable, logs can be resubmitted and successfully processed. In this case, the sudden loss of power brought down our services ungracefully, abruptly interrupting write operations as we processed logs. The resulting partial and bad writes left our service unable to determine where log lines began in the resubmitted logs. In effect, logs resubmitted from that six-hour period were unreadable and could not be processed.
The incident was kept open as we made attempts to read and process these logs, but these efforts were ultimately unsuccessful.
After the incident was closed, we developed the means to restore archives of these logs to all customers with version 3 of archiving enabled. The restoration of archives is expected to begin during the week of January 18, 2021.
What we are doing to prevent it from happening again:
We are developing changes to how we write logs so that, in a similar event, our service will not lose track of where log lines start and will still be able to read and process them.
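For readers curious about the general idea, one common technique for keeping line boundaries recoverable after an interrupted write is to frame each log line with a fixed marker, a length, and a checksum. The Python sketch below is an illustration only and does not describe the actual changes being developed. The marker value, header layout, and function names (MAGIC, HEADER, encode_record, recover_records) are all hypothetical.

```python
import struct
import zlib

# Hypothetical 4-byte marker; any fixed byte pattern could serve.
MAGIC = b"\xf0LOG"
# Assumed header layout: marker, payload length, CRC32 of the payload.
HEADER = struct.Struct(">4sII")


def encode_record(line: bytes) -> bytes:
    """Frame one log line as marker + length + checksum + payload."""
    return HEADER.pack(MAGIC, len(line), zlib.crc32(line)) + line


def recover_records(data: bytes) -> list[bytes]:
    """Return every intact record, skipping torn or corrupt regions."""
    records = []
    pos = 0
    while True:
        # Look for the next candidate record start.
        pos = data.find(MAGIC, pos)
        if pos == -1 or pos + HEADER.size > len(data):
            break
        _, length, crc = HEADER.unpack_from(data, pos)
        start = pos + HEADER.size
        payload = data[start:start + length]
        if len(payload) == length and zlib.crc32(payload) == crc:
            # Length and checksum agree: accept and jump past the record.
            records.append(payload)
            pos = start + length
        else:
            # Not a valid record here: resume the scan one byte later.
            pos += 1
    return records
```

With this kind of framing, a record cut off mid-write fails its length or checksum test, and the reader scans forward to the next marker instead of losing its place in everything that follows. The cost is a few bytes of overhead per line and a checksum computation on each write and read.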