Service Unavailable
Incident Report for Mezmo Status Page
Postmortem

Dates:

The incident was opened on December 17, 2020 - 23:29 UTC.
Our service was fully operational by December 18, 2020 - 12:30 UTC.
The incident was officially closed on December 20, 2020 - 03:49 UTC.

What happened:

All services were unavailable for about eight hours. For an additional four hours, services were available but there were significant delays in searching, graphing, and timelines for newly submitted logs.

Additionally, all logs submitted during the first six hours of the incident were never processed by our service and were unavailable in the UI, even after our service was fully operational.

Why it happened:

Our hosting provider had a major power failure that lasted almost five hours. The hardware that our service runs on was unavailable and none of our services could operate.

More details: https://status.equinixmetal.com/incidents/pfgmgy1fnjcp

How we fixed it:

Once our provider was back online, we gradually restarted all our services. This took time and manual intervention because all our services had been taken down ungracefully by the outage.

Around December 18, 2020 - 07:54 UTC, services became operational and logs began to be ingested again. Since no logs had been ingested for about eight hours, our service had a large backlog to process. As it caught up, users experienced delays in searching, graphing, and timelines for newly submitted logs. The backlog was fully processed around December 18, 2020 - 12:30 UTC and services were once again fully operational.

Logs submitted during the first six hours of the incident (around December 17, 2020 - 23:00 UTC to December 18, 2020 - 05:00 UTC) remained unavailable in the UI. Normally, if our service is temporarily unavailable, logs can be resubmitted and successfully processed. In this case, the sudden loss of power brought down our services ungracefully, abruptly interrupting write operations as we processed logs. This resulted in partial and bad writes, which left our service unable to determine where log lines began in the resubmitted logs. In effect, logs resubmitted from that six-hour period were unreadable and could not be processed.
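To illustrate the failure mode described above, here is a minimal sketch of how one torn write can make record boundaries unrecoverable. It assumes a simple length-prefixed layout (a 4-byte length followed by the log line). This layout, and the names `frame` and `read_all`, are hypothetical and are not a description of the service's actual storage format.

```python
import struct


def frame(line: bytes) -> bytes:
    """Hypothetical framing: 4-byte big-endian length, then the log line."""
    return struct.pack(">I", len(line)) + line


def read_all(buf: bytes) -> list[bytes]:
    """Read records back to back, trusting every length prefix."""
    out, pos = [], 0
    while pos + 4 <= len(buf):
        (length,) = struct.unpack_from(">I", buf, pos)
        pos += 4
        if pos + length > len(buf):
            raise ValueError(f"record at offset {pos - 4} runs past end of data")
        out.append(buf[pos:pos + length])
        pos += length
    return out


lines = [b"line one", b"line two", b"line three"]

# Clean shutdown: every record is written in full.
intact = b"".join(frame(line) for line in lines)

# Power loss: the second record is cut off after 9 of its 12 bytes,
# and later writes are appended directly after the partial record.
torn = frame(lines[0]) + frame(lines[1])[:9] + frame(lines[2])

print(read_all(intact))
try:
    read_all(torn)
except ValueError as err:
    print("torn stream unreadable:", err)
```

In the torn stream, the reader consumes part of the third record as the tail of the second, so every later length prefix is read from the wrong offset. Nothing in the data marks where the next valid record starts.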

The incident was kept open as we made attempts to read and process these logs, but these efforts were ultimately unsuccessful.

After the incident was closed, we developed the means to restore archives of these logs to all customers with version 3 of archiving enabled. The restoration of archives is expected to begin during the week of January 18, 2021.

What we are doing to prevent it from happening again:

We are developing changes to how we write logs so that, in a similar event, our service will not lose track of where log lines start and will still be able to read and process them.
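One common way to keep record boundaries recoverable after an interrupted write is to prefix each record with a sync marker and a checksum, so a reader can skip damaged bytes and resynchronize on the next intact record. The sketch below illustrates that general technique only. The marker value, the header layout, and the names `encode_record` and `recover_records` are assumptions made for this example, not the changes actually being developed.

```python
import struct
import zlib

MAGIC = b"\xfaLOG"               # hypothetical sync marker
HEADER = struct.Struct(">4sII")  # marker, payload length, CRC32 of payload


def encode_record(line: bytes) -> bytes:
    """Frame one log line as marker + length + checksum + payload."""
    return HEADER.pack(MAGIC, len(line), zlib.crc32(line)) + line


def recover_records(buf: bytes) -> list[bytes]:
    """Return every intact record in buf, skipping torn or corrupt regions."""
    records = []
    pos = 0
    while True:
        pos = buf.find(MAGIC, pos)
        if pos < 0 or pos + HEADER.size > len(buf):
            break  # no further marker, or not enough bytes left for a header
        _, length, crc = HEADER.unpack_from(buf, pos)
        start = pos + HEADER.size
        payload = buf[start:start + length]
        if len(payload) == length and zlib.crc32(payload) == crc:
            records.append(payload)
            pos = start + length  # jump past the verified record
        else:
            pos += 1              # damaged record: rescan for the next marker
    return records


# Simulate a power loss that cuts the second record short (17 of 20 bytes),
# followed by a complete record written after recovery.
stream = (
    encode_record(b"line one")
    + encode_record(b"line two")[:17]
    + encode_record(b"line three")
)
print(recover_records(stream))  # [b'line one', b'line three']
```

The torn record is lost, but the records before and after it remain readable because the reader never has to trust a length field it cannot verify.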

Posted Jan 14, 2021 - 22:35 UTC

Resolved
All services are operational. Most logs sent on December 17th for the six hours between 6 pm ET and midnight ET are not available in the UI. Although this incident is now closed, we will continue to work to make archives of these logs available to customers who chose to enable the archiving feature.
Posted Dec 20, 2020 - 03:49 UTC
Update
All services are operational. We continue to work on making logs sent during our provider’s outage (from approximately 23:00 UTC to 3:00 UTC) available in our UI.
Posted Dec 18, 2020 - 20:48 UTC
Monitoring
Service has now been restored. We are monitoring the environment closely at this time. Logs sent during our provider’s outage (from approximately 23:00 UTC to 3:00 UTC) are still unavailable in our UI.
Posted Dec 18, 2020 - 15:36 UTC
Update
Ingestion of new logs is working normally. Logs sent to our service since about 3:00 UTC have now been ingested and are available for searching and timelines. Logs sent during our provider’s outage (from approximately 23:00 UTC to 3:00 UTC) are still unavailable in our UI.
Posted Dec 18, 2020 - 12:30 UTC
Update
Ingestion of new logs is working normally. Logs sent to our service since about 3:00 UTC have mostly been ingested and are mostly available for searching and timelines. Logs sent during our provider’s outage (from approximately 23:00 UTC to 3:00 UTC) are still unavailable in our UI.
Posted Dec 18, 2020 - 09:20 UTC
Update
New logs are being ingested again, although there is a large backlog to process. Searching, timelines, and alerting based on newly sent logs will be delayed. Live tail is working normally. Logs sent during our provider’s outage (from approximately 23:00 UTC to 3:00 UTC) are still unavailable in our UI.
Posted Dec 18, 2020 - 07:54 UTC
Update
We continue to make progress on restoring our service to full functionality. Please note logs may be unavailable in the web app until we have fully recovered.
Posted Dec 18, 2020 - 04:47 UTC
Update
We continue to make progress on restoring our service to full functionality. Please note logs may be unavailable in the web app until we have fully recovered.
Posted Dec 18, 2020 - 03:46 UTC
Update
We continue to make progress on restoring our service to full functionality. Please note logs may be unavailable in the web app until we have fully recovered.
Posted Dec 18, 2020 - 03:33 UTC
Update
Our provider has now fully recovered from their incident. We have begun bringing our services back online. Please note logs may be unavailable in the web app until we have fully recovered.
Posted Dec 18, 2020 - 02:10 UTC
Update
Our provider has almost completely recovered from their incident. We are preparing to restart our own services.
Posted Dec 18, 2020 - 01:20 UTC
Identified
All services are unavailable due to an incident with our hosting provider. More information can be found here: https://status.equinixmetal.com/
Posted Dec 17, 2020 - 23:40 UTC
Investigating
We are currently investigating an issue that is rendering our service unavailable at this time.
Posted Dec 17, 2020 - 23:29 UTC
This incident affected: Log Analysis (Log Ingestion (Agent/REST API/Code Libraries), Log Ingestion (Heroku), Log Ingestion (Syslog), Web App, Search, Alerting, Livetail, Archiving).