Dates:
Start Time: Saturday, October 26, 2024 at 01:00 UTC
End Time: Saturday, October 26, 2024 at 20:23 UTC
Duration: 19 hours and 23 minutes
(Note that customer impact was limited to the first four hours of the incident.)
What happened:
For approximately two hours, pages on the Web UI for Pipeline did not load. Pages then began to load again, but often only after delays of 15 seconds or longer. After another two hours, the Web UI returned to normal operation. The incident was kept open until all remediation was complete.
The Web UI for Log Analysis and the ingress and egress of Pipeline data were unaffected.
Why it happened:
On Wednesday, October 23, 2024, we deployed a change to our Pipeline service that began saving pipeline usage metrics to a Postgres database. The same database also stores account configuration information, which must be read in order to display pages in the Web UI.
The deployment on Wednesday changed the performance profile of the database, most notably by increasing the number and frequency of writes. There was no immediate customer impact, but we noted that backups of the database could no longer complete successfully because of the increased load.
Customer impact began only on Saturday, October 26, when an unrelated user action (running a Profiler from the Web UI) placed additional demands on the Postgres database. The database could no longer keep up with its query load and, by design, moved into a read-only mode to prevent any loss of data. At times the database was entirely inaccessible or very slow to respond. Because the Web UI relies on configuration information stored in this database, it was unable to display pages.
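To illustrate the failure mode (this is not our actual code; the table and function names are hypothetical), here is a minimal Python sketch of how a write to a Postgres database in read-only mode surfaces to an application using psycopg2:

    import psycopg2
    from psycopg2 import errors

    def record_usage_metric(conn, pipeline_id, events):
        # Attempt the kind of usage-metric write introduced on October 23.
        try:
            with conn.cursor() as cur:
                cur.execute(
                    "INSERT INTO pipeline_usage_metrics (pipeline_id, events)"
                    " VALUES (%s, %s)",
                    (pipeline_id, events),
                )
            conn.commit()
        except errors.ReadOnlySqlTransaction:
            # Postgres rejects writes once it is in read-only mode; roll back
            # and drop the metric instead of failing the surrounding request.
            conn.rollback()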
How we fixed it:
We first enabled two replicas of the database, which allowed the Web UI to load pages again, albeit slowly.
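As an illustration of why replicas helped (a minimal sketch with hypothetical connection details, not a description of our exact setup), configuration reads can be served from a replica so that the overloaded primary is no longer on the page-load path:

    import psycopg2

    # Hypothetical connection details; account configuration is read from a
    # replica instead of the struggling primary.
    replica = psycopg2.connect("host=db-replica-1 dbname=pipeline user=webui")

    def load_account_config(account_id):
        with replica.cursor() as cur:
            cur.execute(
                "SELECT key, value FROM account_config WHERE account_id = %s",
                (account_id,),
            )
            return dict(cur.fetchall())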
We then deployed a new code change that removed the superfluous usage-metric writes to the Postgres database. This reduced the query volume, and Web UI performance returned to normal. We kept the incident open while working on further remediation.
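A minimal sketch of this kind of mitigation, with hypothetical names, is a flag that disables the new usage-metric writes without touching the configuration read path:

    import psycopg2

    METRIC_WRITES_ENABLED = False  # turned off as part of the fix

    def maybe_record_usage_metric(conn, pipeline_id, events):
        if not METRIC_WRITES_ENABLED:
            return  # skip the write; the shared database sees no extra load
        with conn.cursor() as cur:
            cur.execute(
                "INSERT INTO pipeline_usage_metrics (pipeline_id, events)"
                " VALUES (%s, %s)",
                (pipeline_id, events),
            )
        conn.commit()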
Finally, we took steps to bring the Postgres database back to a normal state. This remediation phase was complicated by the fact that a full backup had not completed successfully in the previous two days. By the end of the incident, the database was operating normally and a full backup had completed.
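For illustration only (paths and the database name are hypothetical), a full backup of a Postgres database can be taken with pg_dump and treated as failed unless it completes:

    import datetime
    import subprocess

    def run_full_backup(dbname="pipeline"):
        # pg_dump in custom format; check=True raises if the backup fails,
        # so an incomplete backup cannot go unnoticed.
        out = "/backups/{}-{}.dump".format(dbname, datetime.date.today().isoformat())
        subprocess.run(
            ["pg_dump", "--format=custom", "--file", out, dbname],
            check=True,
        )
        return out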
What we are doing to prevent it from happening again:
The problems with the Wednesday, October 23, 2024 deployment revealed themselves only under the load of our production environment. To better simulate production, we will update our testing environment and processes to use higher data loads.
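As a sketch of what such testing could look like (the write rates, table, and connection details are hypothetical), a simple generator can replay production-like metric write volume against a test database:

    import time
    import psycopg2

    def generate_metric_write_load(dsn, writes_per_second, duration_seconds):
        # Sustain a steady stream of metric-style writes for the given duration.
        conn = psycopg2.connect(dsn)
        deadline = time.monotonic() + duration_seconds
        i = 0
        while time.monotonic() < deadline:
            with conn.cursor() as cur:
                cur.execute(
                    "INSERT INTO pipeline_usage_metrics (pipeline_id, events)"
                    " VALUES (%s, %s)",
                    ("load-test-{}".format(i % 100), 1),
                )
            conn.commit()
            i += 1
            time.sleep(1.0 / writes_per_second)
        conn.close()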
We will closely evaluate the current workload of the Postgres database and consider moving metrics into a separate database, rather than combining metrics and configuration in one location. With separate databases, the increased metric write load could not have blocked the configuration reads that the Web UI depends on, avoiding the customer impact of this incident.
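A minimal sketch of the proposed separation, with hypothetical connection details and table names:

    import psycopg2

    # Configuration and metrics live in different databases.
    config_db = psycopg2.connect("host=db-config dbname=account_config")
    metrics_db = psycopg2.connect("host=db-metrics dbname=pipeline_metrics")

    def load_account_config(account_id):
        # Configuration reads hit a database that receives no metric writes.
        with config_db.cursor() as cur:
            cur.execute(
                "SELECT key, value FROM account_config WHERE account_id = %s",
                (account_id,),
            )
            return dict(cur.fetchall())

    def record_usage_metric(pipeline_id, events):
        # Metric writes land elsewhere, so a write-heavy workload cannot
        # push the configuration database into read-only mode.
        with metrics_db.cursor() as cur:
            cur.execute(
                "INSERT INTO pipeline_usage_metrics (pipeline_id, events)"
                " VALUES (%s, %s)",
                (pipeline_id, events),
            )
        metrics_db.commit()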
We will re-evaluate our backup strategy for the Postgres database, since the lack of a recent full backup slowed down the remediation phase.