Dates:
Start Time: Saturday, October 26, 2024 at 01:00 UTC
End Time: Saturday, October 26, 2024 at 20:23 UTC
Duration: 19 hours and 23 minutes
(Note that customer impact was limited to the first four hours of the incident.)
What happened:
For approximately two hours, pages on the Web UI for Pipeline did not load. Pages then began to load again, but often only after delays of 15 seconds or longer. After another two hours, the Web UI returned to normal operation. The incident was kept open until all remediation was complete.
The Web UI for Log Analysis and the ingress and egress of Pipeline data were unaffected.
Why it happened:
On Wednesday, October 23, 2024, we deployed a change to our Pipeline service that began saving pipeline usage metrics to a Postgres database. The same database also stores account configuration information, which must be read in order to display pages in the Web UI.
The deployment on Wednesday changed the performance profile of the database, most notably by increasing the number and frequency of writes. There was no immediate customer impact, but we noted that backups of the database could no longer complete successfully because of the increased load.
Customer impact began only on Saturday, October 26, when an unrelated user action (running a Profiler from the Web UI) placed additional demands on the Postgres database. The database could no longer keep up with its query load and, by design, moved into a read-only mode to prevent any loss of data. At times the database was entirely inaccessible or very slow to respond. Because the Web UI relies on configuration information stored in this database, it was unable to display pages.
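To illustrate the failure mode (this is not our actual code; the table and function names are hypothetical), here is a minimal Python sketch of how a write to a Postgres database in read-only mode surfaces to an application using psycopg2:

    import psycopg2
    from psycopg2 import errors

    def record_usage_metric(conn, pipeline_id, events):
        # Attempt the kind of usage-metric write introduced on October 23.
        try:
            with conn.cursor() as cur:
                cur.execute(
                    "INSERT INTO pipeline_usage_metrics (pipeline_id, events)"
                    " VALUES (%s, %s)",
                    (pipeline_id, events),
                )
            conn.commit()
        except errors.ReadOnlySqlTransaction:
            # Postgres rejects writes once it is in read-only mode; roll back
            # and drop the metric instead of failing the surrounding request.
            conn.rollback()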
How we fixed it:
We first enabled two replicas of the database, which allowed the Web UI to load pages again, albeit slowly.
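As an illustration of why replicas helped (a minimal sketch with hypothetical connection details, not a description of our exact setup), configuration reads can be served from a replica so that the overloaded primary is no longer on the page-load path:

    import psycopg2

    # Hypothetical connection details; account configuration is read from a
    # replica instead of the struggling primary.
    replica = psycopg2.connect("host=db-replica-1 dbname=pipeline user=webui")

    def load_account_config(account_id):
        with replica.cursor() as cur:
            cur.execute(
                "SELECT key, value FROM account_config WHERE account_id = %s",
                (account_id,),
            )
            return dict(cur.fetchall())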
We then deployed a new code change that removed the superfluous usage-metric writes to the Postgres database. This reduced the query volume, and Web UI performance returned to normal. We kept the incident open while working on further remediation.
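A minimal sketch of this kind of mitigation, with hypothetical names, is a flag that disables the new usage-metric writes without touching the configuration read path:

    import psycopg2

    METRIC_WRITES_ENABLED = False  # turned off as part of the fix

    def maybe_record_usage_metric(conn, pipeline_id, events):
        if not METRIC_WRITES_ENABLED:
            return  # skip the write; the shared database sees no extra load
        with conn.cursor() as cur:
            cur.execute(
                "INSERT INTO pipeline_usage_metrics (pipeline_id, events)"
                " VALUES (%s, %s)",
                (pipeline_id, events),
            )
        conn.commit()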
Finally, we took steps to bring the Postgres database back to a normal state. This remediation phase was complicated by the fact that a full backup had not completed successfully in the previous two days. By the end of the incident, the database was operating normally and a full backup had completed.
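For illustration only (paths and the database name are hypothetical), a full backup of a Postgres database can be taken with pg_dump and treated as failed unless it completes:

    import datetime
    import subprocess

    def run_full_backup(dbname="pipeline"):
        # pg_dump in custom format; check=True raises if the backup fails,
        # so an incomplete backup cannot go unnoticed.
        out = "/backups/{}-{}.dump".format(dbname, datetime.date.today().isoformat())
        subprocess.run(
            ["pg_dump", "--format=custom", "--file", out, dbname],
            check=True,
        )
        return out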
What we are doing to prevent it from happening again:
The problems with the Wednesday, October 23, 2024 deployment revealed themselves only under the load of our production environment. To better simulate production, we will update our testing environment and processes to use higher data loads.
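As a sketch of what such testing could look like (the write rates, table, and connection details are hypothetical), a simple generator can replay production-like metric write volume against a test database:

    import time
    import psycopg2

    def generate_metric_write_load(dsn, writes_per_second, duration_seconds):
        # Sustain a steady stream of metric-style writes for the given duration.
        conn = psycopg2.connect(dsn)
        deadline = time.monotonic() + duration_seconds
        i = 0
        while time.monotonic() < deadline:
            with conn.cursor() as cur:
                cur.execute(
                    "INSERT INTO pipeline_usage_metrics (pipeline_id, events)"
                    " VALUES (%s, %s)",
                    ("load-test-{}".format(i % 100), 1),
                )
            conn.commit()
            i += 1
            time.sleep(1.0 / writes_per_second)
        conn.close()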
We will closely evaluate the current workload of the Postgres database and consider moving metrics into a separate database, rather than combining metrics and configuration in one location. With separate databases, the increased metric write load could not have blocked the configuration reads that the Web UI depends on, avoiding the customer impact of this incident.
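A minimal sketch of the proposed separation, with hypothetical connection details and table names:

    import psycopg2

    # Configuration and metrics live in different databases.
    config_db = psycopg2.connect("host=db-config dbname=account_config")
    metrics_db = psycopg2.connect("host=db-metrics dbname=pipeline_metrics")

    def load_account_config(account_id):
        # Configuration reads hit a database that receives no metric writes.
        with config_db.cursor() as cur:
            cur.execute(
                "SELECT key, value FROM account_config WHERE account_id = %s",
                (account_id,),
            )
            return dict(cur.fetchall())

    def record_usage_metric(pipeline_id, events):
        # Metric writes land elsewhere, so a write-heavy workload cannot
        # push the configuration database into read-only mode.
        with metrics_db.cursor() as cur:
            cur.execute(
                "INSERT INTO pipeline_usage_metrics (pipeline_id, events)"
                " VALUES (%s, %s)",
                (pipeline_id, events),
            )
        metrics_db.commit()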
We will re-evaluate our backup strategy for the Postgres database, since the lack of a recent full backup slowed down the remediation phase.