Degraded performance for WebUI, Ingestion, Alerting, Searching, Live Tail, Graphing, and Timelines
Incident Report for Mezmo Status Page
Postmortem

Dates:
Start Time: Wednesday, October 5, 2022, at 14:27 UTC
End Time: Wednesday, October 5, 2022, at 14:45 UTC
Duration: 18 minutes

What happened:

Log ingestion was partially halted. The WebUI was largely unresponsive, and most API calls failed. Because many newly submitted logs were not being ingested, new logs were not immediately available for Alerting, Searching, Live Tail, Graphing, Timelines, and Archiving.

Why it happened:

We recently added a new API gateway, Kong, which acts as a proxy for all of our other services. Over several weeks, we had gradually increased the amount of traffic directed through the gateway and had seen no ill effects. Prior to the incident, only some of the ingestion traffic went through the gateway.

Kong was restarted after a routine configuration change. After the restart, all traffic for our ingestion service began to go through Kong. Our monitoring quickly revealed that the Kong service did not have enough pods to keep up with the increased workload, causing many requests to fail.

How we fixed it:

We manually added more pods to the Kong service. Ingestion, the WebUI, and API calls began to work normally again. Once ingestion had resumed, LogDNA agents running in customer environments resent all locally cached logs to our service for ingestion. No data was lost.
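
As a rough illustration only, a manual scale-up of this kind could be done with the Kubernetes Python client as in the sketch below. The deployment name, namespace, and replica count are placeholders for the example, not the actual production values.

    from kubernetes import client, config

    config.load_kube_config()
    apps = client.AppsV1Api()

    # Patch only the replica count of the gateway deployment so more pods
    # are scheduled to absorb the ingestion traffic.
    apps.patch_namespaced_deployment_scale(
        name="kong-proxy",                 # placeholder deployment name
        namespace="gateway",               # placeholder namespace
        body={"spec": {"replicas": 12}},   # placeholder target pod count
    )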

What we are doing to prevent it from happening again:

We updated our Kubernetes configuration so that the Kong API gateway service is always assigned enough pods to handle all traffic.

We will update the Kong gateway to distribute ingestion traffic more evenly across the available pods.

We will adjust our deployment processes so that pods are restarted more gradually, which will reduce the impact of a similar scenario.

We will explore autoscaling policies so that more pods can be added automatically in a similar situation.
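
For illustration, the guaranteed minimum pod count and the automatic scale-up could be expressed together as a Horizontal Pod Autoscaler. The sketch below uses the Kubernetes Python client; the names, namespace, and thresholds are placeholders rather than the actual production values.

    from kubernetes import client, config

    config.load_kube_config()
    autoscaling = client.AutoscalingV1Api()

    # Define an HPA that keeps a floor of pods for the gateway and adds
    # more automatically when load rises.
    hpa = client.V1HorizontalPodAutoscaler(
        metadata=client.V1ObjectMeta(name="kong-proxy", namespace="gateway"),
        spec=client.V1HorizontalPodAutoscalerSpec(
            scale_target_ref=client.V1CrossVersionObjectReference(
                api_version="apps/v1", kind="Deployment", name="kong-proxy"
            ),
            min_replicas=6,     # placeholder floor: never run fewer pods than this
            max_replicas=20,    # placeholder ceiling for automatic scale-up
            target_cpu_utilization_percentage=70,  # placeholder scaling trigger
        ),
    )
    autoscaling.create_namespaced_horizontal_pod_autoscaler(
        namespace="gateway", body=hpa
    )

In practice, the scaling signal would be tuned to the gateway's actual bottleneck (CPU, connections, or request latency) rather than a fixed CPU percentage.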

Posted Oct 12, 2022 - 18:53 UTC

Resolved
This incident has been resolved. All services are fully operational.
Posted Oct 05, 2022 - 16:05 UTC
Monitoring
Service is restored but we are still monitoring.
Posted Oct 05, 2022 - 14:58 UTC
This incident affected: Log Analysis (Log Ingestion (Agent/REST API/Code Libraries), Log Ingestion (Heroku), Log Ingestion (Syslog), Web App, Search, Alerting, Livetail).