Dates:
Start Time: 8:32 pm UTC, Tuesday August 29th, 2023
End Time: 10:04 pm UTC, Tuesday August 29th, 2023
Duration: 92 minutes
What happened:
Our Kong Gateway service stopped functioning and all connection requests to our ingestion service and web service failed. The Web UI did not load, and log lines could not be sent by either our agent or our API. Log lines sent via syslog were unaffected.
Kong was unavailable for two periods: one lasting 27 minutes (8:32 pm UTC to 8:59 pm UTC) and another lasting 9 minutes (9:43 pm UTC to 9:52 pm UTC). Once Kong became available again, the Web UI was immediately accessible. Agents resent locally cached log lines, as did any API clients implemented with retry strategies. Our service then processed the backlog of log lines, passing them to downstream services such as alerting, live tail, archiving, and indexing (which makes lines visible in the Web UI for searching, graphing, and timelines). The backlog was cleared ~20 minutes after Kong returned to normal operation the first time, and ~10 minutes after the second.
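For illustration, here is a minimal sketch of the kind of retry strategy that let API clients recover automatically: keep lines in a local cache, retry with exponential backoff, and only drop the batch once the gateway accepts it. The endpoint URL and payload shape are assumptions for this example, not our actual API.

```python
import time
import requests

# Hypothetical ingestion endpoint for illustration only; not our actual API.
INGEST_URL = "https://logs.example.com/ingest"

def send_with_retry(lines, max_attempts=5, base_delay=1.0):
    """Send a batch of log lines, retrying with exponential backoff.

    The batch stays in the caller's local cache until this returns True,
    so nothing is lost while the gateway is unavailable.
    """
    for attempt in range(max_attempts):
        try:
            resp = requests.post(INGEST_URL, json={"lines": lines}, timeout=10)
            if resp.status_code == 200:
                return True
        except requests.RequestException:
            pass  # connection refused or timed out while the gateway is down
        time.sleep(base_delay * (2 ** attempt))  # 1s, 2s, 4s, 8s, 16s
    return False  # caller keeps the batch cached and tries again later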
Why it happened:
The pods running our Kong Gateway were overwhelmed with connection requests. CPU usage climbed to the point that health checks began to fail and the pods were shut down. We’ve determined through research and experimentation that the cause was a sudden, brief increase in the volume of traffic directed to our service. Our service is designed to handle increases in traffic, but these spikes were approximately 100 times our normal traffic volume. The source of the traffic is unknown. The increase came in two spikes, which correspond to the two periods when Kong became unavailable.
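For context on the health-check mechanism, here is a rough sketch, using the Kubernetes Python client, of the kind of liveness probe that takes a pod out of service when it stops responding; the path, port, and thresholds are illustrative assumptions, not our production values. When CPU is saturated, the probe’s HTTP call times out repeatedly, the failure threshold is crossed, and the pod is shut down.

```python
from kubernetes import client

# Illustrative liveness probe; path, port, and thresholds are assumptions.
liveness_probe = client.V1Probe(
    http_get=client.V1HTTPGetAction(path="/status", port=8100),
    initial_delay_seconds=5,
    period_seconds=10,    # probe every 10 seconds
    timeout_seconds=5,    # a CPU-saturated pod can't answer in time...
    failure_threshold=3,  # ...and after 3 consecutive misses the pod is shut down
)

# Attached to a container spec (image and name are also illustrative).
container = client.V1Container(
    name="kong-proxy",
    image="kong:3.4",
    liveness_probe=liveness_probe,
)
```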
How we fixed it:
We manually scaled up the number of pods devoted to running our Kong Gateway. During the first spike of traffic we doubled the number of pods; during the second, we quadrupled it. This helped speed up processing of the backlog of log lines that agents resent once Kong was available again. It’s unclear whether the larger pod count on its own would have been able to absorb the spikes as they were happening.
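The manual intervention amounts to patching the replica count of the Kong deployment. Below is a minimal sketch with the Kubernetes Python client; the deployment name, namespace, and replica numbers are assumptions for illustration, not the exact values or tooling we used.

```python
from kubernetes import client, config

def scale_kong(replicas, deployment="kong-gateway", namespace="gateway"):
    """Patch the replica count of the Kong deployment.

    The deployment name and namespace are assumptions for illustration.
    """
    config.load_kube_config()  # or config.load_incluster_config() inside the cluster
    apps = client.AppsV1Api()
    apps.patch_namespaced_deployment_scale(
        name=deployment,
        namespace=namespace,
        body={"spec": {"replicas": replicas}},
    )

# During the first spike we doubled the pod count; during the second we quadrupled it,
# e.g. scale_kong(8) then scale_kong(16) if the baseline were 4 pods (hypothetical numbers).
```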
What we are doing to prevent it from happening again:
We are now running our Kong service with more pods, so more resources are available to handle similar spikes in traffic. We will add auto-scaling to the Kong service so that additional pods are provisioned automatically as needed. We will also add metrics to help identify the origin of any similar spikes in traffic.
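As a sketch of what the planned auto-scaling could look like, here is a CPU-based HorizontalPodAutoscaler created with the Kubernetes Python client; the names, replica bounds, and CPU target are assumptions, and the final configuration may differ.

```python
from kubernetes import client, config

def create_kong_hpa(namespace="gateway"):
    """Create a CPU-based HorizontalPodAutoscaler for the Kong deployment.

    Names, replica bounds, and the CPU target are illustrative assumptions.
    """
    config.load_kube_config()
    hpa = client.V1HorizontalPodAutoscaler(
        metadata=client.V1ObjectMeta(name="kong-gateway"),
        spec=client.V1HorizontalPodAutoscalerSpec(
            scale_target_ref=client.V1CrossVersionObjectReference(
                api_version="apps/v1", kind="Deployment", name="kong-gateway"
            ),
            min_replicas=4,
            max_replicas=32,
            target_cpu_utilization_percentage=70,  # scale out before CPU saturates
        ),
    )
    client.AutoscalingV1Api().create_namespaced_horizontal_pod_autoscaler(
        namespace=namespace, body=hpa
    )
```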