Start Time: Thursday, February 18, 2022, at 00:10 UTC
End Time: Thursday, February 24, 2022, at 23:43 UTC
What happened:
The ingestion of new logs to our Syslog endpoint was intermittently failing.
Why it happened:
We recently introduced a new service (Syslog Forwarder) to handle the ingestion of logs sent over Syslog. As the name implies, it forwards logs to downstream services. It was designed to send all logs submitted for each account to a single port opened on the downstream services. Our original design did not include load balancing, and it performed well in our pre-production testing.
Once the service was in production, however, it became apparent that some customer accounts submit logs at a higher volume than the downstream services could process. When this happened, the Syslog Forwarder buffered log lines in memory. Memory usage increased until the pods crashed. Any log lines held on those pods were lost and never ingested.
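To illustrate the failure mode, here is a rough back-of-the-envelope sketch of how an unbounded in-memory buffer grows when logs arrive faster than they can be forwarded. The function name and all numbers are hypothetical and chosen only for illustration; they are not measurements from this incident.

```python
def seconds_until_memory_limit(ingest_rate, drain_rate, line_bytes, memory_limit_bytes):
    """Estimate how long an unbounded buffer lasts before hitting a memory limit.

    ingest_rate and drain_rate are in lines per second. Returns None when the
    downstream side keeps up, because the buffer then never grows.
    """
    backlog_growth = ingest_rate - drain_rate  # lines added to the buffer each second
    if backlog_growth <= 0:
        return None
    return memory_limit_bytes / (backlog_growth * line_bytes)


# Hypothetical example: 50,000 lines/s arriving, 30,000 lines/s forwarded,
# 500-byte lines, and a 2 GiB memory limit on the pod.
estimate = seconds_until_memory_limit(
    ingest_rate=50_000,
    drain_rate=30_000,
    line_bytes=500,
    memory_limit_bytes=2 * 1024**3,
)
print(f"Buffer fills in about {estimate:.0f} seconds")
```

Under these assumed figures the buffer fills in a few minutes, which is consistent with the pattern of intermittent crashes whenever a high-volume account outpaced the downstream services.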
How we fixed it:
We improved the design of the Syslog Forwarder by adding a pool of connections to the downstream services. In effect, we added traffic shaping to the Syslog Forwarder.
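As a minimal sketch of the general idea (not the actual Syslog Forwarder code), the example below spreads log lines across a pool of downstream connections in round-robin order. The class name, the use of plain callables in place of network connections, and the bounded per-connection queue are all assumptions made for illustration.

```python
import itertools
import queue


class ForwardingPool:
    """Hypothetical pool that distributes lines across several downstream senders."""

    def __init__(self, senders, max_buffered_lines):
        # One bounded queue per connection, so memory use has a fixed ceiling.
        self._senders = list(senders)
        self._queues = [queue.Queue(maxsize=max_buffered_lines) for _ in self._senders]
        self._next_index = itertools.cycle(range(len(self._senders)))

    def submit(self, line):
        """Queue a line on the next connection; return False if its queue is full."""
        index = next(self._next_index)
        try:
            self._queues[index].put_nowait(line)
            return True
        except queue.Full:
            return False

    def drain(self):
        """Send everything currently queued, one connection at a time."""
        for pending, send in zip(self._queues, self._senders):
            while True:
                try:
                    line = pending.get_nowait()
                except queue.Empty:
                    break
                send(line)


# Usage: three stand-in "connections" that simply collect what they receive.
received = [[], [], []]
pool = ForwardingPool([r.append for r in received], max_buffered_lines=2)
for i in range(6):
    pool.submit(f"line {i}")
pool.drain()
print(received)
```

In this sketch a full queue makes `submit` return False instead of letting the buffer grow without limit, so the caller can decide whether to retry or apply backpressure. How the production service handles that case is not described here.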
What we are doing to prevent it from happening again:
The new architecture has been incorporated and proven resilient in production. No further work is needed to prevent this kind of incident from happening again.