Start Time: Wednesday, February 17, 2022, at 20:56 UTC
End Time: Thursday, February 18, 2022, at 02:15 UTC
What happened:
The ingestion of new logs to our Syslog endpoint was intermittently failing.
Why it happened:
We made a code change to the part of our service (Syslog Forwarder) that ingests logs sent over Syslog, and in doing so inadvertently changed how memory is managed. Routine garbage collection stopped, and memory usage grew on the pods that accept newly submitted log lines over Syslog. Eventually, the growing memory usage caused the pods to crash. Any log lines held on those pods were lost and never ingested.
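As a loose illustration of this failure mode, here is a minimal Python sketch. It is an analogy only: this report does not state the Syslog Forwarder's language or the exact mechanism, and `LogLine`, `ingest`, and `demo` are hypothetical names. It shows how objects that are only reclaimable by a garbage collector stay in memory while collection is not running, and are freed once a collection pass happens.

```python
import gc
import weakref


class LogLine:
    """Hypothetical log record that ends up in a reference cycle."""

    def __init__(self, text):
        self.text = text
        self.owner = self  # cycle: reference counting alone cannot free this


def ingest(n):
    """Create n records, drop them, and return weak references for counting."""
    return [weakref.ref(LogLine(f"<14>line {i}")) for i in range(n)]


def count_alive(refs):
    """Count how many of the weakly referenced records are still in memory."""
    return sum(1 for ref in refs if ref() is not None)


def demo(n=1000):
    was_enabled = gc.isenabled()
    gc.disable()  # stand-in for "routine garbage collection stopped"
    try:
        refs = ingest(n)
        alive_without_gc = count_alive(refs)
        gc.collect()  # an explicit collection pass reclaims the cycles
        alive_after_collect = count_alive(refs)
    finally:
        if was_enabled:
            gc.enable()
    return alive_without_gc, alive_after_collect
```

In this sketch, every dropped record stays in memory until a collection pass runs. On a long-running pod with no collection, that kind of growth would continue until the process hits its memory limit.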
How we fixed it:
We reverted to the previous version of the Syslog Forwarder service. This stopped the pods from crashing.
We then resolved the memory management issue in our code. The new, fixed version was released to production shortly thereafter and performed as expected.
What we are doing to prevent it from happening again:
We have added regression tests to the Syslog Forwarder service to prevent a similar mistake in the future.
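One possible shape for such a regression test is sketched below in Python using the standard `tracemalloc` module. This is an assumption-laden example, not the tests we actually added: `handle_syslog_line`, `leaky_handler`, `memory_growth_bytes`, and the 64 KiB limit are all hypothetical. The idea is to push many log lines through the handler and fail if retained memory keeps growing.

```python
import tracemalloc

GROWTH_LIMIT_BYTES = 64 * 1024  # hypothetical budget for retained memory


def handle_syslog_line(line, buffer):
    """Hypothetical handler: buffers lines and flushes every 100 of them."""
    buffer.append(line)
    if len(buffer) >= 100:
        buffer.clear()  # stand-in for forwarding the batch downstream


def leaky_handler(line, buffer):
    """Hypothetical regression: lines are buffered but never released."""
    buffer.append(line)


def memory_growth_bytes(handler, iterations=10_000):
    """Return how many bytes remain allocated after processing many lines."""
    buffer = []
    for i in range(1000):  # warm-up so one-time allocations are not counted
        handler(f"<14>warmup {i}", buffer)
    tracemalloc.start()
    before, _peak = tracemalloc.get_traced_memory()
    for i in range(iterations):
        handler(f"<14>regression test line {i}", buffer)
    after, _peak = tracemalloc.get_traced_memory()
    tracemalloc.stop()
    return after - before


def test_handler_does_not_retain_memory():
    assert memory_growth_bytes(handle_syslog_line) < GROWTH_LIMIT_BYTES
```

A test like this could run in CI against the real handler, so that a change which stops memory from being released fails the build rather than reaching production pods.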