Dates:
Start Time: Saturday, February 26, 2022, at 19:51 UTC
End Time: Sunday, February 27, 2022, at 22:13 UTC
Duration: 26:22:00
What happened:
Ingestion of new logs sent to our Syslog endpoint was intermittently delayed. Only logs sent over a custom port were affected.
Why it happened:
We recently introduced a new service (Syslog Forwarder) to handle the ingestion of logs sent over Syslog. As the name implies, it forwards logs to downstream services. Logs are sent from a range of ports on Syslog Forwarder to a range of ports used by clients running on downstream services. This design worked well in our pre-production testing, which used only a limited number of custom ports.
Once running in production, however, the Syslog Forwarder needed to connect to a much larger number of custom ports. We then saw that the ephemeral port ranges of the clients running on downstream services overlapped with the port ranges used by the Syslog Forwarder. As a result, a service or client occasionally hit a port conflict when it tried to start. It would then retry until it found an open port. These retries delayed ingestion.
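The overlap condition described above can be checked mechanically. The following is a minimal Python sketch, not code from Syslog Forwarder: the helper names are illustrative, and the port numbers are hypothetical, since this report does not state the actual ranges.

```python
from typing import Tuple

# An inclusive (low, high) pair of port numbers.
PortRange = Tuple[int, int]


def ranges_overlap(a: PortRange, b: PortRange) -> bool:
    """Return True if two inclusive port ranges share at least one port."""
    return a[0] <= b[1] and b[0] <= a[1]


def overlapping_ports(a: PortRange, b: PortRange) -> range:
    """Return the ports common to both ranges (an empty range if none)."""
    low, high = max(a[0], b[0]), min(a[1], b[1])
    return range(low, high + 1)


# Hypothetical values, chosen only for illustration.
forwarder_range: PortRange = (40000, 45000)
client_ephemeral_range: PortRange = (32768, 60999)

if ranges_overlap(forwarder_range, client_ephemeral_range):
    shared = overlapping_ports(forwarder_range, client_ephemeral_range)
    print(f"Conflict possible on {len(shared)} ports: {shared[0]}-{shared[-1]}")
```

Any port in the shared span can be claimed by a client as an ephemeral source port before the process that expects it has bound to it, which is the kind of start-up conflict described above.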
How we fixed it:
We changed the ephemeral port ranges of the clients running on downstream services so they no longer overlapped with the port ranges used by the Syslog Forwarder.
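As an illustration of this kind of fix, the sketch below assumes the downstream hosts run Linux, where the ephemeral port range is exposed as two integers in /proc/sys/net/ipv4/ip_local_port_range. This report does not state the operating system or the actual values, so the forwarder range and the new ephemeral range shown here are hypothetical.

```python
from typing import Tuple

PortRange = Tuple[int, int]  # inclusive (low, high)


def parse_port_range(text: str) -> PortRange:
    """Parse a 'low high' string, the format used by ip_local_port_range."""
    low, high = (int(part) for part in text.split())
    if not 0 < low <= high <= 65535:
        raise ValueError(f"invalid port range: {text!r}")
    return low, high


def is_disjoint(a: PortRange, b: PortRange) -> bool:
    """Return True if the two inclusive ranges share no port."""
    return a[1] < b[0] or b[1] < a[0]


# Hypothetical range used by the forwarder (not taken from this report).
FORWARDER_RANGE: PortRange = (40000, 45000)

# The Linux default ephemeral range overlaps the hypothetical forwarder range.
old_ephemeral = parse_port_range("32768\t60999")
# A hypothetical narrowed range that avoids it.
new_ephemeral = parse_port_range("50000 60999")

print("old range disjoint:", is_disjoint(old_ephemeral, FORWARDER_RANGE))
print("new range disjoint:", is_disjoint(new_ephemeral, FORWARDER_RANGE))
```

On a Linux host, the same check could be run against the live setting by passing the contents of /proc/sys/net/ipv4/ip_local_port_range to `parse_port_range`, for example as a pre-deployment check.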
What we are doing to prevent it from happening again:
The new ephemeral port range is now in place and has proven resilient in production. No further work is needed to prevent this kind of incident from happening again.