Dates:
Start Time: Monday, December 9, 2024 at 17:32 UTC
End Time: Monday, December 9, 2024 at 20:28 UTC
Duration: 2 hours and 56 minutes
What happened:
For some accounts, data sent to Mezmo Pipelines was slow to be processed and sent onwards to its destinations, or was not processed at all, for the duration of the incident.
Why it happened:
We released a new version of the agent (3.10.1) that improves how log lines are sent to the Mezmo service for ingestion. Most applications add newly written log lines by a process known as “appending”; by contrast, a small number use a process called “truncation”. The new agent version improves how it handles log lines added to logs using truncation, particularly when log lines are written frequently.
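For illustration only, here is a minimal Python sketch of the two write patterns (the file name and functions are hypothetical, not the agent's implementation): an appending writer adds lines to the end of an existing file, while a truncating writer reopens the file and discards its previous contents, so the agent must notice that the file shrank and re-read it from the beginning.

    # Minimal sketch of the two log-writing patterns described above.
    # LOG_PATH is a hypothetical file name used only for this example.
    LOG_PATH = "app.log"

    def write_by_appending(lines):
        # "Appending": new lines are added to the end of the file;
        # existing contents and read offsets remain valid.
        with open(LOG_PATH, "a") as f:
            for line in lines:
                f.write(line + "\n")

    def write_by_truncating(lines):
        # "Truncation": the file is reopened in write mode, which
        # discards its previous contents; a monitoring agent must
        # detect the shrunken file and read it again from the start.
        with open(LOG_PATH, "w") as f:
            for line in lines:
                f.write(line + "\n")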
Many accounts do not monitor any logs that use truncation. A few accounts do, but they write new log lines infrequently. However, a handful of customer accounts have applications that write log lines using truncation at a very high frequency. When these accounts upgraded to agent 3.10.1, there was a very large increase in the volume of data sent to the Mezmo Pipeline service.
The increase was detected by our monitoring. It also caused some pods in some Vector partitions to crash, which affected all accounts on those partitions. Data was still ingested and cached within our service, but for accounts on the affected partitions it was not being processed promptly or sent on to destinations.
How we fixed it:
We temporarily paused the processing of newly ingested data, which stopped pods from crashing. We then identified the accounts (approximately 20) that were sending us an increased volume of data and moved them to a newly created Vector partition on a newly commissioned node. This allowed the accounts on all other Vector partitions to function normally; we restarted Pipeline processing on their pods and data began to flow to destinations again.
For the remaining affected accounts, we applied exclusion rules within our service to remove redundant data and thereby reduce the overall volume of data. We also contacted the owners of the handful of accounts sending log lines at very high volumes and helped them apply exclusion rules within their locally running Mezmo agents. With these changes, the overall volume of ingested data was reduced and could once again be processed and sent to destinations. After monitoring these accounts and seeing no ill effects, we moved them back to the general pool of Vector partitions.
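Conceptually, an exclusion rule is a filter that drops lines matching a pattern before they are processed or sent onwards. The Python sketch below is a generic illustration of that idea; the pattern names are hypothetical and not the actual rules we or the affected customers applied.

    import re

    # Hypothetical exclusion patterns; real rules depend on each
    # account's data and are configured in the agent or the service.
    EXCLUSION_PATTERNS = [
        re.compile(r"DEBUG"),             # drop debug-level lines
        re.compile(r"health[- ]?check"),  # drop health-check noise
    ]

    def apply_exclusion_rules(lines):
        # Keep only lines that match none of the exclusion patterns,
        # reducing the volume of data sent onwards for processing.
        return [
            line for line in lines
            if not any(p.search(line) for p in EXCLUSION_PATTERNS)
        ]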
What we are doing to prevent it from happening again:
We discovered our Vector partitions are configured to use more CPU cores than necessary; under high load, the increased CPU usage causes pods to crash. Accordingly, we will limit the number of cores available to Vector per node.
We will rebalance the number of accounts assigned to Vector partitions, aiming to assign fewer accounts to each one. This will reduce the impact of any similar incidents in the future.
We will explore how to rate limit the data sent to Pipelines by individual agents. Rate limiting will prevent a similar incident from having this impact in the future.
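One common way to implement such a limit is a token bucket. The sketch below is a generic illustration of that technique under assumed parameter names, not a description of how the agent or Pipelines will implement rate limiting.

    import time

    class TokenBucket:
        # Generic token-bucket rate limiter: a sender is allowed to
        # transmit only when enough tokens are available, capping the
        # average rate at rate_per_sec with bursts up to capacity.
        def __init__(self, rate_per_sec, capacity):
            self.rate = rate_per_sec
            self.capacity = capacity
            self.tokens = capacity
            self.last = time.monotonic()

        def allow(self, cost=1):
            now = time.monotonic()
            # Refill tokens based on elapsed time, up to the capacity.
            self.tokens = min(self.capacity,
                              self.tokens + (now - self.last) * self.rate)
            self.last = now
            if self.tokens >= cost:
                self.tokens -= cost
                return True
            return False

In this sketch, an agent would call allow(len(batch)) before sending each batch of log lines and buffer or drop lines whenever the limit is exceeded.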