Dates:
Start Time: Thursday, June 30, 21:40 UTC
End Time: Thursday, June 30, 23:32 UTC
Duration: 1 hour and 52 minutes
What happened:
Our service discarded some log lines for some customers. Our ingestion service accepted these lines successfully, but a downstream service, the parser, removed some of them. Services further downstream (Alerting, Live Tail, Searching, and Archiving) never received the removed lines. In some cases, lines still reached Live Tail, where they appeared with the phrase “(not retained)” appended.
The great majority of customers – 94.2% – were unaffected and had no log lines discarded. Approximately 3.5% had a relatively small number of log lines discarded. Approximately 2.3% had most or all of the log lines submitted during the incident discarded.
Why it happened:
We inadvertently released code into production that contained a bug in the parser service. This bug was known to us and in the process of being fixed in our development environment, but was not yet ready for release to production.
The parser service is where exclusion rules are applied to recently submitted log lines that have been ingested but not yet passed to downstream services (e.g. Alerting, Live Tail, Searching, and Archiving). The bug made the parser exclude log lines that matched inactive exclusion rules.
This included exclusion rules that customers had created in the past and later disabled. Customers with such rules had every log line that matched an inactive rule excluded. If those rules had the “Preserve these lines for live-tail and alerting” option enabled, the excluded lines were still processed for alerts and appeared in Live Tail with the phrase “(not retained)” appended. This affected 3.5% of our customer accounts.
The usage quota feature is implemented as a particular type of exclusion rule even though it is not presented in the UI as an exclusion rule. The bug made the parser exclude all log lines if the usage quota feature was enabled for an account. This affected 2.3% of our customer accounts.
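The rule handling described above can be sketched roughly as follows. This is a minimal illustration, not our parser code: the names `ExclusionRule` and `route_line`, the use of regular expressions for matching, and the modeling of the usage quota as an inactive match-everything rule are all assumptions made for the example.

```python
import re
from dataclasses import dataclass


@dataclass
class ExclusionRule:
    # Hypothetical rule shape; field names are illustrative only.
    pattern: str                          # regex matched against the log line
    active: bool                          # disabled rules must never be applied
    preserve_for_live_tail: bool = False  # "Preserve these lines ..." option


def route_line(line: str, rules: list[ExclusionRule]) -> str:
    """Return 'retain', 'exclude', or 'preserve' for a single log line."""
    for rule in rules:
        # Skipping inactive rules is the check this sketch centers on.
        # Without it, disabled rules would drop lines, as in the incident.
        if not rule.active:
            continue
        if re.search(rule.pattern, line):
            return "preserve" if rule.preserve_for_live_tail else "exclude"
    return "retain"


# Assumption: a usage quota that is not currently enforced is represented
# as an inactive rule that would otherwise match every line.
quota_rule = ExclusionRule(pattern=".*", active=False)
disabled_rule = ExclusionRule(pattern="DEBUG", active=False)
active_rule = ExclusionRule(pattern="healthcheck", active=True)
```

With the `active` check in place, `route_line("DEBUG cache miss", [disabled_rule, quota_rule])` returns `"retain"`. If the check were skipped, the same call would return `"exclude"`, and the match-everything quota rule would exclude every line for the account.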
Our monitoring did not detect the decrease in lines being passed from the parser to downstream services because the change fell within the normal range of fluctuation for that rate. The rate varies significantly as traffic changes and as customers enable or disable exclusion rules.
How we fixed it:
We reverted the last release of parser code to the previous version. Once the previous version was deployed to all pods running the parser service, log lines stopped being discarded.
What we are doing to prevent it from happening again:
We added a code-level test to ensure that the parser never applies inactive exclusion rules (such tests are part of our standard operating procedure).
We will review our release process to understand how the code containing the bug was moved into production and improve our processes to prevent a similar event in the future.
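A test of the kind described above, checking that inactive exclusion rules are never applied, might look something like the sketch below. It is illustrative only: `surviving_lines` is a hypothetical stand-in for the parser, and the dictionary-based rule format is an assumption, not our actual data model.

```python
import re
import unittest


def surviving_lines(lines, rules):
    """Hypothetical stand-in for the parser: drop lines matched by active rules."""
    active_patterns = [r["pattern"] for r in rules if r["active"]]
    return [
        line
        for line in lines
        if not any(re.search(p, line) for p in active_patterns)
    ]


class InactiveRulesAreIgnored(unittest.TestCase):
    LINES = ["DEBUG cache miss", "ERROR disk full"]

    def test_disabled_rule_drops_nothing(self):
        rules = [{"pattern": "DEBUG", "active": False}]
        self.assertEqual(surviving_lines(self.LINES, rules), self.LINES)

    def test_inactive_match_all_rule_drops_nothing(self):
        # Assumed model of a usage quota that is not currently enforced.
        rules = [{"pattern": ".*", "active": False}]
        self.assertEqual(surviving_lines(self.LINES, rules), self.LINES)

    def test_active_rule_still_applies(self):
        # Guards against "fixing" the bug by disabling exclusions entirely.
        rules = [{"pattern": "DEBUG", "active": True}]
        self.assertEqual(surviving_lines(self.LINES, rules), ["ERROR disk full"])
```

Saved to a file, the cases can be run with `python -m unittest`. The third case is included so that the test suite also fails if active rules stop being applied.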