Start Time: Wednesday, January 26, 2022, at 15:45:00 UTC
End Time: Wednesday, January 26, 2022, at 16:30:00 UTC
Ingestion was halted and newly submitted logs were not immediately available for Alerting, Live Tail, Searching, Graphing, and Timelines. Some alerts were never triggered.
Once ingestion had resumed, LogDNA agents running on customer environments resent all locally cached logs to our service for ingestion. No data was lost.
Why it happened:
Our Redis database had a failover and the services that depend on it were unable to recover automatically. Normally, the pods running our ingestion service deliberately crash until they are able to access Redis again. However, these pods were in a bad state and unable to reconnect when Redis returned.
Since ingestion was halted, newly submitted logs were not passed on to many downstream services, such as Alerting, Live Tail, Searching, Graphing, and Timelines.
How we fixed it:
We manually restarted all the pods of our ingestion service, then restarted all the sentinel pods of Redis. The ingestion service became operational again and logs were passed on to all downstream services. Over a short period of time, these services processed the backlog of logs and newly submitted logs were again available without delays.
What we are doing to prevent it from happening again:
The ingestion pods were in a bad state because they had not been restarted after a configuration change made several days earlier, for reasons unrelated to this incident. The runbook for making such configuration changes has been updated to prevent this procedural failure in the future.
We’re also in the middle of a project to make all services more tolerant of Redis failovers.