Dates:
Start Time: Tuesday, February 8, 2022, at 13:17 UTC
End Time: Tuesday, February 8, 2022, at 14:21 UTC
Duration: 1:04:00
What happened:
Our Web UI was unresponsive for about 10 minutes. Newly submitted logs were not immediately available for Alerting, Searching, Live Tail, Graphing, and Timelines. No data was lost and ingestion was not halted.
Why it happened:
Our Redis database failed over, and the services that depend on it, including the Parser, were unable to reconnect after it recovered. The Parser is upstream of many other services. Consequently, newly submitted logs were not passed on to downstream services such as Alerting, Live Tail, Searching, Graphing, and Timelines. The Web UI was also intermittently unavailable because it requires a connection to Redis.
How we fixed it:
We manually restarted the Redis service, which allowed a new master to be elected. After Redis recovered, we restarted the Parser, Web UI, and other services, which were then able to reestablish their connections to Redis. This restored the Web UI and allowed newly submitted logs to pass from our Parser service to all downstream services. Over a short period of time, these services processed the backlog, and newly submitted logs were again available without delay.
What we are doing to prevent it from happening again:
We recently added functionality to track the flow rate of newly submitted logs. During a Redis failover, this feature requires more memory than we expected, which is why services could not reconnect to Redis. We've increased the memory buffer limits for the relevant portions of our service.
We will also add Redis monitoring to detect unhealthy sentinels more quickly, and we will continue an ongoing project to make all services more tolerant of Redis failovers.
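As an illustration of what tolerating a Redis failover can look like on the client side, here is a minimal Python sketch of reconnecting with capped exponential backoff and jitter. It is not our production code. The function name `connect_with_backoff`, its parameters, and the default values are all hypothetical, and `connect` stands in for whatever call opens a connection in a given service.

```python
import random
import time


def connect_with_backoff(connect, max_attempts=8, base_delay=0.5,
                         max_delay=30.0, retry_on=(ConnectionError,),
                         sleep=time.sleep):
    """Call connect() until it succeeds or max_attempts is reached.

    Hypothetical helper: the wait doubles after each failure, is capped
    at max_delay, and has random jitter added so that many services do
    not all retry at the same moment after a failover.
    """
    for attempt in range(1, max_attempts + 1):
        try:
            return connect()
        except retry_on:
            if attempt == max_attempts:
                # Out of attempts: surface the last error to the caller.
                raise
            delay = min(max_delay, base_delay * 2 ** (attempt - 1))
            sleep(delay + random.uniform(0, delay / 2))
```

A real client library may raise its own connection error type rather than the built-in `ConnectionError`; in that case the relevant exception class would be passed through `retry_on`.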