Start Time: Monday, February 6, 2023, at 20:05 UTC
End Time: Tuesday, February 7, 2023, at 00:30 UTC
Duration: 4 hours and 25 minutes
What happened:
Searches returned results slowly or not at all, and our Web UI was intermittently unresponsive, particularly on pages that display log data, such as Live Tail, Graphing, and Timelines. No data was lost, and ingestion was never halted.
Why it happened:
We initiated an upgrade of all nodes in our service, including the nodes that store logs. Pods were gradually moved to other nodes and restarted so as to prevent any interruption in service.
A single log-storage pod did not restart normally. Upon investigation, we found that it had not shut down cleanly, and some files essential to a normal startup had never been written to disk. More significantly, we discovered that the StatefulSets managing our log-storage pods were all using a podManagementPolicy of “OrderedReady” (the default setting), which forces pods to restart in a strict sequence: each pod must be running and ready before the next one starts. The pod that would not restart was in the middle of that sequence, so every pod later in the sequence followed the policy and did not start either. In effect, about 25% of the pods within one zone (out of the three zones devoted to storing logs) were unable to start.
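As an illustration of that policy, here is a minimal StatefulSet sketch; the names, replica count, and image are hypothetical, and only the `podManagementPolicy` field is taken from the incident above:

```yaml
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: log-store                       # hypothetical name
spec:
  replicas: 8                           # hypothetical count
  podManagementPolicy: OrderedReady     # the default: pods start strictly in ordinal order
  serviceName: log-store
  selector:
    matchLabels:
      app: log-store
  template:
    metadata:
      labels:
        app: log-store
    spec:
      containers:
        - name: log-store
          image: example/log-store:1.0  # hypothetical image
```

Under `OrderedReady`, pod `log-store-N+1` is not started until `log-store-N` reports Ready. A single pod stuck at, say, ordinal 5 therefore blocks ordinals 6 and 7 indefinitely, which is the behavior described above.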
The remaining pods in the zone had to absorb the extra work: accepting new logs, compacting data, and answering queries from our internal APIs. This led to slow searches and slow load times for any part of the Web UI that displays data about logs.
How we fixed it:
We temporarily added more pods to serve API calls, increasing the odds that they would succeed.
We changed the podManagementPolicy to “Parallel” so that every pod could restart regardless of its position in the ordered startup sequence.
We made manual fixes to the pod that had not shut down cleanly so it could start again.
These steps brought search latency back to normal and made API calls work again.
We cordoned off two pods that had fallen far behind in processing so they could recover without taking on new tasks. This temporarily removed ~2% of logs from all search results. Once these pods had caught up on all pending tasks, we made them available again for search queries.
What we are doing to prevent it from happening again:
We have changed the podManagementPolicy to “Parallel” for all StatefulSets that manage log-storage pods.
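The change itself is a one-line edit to each log-storage StatefulSet (the name below is hypothetical). One caveat worth noting: `podManagementPolicy` is immutable on a live StatefulSet, so applying it generally means deleting the StatefulSet object while leaving its pods running (for example, `kubectl delete statefulset log-store --cascade=orphan`) and then recreating it with the new policy:

```yaml
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: log-store                   # hypothetical name
spec:
  podManagementPolicy: Parallel     # was OrderedReady; pods now start and stop independently
  # ...rest of the spec unchanged
```

With `Parallel`, a pod that fails to start no longer blocks the pods at higher ordinals, so a single unhealthy pod cannot take a quarter of a zone offline the way it did here.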
We will review the podManagementPolicy of all other areas of our service and make changes where appropriate.
We will add monitoring and alerting to detect high search latency and a rising average time to compact newly inserted logs.
We’ll explore options for adding more resources to each zone so pods are less likely to fall behind on processing tasks when some of their peers are unavailable.
We’ll explore ways to prevent unclean shutdowns of pods when nodes are upgraded.