Dates:
Start Time: Thursday, March 4, 2021, at ~03:45 UTC
End Time: Thursday, March 4, 2021, at ~08:20 UTC
Duration: ~4:35:00
What happened:
Our Web UI returned the error message "Request returned an error. Try again?" when users tried to run a search query or use Live Tail.
Why it happened:
The pods that run our searching and Live Tail services were automatically terminated by our Kubernetes orchestration system. Upon investigation, we discovered we had inadvertently classed these services as low priority. The incident occurred when a large number of other services that were classed as higher priority needed to run to meet usage demands. The orchestration system automatically terminated the lower priority services to make resources available for the higher priority services.
More specifically, these pods were put into a “terminating” state. Normally this state is temporary, a transition between “running” and “terminated”. During this incident, the pods stayed in the “terminating” state indefinitely. Our monitoring detects services that have been “terminated”, but not ones that are stuck in the normally temporary “terminating” state. Consequently, our infrastructure team was not notified.
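To illustrate the kind of check that would catch pods stuck in the “terminating” state, here is a minimal Python sketch. It is an illustration under stated assumptions, not a description of our monitoring. In Kubernetes, a pod that is being deleted carries a `metadata.deletionTimestamp` field, and the sketch flags any pod whose deletion deadline passed more than a chosen slack period ago. The function name `find_stuck_terminating`, the five-minute slack, and the pod names are all hypothetical, and the pods are plain dictionaries rather than live API objects.

```python
from datetime import datetime, timedelta, timezone


def find_stuck_terminating(pods, now, slack=timedelta(minutes=5)):
    """Return names of pods whose deletion deadline passed more than `slack` ago.

    `pods` is a list of dicts shaped like Kubernetes pod objects.
    The function name and the default slack are assumptions for illustration.
    """
    stuck = []
    for pod in pods:
        meta = pod["metadata"]
        deadline = meta.get("deletionTimestamp")
        if deadline is None:
            continue  # pod is not being deleted
        # Kubernetes timestamps are RFC 3339 strings such as "2021-03-04T03:46:00Z"
        deadline_dt = datetime.fromisoformat(deadline.replace("Z", "+00:00"))
        if now - deadline_dt > slack:
            stuck.append(meta["name"])
    return stuck


# Hypothetical sample data: names and timestamps are made up.
now = datetime(2021, 3, 4, 4, 30, tzinfo=timezone.utc)
pods = [
    {"metadata": {"name": "search-0", "deletionTimestamp": "2021-03-04T03:46:00Z"}},
    {"metadata": {"name": "livetail-0", "deletionTimestamp": "2021-03-04T04:29:30Z"}},
    {"metadata": {"name": "ingest-0"}},
]
print(find_stuck_terminating(pods, now))
```

In this sample only `search-0` is reported: `livetail-0` is still inside the slack window and `ingest-0` is not being deleted at all.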
How we fixed it:
We increased the priority of the pods that run our searching and Live Tail services to match the priority of other services. We updated the configuration of our orchestration system to make the change permanent.
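The effect of the priority change can be shown with a deliberately simplified Python model of priority-based preemption. This is a sketch, not the real Kubernetes scheduler: the `schedule` function, the service names, the priority values, and the CPU numbers are all hypothetical. The one behavior it is meant to capture is that a pending pod can evict only running pods of strictly lower priority, so pods of equal priority are left alone.

```python
def schedule(pending, running, capacity):
    """Simplified model of priority-based preemption (not the real scheduler).

    Pods are (name, priority, cpu) tuples. A pending pod may evict only
    running pods with strictly lower priority than its own.
    Returns (names still running, names evicted).
    """
    running = list(running)
    evicted = []
    used = sum(cpu for _, _, cpu in running)
    for name, prio, cpu in pending:
        # Candidate victims: strictly lower priority, lowest priority first.
        victims = sorted((p for p in running if p[1] < prio), key=lambda p: p[1])
        freeable = sum(p[2] for p in victims)
        if used + cpu - freeable > capacity:
            continue  # preemption would not free enough; pod stays pending
        while used + cpu > capacity and victims:
            victim = victims.pop(0)
            running.remove(victim)
            evicted.append(victim[0])
            used -= victim[2]
        running.append((name, prio, cpu))
        used += cpu
    return [p[0] for p in running], evicted


# Before the fix (hypothetical values): search runs at a lower priority.
before = schedule(
    pending=[("ingest-2", 100, 2)],
    running=[("search", 0, 2), ("ingest", 100, 2)],
    capacity=4,
)

# After the fix: search has the same priority as the other services.
after = schedule(
    pending=[("ingest-2", 100, 2)],
    running=[("search", 100, 2), ("ingest", 100, 2)],
    capacity=4,
)
print(before)
print(after)
```

In the first case the model evicts `search` to make room for `ingest-2`. In the second case nothing is evicted and `ingest-2` stays pending, which is the trade-off of equal priorities: new work waits for capacity instead of displacing search and Live Tail.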
What we are doing to prevent it from happening again:
We’ve already updated the configuration of our orchestration system to give services the correct priority. These changes are permanent and should prevent similar problems in the future.