Logs are not searchable in web app

Incident Report for Mezmo Status Page

Postmortem

Dates:
Start Time: Thursday, March 4, 2021, at ~03:45 UTC
End Time: Thursday, March 4, 2021, at ~08:20 UTC
Duration: ~4:36:00

What happened:

Our Web UI returned the error message "Request returned an error. Try again?" when users ran a search query or used Live Tail.

Why it happened:

The pods that run our searching and Live Tail services were automatically terminated by our Kubernetes orchestration system. Upon investigation, we discovered we had inadvertently classed these services as low priority. The incident occurred when a large number of other services that were classed as higher priority needed to run to meet usage demands. The orchestration system automatically terminated the lower priority services to make resources available for the higher priority services.
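
The preemption behavior described above can be illustrated with a small simulation. This is a simplified sketch, not the actual Kubernetes scheduler: the `Pod` class, the `schedule_with_preemption` function, the service names, and the single "CPU" resource are all hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass


@dataclass
class Pod:
    name: str      # assumed unique per pod
    priority: int  # higher value = more important
    cpu: int       # requested CPU units (illustrative only)


def schedule_with_preemption(running, incoming, capacity):
    """Try to admit `incoming` on a node with `capacity` CPU units.

    Returns (running_after, evicted). Only pods with strictly lower
    priority than `incoming` may be evicted. If enough room cannot be
    made, nothing is evicted and `incoming` is not admitted.
    """
    free = capacity - sum(p.cpu for p in running)
    if incoming.cpu <= free:
        return running + [incoming], []

    # Eviction candidates, lowest priority first.
    victims = sorted(
        (p for p in running if p.priority < incoming.priority),
        key=lambda p: p.priority,
    )
    evicted = []
    for victim in victims:
        if incoming.cpu <= free:
            break
        evicted.append(victim)
        free += victim.cpu

    if incoming.cpu > free:
        return list(running), []  # could not make room

    evicted_names = {p.name for p in evicted}
    remaining = [p for p in running if p.name not in evicted_names]
    return remaining + [incoming], evicted
```

In this toy model, a full node running a low-priority `search` pod loses that pod as soon as a higher-priority pod needs its resources, which mirrors the failure mode in this incident.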

More specifically, these pods were put into a “terminating” state. Normally this state is temporary: a transition between “running” and “terminated”. During this incident, the pods remained stuck in the “terminating” state indefinitely. Our monitoring detects services that have been “terminated”, but not ones that are in the normally temporary “terminating” state. Consequently, our infrastructure team was not notified.
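
One way to close this kind of monitoring gap is to flag pods that stay in the terminating state for too long. The sketch below is an assumption about how such a check could look, not a description of our monitoring. It works on pod objects shaped like the output of `kubectl get pods -o json` and relies on the `metadata.deletionTimestamp` field, which Kubernetes sets once deletion has been requested and which marks when the pod is expected to be gone. The function names and the five-minute threshold are hypothetical.

```python
from datetime import datetime, timedelta, timezone


def parse_ts(ts):
    """Parse a Kubernetes-style RFC 3339 UTC timestamp such as 2021-03-04T03:45:00Z."""
    return datetime.strptime(ts, "%Y-%m-%dT%H:%M:%SZ").replace(tzinfo=timezone.utc)


def stuck_terminating(pods, now, max_overrun=timedelta(minutes=5)):
    """Return names of pods still present well past their deletion deadline.

    `pods` is a list of dicts with a `metadata` mapping. A pod without
    `deletionTimestamp` is not being deleted and is skipped.
    """
    stuck = []
    for pod in pods:
        meta = pod.get("metadata", {})
        deadline = meta.get("deletionTimestamp")
        if deadline is None:
            continue  # not terminating
        if now - parse_ts(deadline) > max_overrun:
            stuck.append(meta.get("name", "<unnamed>"))
    return stuck
```

A check like this alerts on the transition state itself rather than waiting for a final “terminated” state that, as in this incident, may never arrive.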

How we fixed it:

We increased the priority of the pods that run our searching and Live Tail services to match the priority of other services. We updated the configuration of our orchestration system to make the change permanent.

What we are doing to prevent it from happening again:

We’ve already updated the configuration of our orchestration system to give services the correct priority. These changes are permanent and should prevent similar problems in the future.
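
A configuration check along the following lines could catch a mis-prioritized service before it causes an outage. This is a minimal sketch under stated assumptions: the `underprioritized` function, the service names, and the numeric priority values are hypothetical, and the rule it applies (a critical service should not rank below any other service) is one possible reading of "the correct priority".

```python
def underprioritized(priorities, critical):
    """Return critical services ranked below the highest non-critical priority.

    `priorities` maps service name to an integer priority (higher is more
    important). `critical` is a set of service names that must not be
    preempted by other services. A critical service missing from
    `priorities` is treated as having the lowest possible priority.
    """
    others = [p for name, p in priorities.items() if name not in critical]
    if not others:
        return []
    ceiling = max(others)
    return sorted(
        name for name in critical
        if priorities.get(name, float("-inf")) < ceiling
    )
```

Run as part of a deployment pipeline, a non-empty result would indicate that a user-facing service could be evicted in favor of other workloads.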

Posted Mar 25, 2021 - 17:32 UTC

Resolved

This incident has been resolved and logs are searchable in the web app. We'll continue to monitor all services.

Posted Mar 04, 2021 - 08:21 UTC

Monitoring

A fix has been implemented and we are monitoring the results.

Posted Mar 04, 2021 - 08:12 UTC

Investigating

We are investigating an issue that is rendering our log viewer unavailable.

Posted Mar 04, 2021 - 08:00 UTC

This incident affected: Log Analysis (Web App, Search, Livetail).