Intermittent user session timeouts, requiring periodic re-authentication
Incident Report for Mezmo Status Page
Postmortem

Dates: 
Start Time: Monday, December 4, 2023, at 10:29 UTC
End Time: Monday, December 4, 2023, at 12:01 UTC
Duration: 92 minutes

What happened:

Web UI users were logged out frequently, usually within 1-2 minutes of logging in. Users could log in again without any issues, but the new session would expire shortly afterwards.

Why it happened:

Both the Web UI pods and the Redis database pods, which are responsible for storing user sessions, experienced a critical memory shortage, leading to uncontrolled data purging. When the same issue occurred in July 2023, our engineering team deployed a fix that improved how Redis stores user session keys. That fix prevented any recurrence of the problem until this incident. The team is still determining what caused the memory limit to be exceeded this time.
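
As a rough illustration of why memory pressure in a session store shows up as early logouts, the toy model below caps a store by entry count and evicts the oldest entry when full. The class name, the entry-count cap standing in for a memory limit, and the oldest-first eviction order are all assumptions made for illustration; this report does not state how the Redis pods were configured or which keys were purged.

```python
from collections import OrderedDict


class BoundedSessionStore:
    """Toy session store (hypothetical): an entry-count cap stands in for a memory limit."""

    def __init__(self, max_entries):
        self.max_entries = max_entries
        self._sessions = OrderedDict()

    def put(self, session_id, data):
        # Insert or refresh the session, then evict oldest entries while over the cap.
        self._sessions[session_id] = data
        self._sessions.move_to_end(session_id)
        while len(self._sessions) > self.max_entries:
            self._sessions.popitem(last=False)  # assumed policy: drop the oldest entry

    def get(self, session_id):
        # Returns None when the session has been evicted, which a web app
        # would treat as "not logged in".
        return self._sessions.get(session_id)


store = BoundedSessionStore(max_entries=3)
store.put("alice", {"user": "alice"})
for i in range(3):
    store.put(f"other-{i}", {})

# The store is over capacity, so alice's session was purged even though
# she logged in only moments earlier.
alice_session = store.get("alice")
```

In this model a user can always log in again, because `put` succeeds, but the new session survives only until enough other writes arrive, which matches the symptom described above.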

How we fixed it:

Initially, the Web UI pods were restarted, but that did not resolve the problem permanently. The engineering team then restarted the Redis database pods, and sessions stopped expiring.

What we are doing to prevent it from happening again:

The team will revise the previous fix, including implementing a mechanism for the pod to restart automatically upon reaching its memory limit and setting up alerts to notify an engineer when usage is approaching that threshold.
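
The two planned safeguards can be sketched as a single threshold check. This is a minimal sketch under stated assumptions, not the team's implementation: the names `MemoryPolicy` and `evaluate` and the 80% alert ratio are hypothetical, and in practice the restart and alert would more likely be handled by the orchestration and monitoring systems than by application code.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class MemoryPolicy:
    """Hypothetical policy: a hard memory limit plus an early-warning ratio."""

    limit_bytes: int
    alert_ratio: float = 0.8  # assumed value; the report does not specify a threshold


def evaluate(used_bytes: int, policy: MemoryPolicy) -> str:
    """Return the action for the current memory usage: 'ok', 'alert', or 'restart'."""
    if used_bytes >= policy.limit_bytes:
        return "restart"  # limit reached: restart the pod automatically
    if used_bytes >= policy.alert_ratio * policy.limit_bytes:
        return "alert"  # approaching the limit: notify an engineer
    return "ok"


policy = MemoryPolicy(limit_bytes=1_000)
actions = [evaluate(used, policy) for used in (500, 850, 1_000)]
```

Checking the restart condition first ensures that a pod already at its limit is restarted rather than only raising an alert.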

Posted Dec 08, 2023 - 09:43 UTC

Resolved
The issue has been resolved, and no further issues have been observed with user sessions.
Posted Dec 04, 2023 - 13:19 UTC
Monitoring
We have implemented a fix for the user session timeouts on the Web UI, but will continue to monitor the situation closely.
Posted Dec 04, 2023 - 12:13 UTC
Investigating
The Web UI is currently encountering user session timeouts, prompting customers to log in every 1-2 minutes. Our team is actively investigating the root cause of this issue. The rest of the service remains fully functional.
Posted Dec 04, 2023 - 12:06 UTC
This incident affected: Log Analysis (Web App).