Dates:
Start Time: Monday, December 4, 2023, at 10:29 UTC
End Time: Monday, December 4, 2023, at 12:01 UTC
Duration: 92 minutes
What happened:
Web UI users were logged out frequently, usually within 1-2 minutes of logging in. Users could log in again without any issues, but the new session would expire shortly afterwards.
Why it happened:
We identified that both the Web UI pods and the Redis database pods, which are responsible for storing user sessions, experienced a critical memory shortage that led to uncontrolled data purging. When this same issue happened in July 2023, our engineering team deployed a fix that improved how Redis stores user session keys. That fix prevented any recurrence of the problem until this incident. The team is still determining what caused the memory limit to be exceeded this time.
How we fixed it:
Initially, the Web UI pods were restarted, but that did not resolve the problem permanently. The engineering team then restarted the Redis database pods, and sessions stopped expiring.
What we are doing to prevent it from happening again:
The team will revise the previous fix. Planned changes include a mechanism that automatically restarts the pod when it reaches its memory limit, and alerts that notify an engineer when memory usage approaches that threshold.
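
As a rough illustration of the alert-then-restart idea described above, the following Python sketch classifies a single memory reading. It is not the team's implementation: the function name, the `Action` values, and the 80% alert ratio are all assumptions chosen for the example. The readings could come from any source, for instance the `used_memory` and `maxmemory` fields that Redis reports in `INFO memory`.

```python
from enum import Enum


class Action(Enum):
    OK = "ok"            # usage is comfortably below the limit
    ALERT = "alert"      # usage is approaching the limit; notify an engineer
    RESTART = "restart"  # usage has reached the limit; restart the pod


def memory_action(used_bytes: int, limit_bytes: int, alert_ratio: float = 0.8) -> Action:
    """Decide what to do for one memory reading.

    alert_ratio is a hypothetical threshold (80% by default), not a value
    taken from the incident report.
    """
    if limit_bytes <= 0:
        raise ValueError("limit_bytes must be positive")
    if not 0 < alert_ratio < 1:
        raise ValueError("alert_ratio must be between 0 and 1")

    ratio = used_bytes / limit_bytes
    if ratio >= 1.0:
        return Action.RESTART
    if ratio >= alert_ratio:
        return Action.ALERT
    return Action.OK


if __name__ == "__main__":
    # Example readings against a 1 GiB limit.
    limit = 1024 ** 3
    for used in (200 * 1024 ** 2, 900 * 1024 ** 2, limit):
        print(used, memory_action(used, limit).value)
```

Keeping the alert threshold below the restart threshold gives an engineer a chance to investigate before the automatic restart ends active sessions.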