Start Time: Monday, May 1, 2023, at 19:55 UTC
End Time: Monday, May 1, 2023, at 20:11 UTC
Duration: 16 minutes
The WebUI was unresponsive, returning an error of “failure to get a peer from the ring-balancer.”
Why it happened:
All Mezmo services run within a service mesh. The portion of the mesh dedicated to the pods running our Mongo database began receiving many connection requests, more than its allocated CPU and memory could handle at once. This portion of the mesh (which itself runs on pods) quickly ran out of memory. This made the Mongo database unavailable to other services. The WebUI relies entirely on Mongo for account information and therefore became unresponsive, returning an error of “failure to get a peer from the ring-balancer.”
While the immediate cause of the incident is clear, the root cause is still unknown. We suspect that a change in usage patterns (e.g., increased traffic or login attempts) triggered the incident.
How we fixed it:
We removed the Mongo service from the service mesh. The Mongo service itself has more CPU and memory allocated to it than its portion of the mesh did, and it was able to accept the high volume of connection requests successfully. WebUI usage immediately returned to normal.
What we are doing to prevent it from happening again:
We will permanently change the default settings for the service mesh to allocate more CPU and memory. Afterwards, we will add the Mongo service back to the service mesh.