Start Time: Monday, November 8, 2021, at 23:28 UTC
End Time: Tuesday, November 9, 2021, at 00:16 UTC
Duration: 0:48:00
What happened:
Our Web UI returned the error message “This site can’t be reached” when users tried to log in or load pages. The ingestion of logs was unaffected.
Why it happened:
The network management software on the node running our web service failed, and the node became unreachable. In addition, the web service was running on only that single node, which is atypical: it normally runs on multiple nodes at once to improve performance and provide redundancy. Both conditions had to occur together for the Web UI to become unavailable.
How we fixed it:
We moved the web service to another node with functioning network management software, which made the Web UI available again. Later, we restarted the unreachable node, which restored it to normal usage.
What we are doing to prevent it from happening again:
We expect both contributing conditions, the network management software failure and the web service running on a single node, to be resolved by an already planned migration of our entire service to a new cloud-based environment.
We are also building availability monitoring for our Web UI so that we learn of any future failures as soon as possible.
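As an illustration only, an external availability probe for the Web UI could look roughly like the sketch below. It is not the monitoring we are building. The function names, the example URL, the "status below 500 counts as available" rule, and the three-consecutive-failures threshold are all assumptions made for this example.

```python
import urllib.error
import urllib.request


def is_available(url, timeout=5.0, opener=urllib.request.urlopen):
    """Return True if the URL answers with an HTTP status below 500.

    `opener` is injectable so the probe can be tested without a network.
    """
    try:
        with opener(url, timeout=timeout) as response:
            return response.status < 500
    except urllib.error.HTTPError as err:
        # The server answered, so it is reachable; a 5xx still counts as down.
        return err.code < 500
    except (urllib.error.URLError, OSError):
        # DNS failure, refused connection, or timeout: the site can't be reached.
        return False


def should_alert(results, threshold=3):
    """Return True if the most recent `threshold` probe results all failed."""
    recent = results[-threshold:]
    return len(recent) == threshold and not any(recent)


if __name__ == "__main__":
    # Hypothetical URL; replace with the real Web UI address.
    history = [is_available("https://webui.example.invalid/login")]
    print("available:", history[-1], "alert:", should_alert(history))
```

A probe like this would need to run from outside the nodes that host the web service, since a probe on the affected node would have become unreachable along with it. Requiring several consecutive failures before alerting is one way to avoid paging on a single dropped request.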