Dates:
Start Time: Tuesday, April 5, 2022, at 13:20:00 UTC
End Time: Tuesday, April 5, 2022, at 18:20:00 UTC
Duration: 5:00:00
What happened:
Alerting was halted for all accounts for the entire duration of the incident. Most alerts – any whose trigger date was more than 15 minutes in the past – were discarded.
Why it happened:
We restarted our parser service, for reasons unrelated to this incident. Any restart of the parser service should be followed by a restart of the alerting service. This second step was overlooked and didn’t happen. Subsequently, all alerts stopped triggering.
The need to restart alerting after a restart of the parser is already documented and well-known to our infrastructure team. However, the restart of the parser was performed by a team less familiar with the correct procedure.
How we fixed it:
We manually restarted the alerting service, which then returned to normal operation.
What we are doing to prevent it from happening again:
The proper documented restart procedure has been discussed with all teams allowed to restart services.
We will add monitoring of our alerting service and automated notifications so we learn more quickly of any similar incidents in the future.