Alerts not triggering
Incident Report for Mezmo Status Page
Postmortem

Dates:
Start Time: Tuesday, April 5, 2022, at 13:20:00 UTC
End Time: Tuesday, April 5, 2022, at 18:20:00 UTC
Duration: 5:00:00

What happened:
Alerting was halted for all accounts for the entire duration of the incident. Most alerts – any whose trigger date was more than 15 minutes in the past – were discarded.
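
As a rough illustration of the discard rule described above, the sketch below drops any alert whose trigger time is more than 15 minutes old at the moment it is evaluated. The function name, the parameters, and the choice to measure age against the evaluation time are assumptions made for illustration. They do not describe the actual alerting code.

```python
from datetime import datetime, timedelta, timezone

# The 15-minute cutoff comes from the report; everything else is hypothetical.
MAX_ALERT_AGE = timedelta(minutes=15)


def should_discard(trigger_time: datetime, now: datetime,
                   max_age: timedelta = MAX_ALERT_AGE) -> bool:
    """Return True if the alert is too old to be delivered."""
    return now - trigger_time > max_age


if __name__ == "__main__":
    now = datetime(2022, 4, 5, 18, 20, tzinfo=timezone.utc)
    stale = now - timedelta(hours=2)
    recent = now - timedelta(minutes=5)
    print(should_discard(stale, now))
    print(should_discard(recent, now))
```

Under this assumed rule, only alerts triggered within the final 15 minutes before evaluation would still be delivered.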

Why it happened:
We restarted our parser service for reasons unrelated to this incident. Any restart of the parser service should be followed by a restart of the alerting service. This second step was overlooked, and as a result all alerts stopped triggering.

The need to restart alerting after a restart of the parser is already documented and well-known to our infrastructure team. However, the restart of the parser was performed by a team less familiar with the correct procedure.
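
One way to make a restart rule like this harder to overlook is to encode it in tooling rather than relying on documentation alone. The sketch below is a minimal, hypothetical example. The dependency map, the function names, and the `restart` callable are all assumptions, not a description of our actual tooling.

```python
from typing import Callable, Dict, List

# Hypothetical map: after restarting a key, restart each listed service too.
RESTART_DEPENDENTS: Dict[str, List[str]] = {
    "parser": ["alerting"],
}


def restart_plan(service: str,
                 dependents: Dict[str, List[str]] = RESTART_DEPENDENTS) -> List[str]:
    """Return the ordered list of services to restart, starting with `service`."""
    plan: List[str] = []
    queue = [service]
    while queue:
        current = queue.pop(0)
        if current in plan:
            continue  # already scheduled; also guards against cycles
        plan.append(current)
        queue.extend(dependents.get(current, []))
    return plan


def restart_with_dependents(service: str,
                            restart: Callable[[str], None],
                            dependents: Dict[str, List[str]] = RESTART_DEPENDENTS) -> List[str]:
    """Restart `service`, then every dependent service, in order."""
    plan = restart_plan(service, dependents)
    for name in plan:
        restart(name)  # `restart` is supplied by the caller (hypothetical)
    return plan


if __name__ == "__main__":
    restart_with_dependents("parser", lambda name: print(f"restarting {name}"))
```

With a wrapper of this kind, whoever restarts the parser does not need to remember the follow-up step, because the alerting restart is part of the same operation.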

How we fixed it:
We manually restarted the alerting service, which then returned to normal operation.

What we are doing to prevent it from happening again:
We have reviewed the documented restart procedure with all teams that are allowed to restart services.

We will also add monitoring and automated notifications for our alerting service so that we learn of any similar incident more quickly.
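
A minimal sketch of what such monitoring could look like is shown below: a staleness check that sends a notification when the alerting service has not reported activity within a threshold. The heartbeat concept, the 10-minute threshold, and the `notify` callable are illustrative assumptions, not details of the planned implementation.

```python
from datetime import datetime, timedelta, timezone
from typing import Callable, Optional

# Assumed threshold; a real value would depend on normal alerting cadence.
MAX_SILENCE = timedelta(minutes=10)


def alerting_is_stalled(last_heartbeat: Optional[datetime], now: datetime,
                        max_silence: timedelta = MAX_SILENCE) -> bool:
    """Return True if no heartbeat exists or the last one is too old."""
    if last_heartbeat is None:
        return True
    return now - last_heartbeat > max_silence


def check_and_notify(last_heartbeat: Optional[datetime], now: datetime,
                     notify: Callable[[str], None],
                     max_silence: timedelta = MAX_SILENCE) -> bool:
    """Call `notify` (hypothetical pager or chat hook) when alerting looks stalled."""
    stalled = alerting_is_stalled(last_heartbeat, now, max_silence)
    if stalled:
        if last_heartbeat is None:
            detail = "no heartbeat recorded"
        else:
            detail = f"last heartbeat at {last_heartbeat.isoformat()}"
        notify(f"Alerting service appears stalled: {detail}")
    return stalled


if __name__ == "__main__":
    last = datetime(2022, 4, 5, 13, 20, tzinfo=timezone.utc)
    check_and_notify(last, last + timedelta(minutes=15), print)
```

The check is intended to run on a schedule from outside the alerting service, so that it keeps working when alerting itself is down.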

Posted Apr 06, 2022 - 17:58 UTC

Resolved
Alerts of all types have resumed. Alerts are fully operational.
Posted Apr 05, 2022 - 18:32 UTC

Investigating
Currently, alerts of all types are not triggering. We are taking remedial action.
Posted Apr 05, 2022 - 18:23 UTC

This incident affected: Log Analysis (Alerting).