Prometheus Alertmanager and Incident Keys
I run a centralized Alertmanager service for about 100 individual software development and operations teams. Normally each team has its own Prometheus VM(s), and sometimes a dedicated Prometheus VM or two will be created for a specific big or busy service. A single pair of Alertmanager instances scales very nicely at this load, with hundreds of alerts firing.
However, after a specific configuration change, I started seeing really odd behavior: I would frequently get two identical pages for the same alert, and some other teams reported the same thing. Targeted testing could not reproduce the issue, and not all teams seemed affected. This stumped us for a while; we couldn’t see how the configuration change could be related to this odd behavior.
Looking carefully at the PagerDuty alerts, they were clearly duplicate alerts for the same issue, but the incident_key value was different. PagerDuty uses this key for incident uniqueness. Alertmanager somehow always generates the same incident key value for a given alert group, even if Alertmanager clustering is broken; the worst case for Alertmanager is that it updates the same PagerDuty incident multiple times. Of course, the incident_key looks like this:
efa811541777ecf7eff9ba5992090f99decd355dbc4c4213893de0040b737bd1
You guessed it! That’s a SHA-256 hash, which doesn’t shed a lot of light on the situation. In fact, in notify/impl.go you can find the hashKey() function that’s used to obscure the incident key for the PagerDuty API. It’s nothing special, but it keeps the incident key value relatively short, unique, and free of strange characters.
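To make that concrete, here is a minimal Go sketch of the idea: hex-encode a SHA-256 of the plain-text key. It illustrates the pattern and is not a copy of the upstream hashKey() function.

package main

import (
	"crypto/sha256"
	"fmt"
)

// hashKey sketches the idea: hex-encode the SHA-256 of the plain-text group
// key so PagerDuty gets something short, unique, and free of odd characters.
func hashKey(s string) string {
	return fmt.Sprintf("%x", sha256.Sum256([]byte(s)))
}

func main() {
	// A shortened, made-up plain-text key purely for illustration.
	fmt.Println(hashKey(`{}/{}:{alertname="HostDown", severity="page"}`))
}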
The real magic started happening as I read the Notify() function in the same file. It pulls the incident key out of the Go context variable (yay, more indirection), but then, lo, it logs the value of the plain-text key! So a quick grep through my logs on different Alertmanager VMs showed me interesting things:
# grep "Notifying PagerDuty" /var/log/upstart/prometheus-alertmanager.log|grep HostDown
level=debug ts=2019-06-10T18:24:23.068128172Z caller=impl.go:562 msg="Notifying PagerDuty" incident="{}/{}/{monitor=\"core-ux\"}/{environment=~\"^(?:|prod)$\",severity=~\"^(?:page)$\"}:{alertname=\"HostDown\", severity=\"page\"}" eventType=trigger
On a second VM:
# journalctl -u alertmanager|grep "Notifying PagerDuty" |grep HostDown
Jun 10 18:24:13 alertmanager-prod-1.example.com docker[479388]: level=debug ts=2019-06-10T18:24:13.026543305Z caller=impl.go:562 msg="Notifying PagerDuty" incident="{}/{}/{monitor=\"core-ux\"}/{environment=~\"^(?:|prod|glob)$\",severity=~\"^(?:page)$\"}:{alertname=\"HostDown\", severity=\"page\"}" eventType=trigger
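As an aside, the context plumbing mentioned above is just Go’s standard context.WithValue / ctx.Value pattern. A hedged sketch of that pattern follows; the key type and helper names here are hypothetical and not Alertmanager’s actual API.

package main

import (
	"context"
	"log"
)

// groupKeyContextKey is a hypothetical unexported key type for context values.
type groupKeyContextKey struct{}

// withGroupKey stashes the plain-text group key in the context.
func withGroupKey(ctx context.Context, key string) context.Context {
	return context.WithValue(ctx, groupKeyContextKey{}, key)
}

// notify pulls the key back out of the context and logs it, which is the
// behaviour that made the log grep above possible.
func notify(ctx context.Context) {
	if key, ok := ctx.Value(groupKeyContextKey{}).(string); ok {
		log.Printf("Notifying PagerDuty incident=%q", key)
	}
}

func main() {
	ctx := withGroupKey(context.Background(),
		`{}/{}/{monitor="core-ux"}:{alertname="HostDown", severity="page"}`)
	notify(ctx)
}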
Due to the size of the fleet and the number of teams, the routing for Alertmanager is more complex than I’d like to admit. However, this reveals several really interesting things about alert routing in Alertmanager:
- Alertmanager has a group_by parameter that sets which labels “group” alerts. This gives us one page per event type, rather than one page per VM. The group_by labels here are alertname, environment, and severity. What’s really interesting is to see how this forms a unique incident_key.
- Not only are the group_by labels used to form the incident_key, but so is the path in the routing tree! I have nodes several layers deep with regular expression matching.
- I had a different regular expression match on some of my Alertmanagers! Compare the environment matcher in the two log lines above: one route matches ^(?:|prod)$ and the other ^(?:|prod|glob)$. It was this that caused different incident keys to be generated and duplicate alerts to fire (see the sketch after this list).
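Here is the sketch referenced above: a hedged illustration, not Alertmanager’s actual implementation, of why a routing-tree difference changes the incident key even when the alert group is identical. The two route strings are taken from the log lines earlier in this post.

package main

import (
	"crypto/sha256"
	"fmt"
)

// incidentKey models the observed behaviour: the plain-text key is the route
// path plus the group labels, and the PagerDuty incident_key is its SHA-256.
func incidentKey(routePath, groupLabels string) string {
	return fmt.Sprintf("%x", sha256.Sum256([]byte(routePath+":"+groupLabels)))
}

func main() {
	group := `{alertname="HostDown", severity="page"}`

	// Route path as logged on the first Alertmanager.
	a := `{}/{}/{monitor="core-ux"}/{environment=~"^(?:|prod)$",severity=~"^(?:page)$"}`
	// Route path on the second Alertmanager, with the extra "glob" alternative.
	b := `{}/{}/{monitor="core-ux"}/{environment=~"^(?:|prod|glob)$",severity=~"^(?:page)$"}`

	// Same alert group, different routing regex: two different incident keys,
	// so PagerDuty opens two incidents instead of de-duplicating into one.
	fmt.Println(incidentKey(a, group))
	fmt.Println(incidentKey(b, group))
}

PagerDuty de-duplicates on incident_key, so two different hashes mean two separate incidents, and that is the duplicate page.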