Prometheus Alertmanager and Incident Keys
I run a centralized Alertmanager service for about 100 individual software development and operations teams. Normally each team has its own Prometheus VM(s), and sometimes a dedicated Prometheus VM or two will be created for a specific big or busy service. A single pair of Alertmanager instances scales very nicely at this load, with hundreds of alerts firing.
However, after a specific configuration change, I started seeing really odd behavior: I would frequently get two identical pages for the same alert, and some other teams reported the same thing. Targeted testing could not reproduce the issue, and not all teams seemed affected. This stumped us for a while; we couldn’t see how the configuration change could be related to this odd behavior.
Looking carefully at the PagerDuty alerts, they were clearly duplicate alerts for the same issue, but the incident_key value was different. PagerDuty uses this key for incident uniqueness. Alertmanager somehow always generates the same incident key value for a given alert group, even if Alertmanager clustering is broken; the worst case for Alertmanager is that it updates the same PagerDuty incident multiple times. Of course, the incident_key looks like this:
efa811541777ecf7eff9ba5992090f99decd355dbc4c4213893de0040b737bd1
You guessed it! That’s a SHA-256 hash, which doesn’t shed a lot of light on the situation. In fact, in notify/impl.go you can find the hashKey() function that’s used to obscure the incident key for the PagerDuty API. It’s nothing special, but it keeps the incident key value relatively short, unique, and free of strange characters.
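To make that concrete, here is a minimal Go sketch of the idea: hex-encode a SHA-256 of the plain-text key. It illustrates the pattern and is not a copy of the upstream hashKey() function.

package main

import (
	"crypto/sha256"
	"fmt"
)

// hashKey sketches the idea: hex-encode the SHA-256 of the plain-text group
// key so PagerDuty gets something short, unique, and free of odd characters.
func hashKey(s string) string {
	return fmt.Sprintf("%x", sha256.Sum256([]byte(s)))
}

func main() {
	// A shortened, made-up plain-text key purely for illustration.
	fmt.Println(hashKey(`{}/{}:{alertname="HostDown", severity="page"}`))
}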
The real magic started happening as I read the Notify() function in the same file. It pulls the incident key out of the Go context variable (yay, more indirection), but then, lo, it logs the value of the plain-text key! So a quick grep through my logs on different Alertmanager VMs showed me interesting things:
# grep "Notifying PagerDuty" /var/log/upstart/prometheus-alertmanager.log|grep HostDown
level=debug ts=2019-06-10T18:24:23.068128172Z caller=impl.go:562 msg="Notifying PagerDuty" incident="{}/{}/{monitor=\"core-ux\"}/{environment=~\"^(?:|prod)$\",severity=~\"^(?:page)$\"}:{alertname=\"HostDown\", severity=\"page\"}" eventType=trigger
On a second VM:
# journalctl -u alertmanager|grep "Notifying PagerDuty" |grep HostDown
Jun 10 18:24:13 alertmanager-prod-1.example.com docker[479388]: level=debug ts=2019-06-10T18:24:13.026543305Z caller=impl.go:562 msg="Notifying PagerDuty" incident="{}/{}/{monitor=\"core-ux\"}/{environment=~\"^(?:|prod|glob)$\",severity=~\"^(?:page)$\"}:{alertname=\"HostDown\", severity=\"page\"}" eventType=trigger
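As an aside, the context plumbing mentioned above is just Go’s standard context.WithValue / ctx.Value pattern. A hedged sketch of that pattern follows; the key type and helper names here are hypothetical and not Alertmanager’s actual API.

package main

import (
	"context"
	"log"
)

// groupKeyContextKey is a hypothetical unexported key type for context values.
type groupKeyContextKey struct{}

// withGroupKey stashes the plain-text group key in the context.
func withGroupKey(ctx context.Context, key string) context.Context {
	return context.WithValue(ctx, groupKeyContextKey{}, key)
}

// notify pulls the key back out of the context and logs it, which is the
// behaviour that made the log grep above possible.
func notify(ctx context.Context) {
	if key, ok := ctx.Value(groupKeyContextKey{}).(string); ok {
		log.Printf("Notifying PagerDuty incident=%q", key)
	}
}

func main() {
	ctx := withGroupKey(context.Background(),
		`{}/{}/{monitor="core-ux"}:{alertname="HostDown", severity="page"}`)
	notify(ctx)
}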
Due to the size of the fleet and the number of teams, the routing for Alertmanager is more complex than I’d like to admit. However, this reveals several really interesting things about alert routing in Alertmanager:
- Alertmanager has a group_by parameter that sets which labels “group” alerts. This gives us one page per event type, rather than one page per VM. The group_by labels here are alertname, environment, and severity. What’s really interesting is to see how this forms a unique incident_key.
- Not only are the group_by labels used to form the incident_key, but so is the path in the routing tree! I have nodes several layers deep with regular expression matching.
- I had a different regular expression match on some of my Alertmanagers! Compare the environment matcher in the two log lines above: one route matches ^(?:|prod)$ and the other ^(?:|prod|glob)$. It was this that caused different incident keys to be generated and duplicate alerts to fire (see the sketch after this list).
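Here is the sketch referenced above: a hedged illustration, not Alertmanager’s actual implementation, of why a routing-tree difference changes the incident key even when the alert group is identical. The two route strings are taken from the log lines earlier in this post.

package main

import (
	"crypto/sha256"
	"fmt"
)

// incidentKey models the observed behaviour: the plain-text key is the route
// path plus the group labels, and the PagerDuty incident_key is its SHA-256.
func incidentKey(routePath, groupLabels string) string {
	return fmt.Sprintf("%x", sha256.Sum256([]byte(routePath+":"+groupLabels)))
}

func main() {
	group := `{alertname="HostDown", severity="page"}`

	// Route path as logged on the first Alertmanager.
	a := `{}/{}/{monitor="core-ux"}/{environment=~"^(?:|prod)$",severity=~"^(?:page)$"}`
	// Route path on the second Alertmanager, with the extra "glob" alternative.
	b := `{}/{}/{monitor="core-ux"}/{environment=~"^(?:|prod|glob)$",severity=~"^(?:page)$"}`

	// Same alert group, different routing regex: two different incident keys,
	// so PagerDuty opens two incidents instead of de-duplicating into one.
	fmt.Println(incidentKey(a, group))
	fmt.Println(incidentKey(b, group))
}

PagerDuty de-duplicates on incident_key, so two different hashes mean two separate incidents, and that is the duplicate page.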