I’ve been researching quite a few algorithms for my client, Bruce, as I continue to scale out certain systems. I thought that getting them on my blog would be very useful for a future version of myself and many others. I suspect, and hope, that most folks that work in Systems Administration / Operations will find these at least familiar.

Flap Detection

Flap detection is the ability to determine rapid state changes so that one can take corrective action. The text book example is BGP route flapping where a new route is advertised and then withdrawn (a flap) multiple times in a short period. In general, this “scores” an event with higher scores the more frequent the event.

Warning: Math

For each event, a “penalty” $P$, “timestamp” $t$, and a boolean “isFlapping” variable is stored. $P$ is the current “score” for this event and is $0$ for an event that has not happened. $t$ is the timestamp, usually Unix Epoch time, of the last event.

Each time an event occurs we add a penalty value to the current penalty. The penalty value decays exponentially over time. A suppress limit is known which defines if the event is flapping when the event’s penalty is greater than the suppress limit. A lower reuse limit is known when the event is considered no longer flapping once the penalty value is less than the reuse limit.


Penalty$P$An event's current score.
Half-Life$h$How much time for half of the penalty to decay.
Timestamp$t$Unix Epoch time of last event.
Supress LimitPenalty > Suppress Limit == Flapping
Reuse LimitPenalty < Reuse Limit && Flapping == Recovery

Calculating for $P$

$$P(t_2) = P(t_1) \times e^{-(t_2 - t_1) \times ln(2) / h}$$

In Pictures


f(x) = 2.71828 ** (-x*log(2) / 60)
set ylabel "Penalty"
set xlabel "Time"
set key off
set terminal png size 800,600
set output 'output.png'
set label "Reuse Limit" at 40,11
set label "Suppress Limit" at 100,26
plot [0:240] x < 10 ? f(x) : x < 20 ? f(x-10)*10 : x < 30 ? f(x-20)*20 : f(x-30)*30, 10 with dots, 25 with dots

Previous   Up   Next

comments powered by Disqus