An alternative to `keep_firing_for` that takes the up-ness of metrics/targets into account #14085
Have you considered #14061? If the underlying problem is that metrics might not have arrived on time for a brief period, a rule evaluation offset could be a solution, at the trade-off of slightly slower detection.
@gotjosh So even if one shifted the evaluation into the past (which I actually do in some alerts, but for other reasons), the data would still not be there. The problem here, IMO, really is that Prometheus should provide, for that specific kind of flapping, a smarter way of handling it, i.e. that automatic prolongation of the alert, but unlike the current […]
So yeah, this is not solvable by delayed rule evaluation. The data is actually missing.

However, the proposed solution will be tricky to implement cleanly. As already considered above, an alert expression could touch metrics from different targets. But it's even worse: they could be based on other (recording) rules, which are not associated with any target at all. And even what we call "associated with a target" is not a first-class concept in Prometheus. There are target labels, of course, and it's generally a good heuristic to assume that an […]

In general, I would say that "scrape failures happen now and then" is something that should be dealt with reasonably by Prometheus. But a case where it's normal that scrape failures happen frequently might be rare enough to address it with a work-around rather than a first-class feature.

Having said that, intermittent scrape failures are a problem for other types of alerts, too, see https://www.robustperception.io/alerting-on-gauges-in-prometheus-2-0/ . In a similar spirit, you could craft your alerting expression to be robust against scrape failures, using […]

If that doesn't help either, you might be able to hack this in the Alertmanager. If an alert goes away and then comes back, it is seen as "new" by the Alertmanager, so it re-notifies when […]
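To make the "robust expression" idea from the comment above concrete, here is a sketch (my own, with assumed metric names and thresholds, not taken from the thread) that smooths the condition over a window with `max_over_time`, so a couple of failed scrapes don't make the alert resolve:

```yaml
groups:
  - name: example
    rules:
      - alert: RootFsAlmostFull
        # Look back over the last 5m instead of only at the latest sample:
        # as long as any sample in the window was above 80%, the expression
        # still returns a result, bridging short scrape gaps.
        expr: |
          max_over_time(
            (
              100 * (
                1 - node_filesystem_avail_bytes{mountpoint="/"}
                  / node_filesystem_size_bytes{mountpoint="/"}
              )
            )[5m:]
          ) > 80
        for: 10m
```

The trade-off is that resolution is delayed by up to the window length, similar in spirit to `keep_firing_for`.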
Proposal
Hey there.
(Spoiler: IMO the current `keep_firing_for` is not a proper solution for what I'm dealing with.)

This originates from a discussion at: https://groups.google.com/g/prometheus-users/c/Ihh1Dk0puQw
The problem I have is that some of the nodes I monitor are quite unstable, i.e. scraping fails every now and then (or even much more often than that) for anything from a few seconds (I have a scrape interval of 10s) up to 1.5 min.
What happens then is that any alerts that were firing stop doing so, as, for Prometheus, the node has basically disappeared and the alert's query no longer returns a result for it.
Once the flapping stops, the alert pops back up.
Since Prometheus wrongly considered the disappearing of the metrics used by the alert rule as "the alert condition no longer exists", it of course sends a fresh notification (like an email), even if the repeat interval hasn't been reached yet.
Now if that flapping happens every 10 mins, things get quite noisy.
(Oh, and I cannot fix the flapping nodes; some because I just monitor them and they're otherwise out of my control, some because I actually fail to do so ^^ see [0].)
Asking the community, I was told about `keep_firing_for` (well, actually I already knew about it when writing my post, but was already not happy with it).

I think there are at least two kinds of flapping (copy and pasting from my Google Groups post). For example, assume an alert that monitors the utilised disk space on the root fs and fires whenever that's above 80%.

Type 1 flapping: the alert condition itself comes and goes, e.g. the utilisation oscillates around the 80% threshold (while `up` is 1 all the time).

Type 2 flapping: the metrics disappear because scraping the target fails intermittently, as described above.
As I understand `keep_firing_for`, it will be (mostly[1]) perfect for type 1: I'd set `keep_firing_for` to whatever I consider a reasonable period during which the alert condition is still considered to be ongoing rather than being a new one. Say 10 mins. If it then clears for 9 mins but comes back shortly and then clears again... it will still be like one continuous alert. If it clears for 10 mins (of course assuming the evaluation interval fits), a new alert condition will be considered a fresh alert.
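As a concrete rule, the type 1 setup described above might look like this (a sketch; the metric names and numbers are illustrative, not from the original post):

```yaml
groups:
  - name: example
    rules:
      - alert: RootFsAlmostFull
        expr: |
          100 * (
            1 - node_filesystem_avail_bytes{mountpoint="/"}
              / node_filesystem_size_bytes{mountpoint="/"}
          ) > 80
        # If the condition clears but re-appears within 10m, the alert is
        # treated as one continuous firing, so no fresh notification is sent.
        keep_firing_for: 10m
```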
Now, what are the general downsides of `keep_firing_for` when using it for type 2 flapping (not for type 1; at least for that, there's no real way around these downsides)? […]

And if one introduces a new use case, namely downtimes, the "flapping" cannot really be cured with `keep_firing_for` anymore. Consider a reboot with firmware updates, which may take quite a while; the same would happen as in my type 2 flapping scenario.
Scrapes would fail, the alert would clear, scrapes would come back, notifications would be resent. And I think it's not reasonable to set `keep_firing_for` longer than perhaps 1 h or so.
Sure, one can manually silence the alerts ahead of a downtime, but why put that burden on the admin?
So what could be done better for Type 2 flapping and for downtimes?
Idea:
I think the above issues might be fixed by an alternative `keep_firing_for_with_considering_up` that takes into account whether the metrics (well, actually the targets) used by an alert query are all `up` or not.

If an alert for which `keep_firing_for_with_considering_up` is set is firing, and then suddenly the alert condition is gone, it would continue to fire as long as any of the queried targets/metrics are not up, and as long as the time span hasn't been exceeded (there should probably be an infinite value). Only once the metrics come back up would the `keep_firing_for_with_considering_up` prolongation stop (though the alert may of course still fire, if it was still firing with the up metrics).

In simple words: a possibly infinite version of `keep_firing_for` that takes `up` into account, i.e. whether an alert likely just "vanished" temporarily because the scraping failed.

I would think that the current `keep_firing_for` is still needed, especially for use cases like type 1 flapping, where my `keep_firing_for_with_considering_up` would fail if both types of flapping happen at the same time.

What might be tricky in implementing a `keep_firing_for_with_considering_up` is to determine which metrics/targets need to be taken into account. For a simple alert, like one that checks for free space on the root fs, like:
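For example (metric names and the 20%-free threshold are illustrative assumptions):

```yaml
- alert: RootFsAlmostFull
  # Fires while less than 20% of the root filesystem is available.
  expr: |
    node_filesystem_avail_bytes{mountpoint="/"}
      / node_filesystem_size_bytes{mountpoint="/"} < 0.2
```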
things are probably clear: if the alert fires for foo.example.com, and then suddenly no longer fires while the node_* metrics for foo.example.com are down, then it's probably just no longer firing because of the target being down, so keep it artificially going.

But what if an alert mixes metrics from different instances? Simply saying "if any of them fails, do the `keep_firing_for_with_considering_up`" may be too simple, because there could be alerts like (pseudo code):
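For example, something along these lines (entirely hypothetical metric names):

```yaml
# Hypothetical mixed-instance alert: fires only while foo on node-1 is
# non-zero AND bar on node-2 exceeds a limit. If foo on node-1 drops to 0,
# the whole expression stops returning a result, no matter whether node-2
# is up or down.
expr: |
  foo{instance="node-1"} > 0
    and on() bar{instance="node-2"} > 100
```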
where the alert should stop firing if the metric on node-1 becomes 0 (and that node is up), even if node-2 is currently down, because it's clear anyway that the overall condition would no longer produce a result.
Thanks,
Chris.
[0] https://groups.google.com/g/prometheus-users/c/Ihh1Dk0puQw/m/cNaVrEjqAQAJ (at the very bottom)
[1] #14084