An alternative to `keep_firing_for` that takes the up-ness of metrics/targets into account #14085
Have you considered #14061? If the underlying problem is that metrics might not have arrived on time for a brief period, a rule evaluation offset could be a solution, at the trade-off of slightly slower detection.
@gotjosh So even if one shifted the evaluation into the past (which I actually do in some alerts, but for other reasons), the data would still not be there. The problem here, IMO, really is that Prometheus should provide, for that specific kind of flapping, a smarter way of handling it, i.e. that automatic prolongation of the alert, but unlike the current […]
So yeah, this is not solvable by delayed rule evaluation. The data is actually missing.

However, the proposed solution will be tricky to implement cleanly. As already considered above, an alert expression could touch metrics from different targets. But it's even worse: they could be based on other (recording) rules, which are not associated with any target at all. And even what we call "associated with a target" is not a first-class concept in Prometheus. There are target labels, of course, and it's generally a good heuristic to assume that an […]

In general, I would say that "scrape failures happen now and then" is something that should be dealt with reasonably by Prometheus. But a case where it's normal that scrape failures happen frequently might be rare enough to address it with a work-around rather than a first-class feature.

Having said that, intermittent scrape failures are a problem for other types of alerts, too, see https://www.robustperception.io/alerting-on-gauges-in-prometheus-2-0/ . In a similar spirit, you could craft your alerting expression to be robust against scrape failures, using […]

If that doesn't help either, you might be able to hack this in the Alertmanager. If an alert goes away and then comes back, it is seen as "new" by the Alertmanager, so it re-notifies when […]
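To make the "robust expression" idea from the comment above concrete, here is a sketch (my own, with assumed metric names and thresholds, not taken from the thread) that smooths the condition over a window with `max_over_time`, so a couple of failed scrapes don't make the alert resolve:

```yaml
groups:
  - name: example
    rules:
      - alert: RootFsAlmostFull
        # Look back over the last 5m instead of only at the latest sample:
        # as long as any sample in the window was above 80%, the expression
        # still returns a result, bridging short scrape gaps.
        expr: |
          max_over_time(
            (
              100 * (
                1 - node_filesystem_avail_bytes{mountpoint="/"}
                  / node_filesystem_size_bytes{mountpoint="/"}
              )
            )[5m:]
          ) > 80
        for: 10m
```

The trade-off is that resolution is delayed by up to the window length, similar in spirit to `keep_firing_for`.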
Proposal
Hey there.
(Spoiler: IMO the current `keep_firing_for` is not a proper solution for what I'm dealing with.)

This originates from a discussion at: https://groups.google.com/g/prometheus-users/c/Ihh1Dk0puQw
The problem I have is that some of the nodes I monitor are quite unstable, i.e. scraping fails every now and then (or even much more often than that) for anything from a few seconds (I have a scrape interval of 10s) up to 1.5 min.
What happens then is that any alerts that were firing stop doing so, as, for Prometheus, the node has basically disappeared and the alert's query no longer returns a result for it.
Once the flapping stops, the alert pops back up.
Since Prometheus wrongly considered the disappearing of the metrics used by the alert rule as "the alert condition no longer exists", it of course sends a fresh notification (like an email), even if the repeat interval hasn't been reached yet.
Now if that flapping happens every 10 mins, things get quite noisy.
(Oh, and I cannot fix the flapping nodes; some because I just monitor them and they're otherwise out of my control, some because I actually fail to do so ^^ see [0].)
Asking the community, I was told about `keep_firing_for` (well, actually I already knew about it when writing my post, but was already not happy with it).

I think there are at least two kinds of flapping (copy and pasting from my Google Groups post). For example, assume an alert that monitors the utilised disk space on the root fs and fires whenever that's above 80%.

Type 1 flapping: the alert condition itself comes and goes, e.g. the utilisation oscillates around the 80% threshold (while `up` is 1 all the time).

Type 2 flapping: the metrics disappear because scraping the target fails intermittently, as described above.
As I understand `keep_firing_for`, it will be (mostly[1]) perfect for type 1: I'd set `keep_firing_for` to whatever I consider a reasonable period during which the alert condition is still considered to be ongoing rather than being a new one. Say 10 mins. If it then clears for 9 mins but comes back shortly and then clears again... it will still be like one continuous alert. If it clears for 10 mins (of course assuming the evaluation interval fits), a new alert condition will be considered a fresh alert.
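As a concrete rule, the type 1 setup described above might look like this (a sketch; the metric names and numbers are illustrative, not from the original post):

```yaml
groups:
  - name: example
    rules:
      - alert: RootFsAlmostFull
        expr: |
          100 * (
            1 - node_filesystem_avail_bytes{mountpoint="/"}
              / node_filesystem_size_bytes{mountpoint="/"}
          ) > 80
        # If the condition clears but re-appears within 10m, the alert is
        # treated as one continuous firing, so no fresh notification is sent.
        keep_firing_for: 10m
```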
Now, what are the general downsides of `keep_firing_for` when using it for type 2 flapping (not for type 1; at least for that, there's no real way around these downsides)? […]

And if one introduces a new use case, namely downtimes, the "flapping" cannot really be cured with `keep_firing_for` anymore. Consider a reboot with firmware updates, which may take quite a while; the same would happen as in my type 2 flapping scenario.
Scrapes would fail, the alert would clear, scrapes would come back, notifications would be resent. And I think it's not reasonable to set `keep_firing_for` longer than perhaps 1 h or so.
Sure, one can manually silence the alerts ahead of a downtime, but why put that burden on the admin?
So what could be done better for Type 2 flapping and for downtimes?
Idea:
I think the above issues might be fixed by an alternative `keep_firing_for_with_considering_up` that takes into account whether the metrics (well, actually the targets) used by an alert query are all `up` or not.

If an alert for which `keep_firing_for_with_considering_up` is set is firing, and then suddenly the alert condition is gone, it would continue to fire as long as any of the queried targets/metrics are not up, and as long as the time span hasn't been exceeded (there should probably be an infinite value). Only once the metrics come back up would the `keep_firing_for_with_considering_up` prolongation stop (though the alert may of course still fire, if it was still firing with the up metrics).

In simple words: a possibly infinite version of `keep_firing_for` that takes `up` into account, i.e. whether an alert likely just "vanished" temporarily because the scraping failed.

I would think that the current `keep_firing_for` is still needed, especially for use cases like type 1 flapping, where my `keep_firing_for_with_considering_up` would fail if both types of flapping happen at the same time.

What might be tricky in implementing a `keep_firing_for_with_considering_up` is to determine which metrics/targets need to be taken into account. For a simple alert, like one that checks for free space on the root fs, like:
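For example (metric names and the 20%-free threshold are illustrative assumptions):

```yaml
- alert: RootFsAlmostFull
  # Fires while less than 20% of the root filesystem is available.
  expr: |
    node_filesystem_avail_bytes{mountpoint="/"}
      / node_filesystem_size_bytes{mountpoint="/"} < 0.2
```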
things are probably clear: if the alert fires for foo.example.com, and then suddenly no longer fires while the node_* metrics for foo.example.com are down, then it's probably just no longer firing because of the target being down, so keep it artificially going.

But what if an alert mixes metrics from different instances? Simply saying "if any of them fails, do the `keep_firing_for_with_considering_up`" may be too simple, because there could be alerts like (pseudo code):
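For example, something along these lines (entirely hypothetical metric names):

```yaml
# Hypothetical mixed-instance alert: fires only while foo on node-1 is
# non-zero AND bar on node-2 exceeds a limit. If foo on node-1 drops to 0,
# the whole expression stops returning a result, no matter whether node-2
# is up or down.
expr: |
  foo{instance="node-1"} > 0
    and on() bar{instance="node-2"} > 100
```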
where the alert should stop firing if the metric on node-1 becomes 0 (and that node is up), even if node-2 is currently down, because it's clear anyway that the overall condition would no longer produce a result.
Thanks,
Chris.
[0] https://groups.google.com/g/prometheus-users/c/Ihh1Dk0puQw/m/cNaVrEjqAQAJ (at the very bottom)
[1] #14084