I've found that for some graphs I get doubles values from Prometheus where should be just ones:
Query I use:
increase(signups_count[4m])
Scrape interval is set to the recommended maximum of 2 minutes.
If I query the actual data stored:
curl -gs 'localhost:9090/api/v1/query?query=(signups_count[1h])'
"values":[
[1515721365.194, "579"],
[1515721485.194, "579"],
[1515721605.194, "580"],
[1515721725.194, "580"],
[1515721845.194, "580"],
[1515721965.194, "580"],
[1515722085.194, "580"],
[1515722205.194, "581"],
[1515722325.194, "581"],
[1515722445.194, "581"],
[1515722565.194, "581"]
],
I see that there were just two increases. And indeed if I query for these times I see an expected result:
curl -gs 'localhost:9090/api/v1/query_range?step=4m&query=increase(signups_count[4m])&start=1515721965.194&end=1515722565.194'
"values": [
[1515721965.194, "0"],
[1515722205.194, "1"],
[1515722445.194, "0"]
],
But Grafana (and Prometheus in the GUI) tends to set a different step
in queries, with which I get a very unexpected result for a person unfamiliar with internal workings of Prometheus.
curl -gs 'localhost:9090/api/v1/query_range?step=15&query=increase(signups_count[4m])&start=1515721965.194&end=1515722565.194'
... skip ...
[1515722190.194, "0"],
[1515722205.194, "1"],
[1515722220.194, "2"],
[1515722235.194, "2"],
... skip ...
Knowing that increase()
is just a syntactic sugar for a specific use-case of the rate()
function, I guess this is how it is supposed to work given the circumstances.
How to avoid such situations? How do I make Prometheus/Grafana show me ones for ones, and twos for twos, most of the time? Other than by increasing the scrape interval (this will be my last resort).
I understand that Prometheus isn't an exact sort of tool, so it is fine with me if I would have a good number not at all times, but most of the time.
What else am I missing here?
This is known as aliasing and is a fundamental problem in signal processing. You can improve this a bit by increasing your sample rate, a 4m range is a bit short with a 2m range. Try a 10m range.
Here for example the query executed at 1515722220 only sees the [email protected] and [email protected] samples. That's an increase of 1 over 2 minutes, which extrapolated over 4 minutes is an increase of 2 - which is as expected.
Any metrics-based monitoring system will have similar artifacts, if you want 100% accuracy you need logs.