increase() in Prometheus sometimes doubles values: how to avoid?

sanmai picture sanmai · Jan 12, 2018 · Viewed 10.5k times · Source

I've found that for some graphs I get doubles values from Prometheus where should be just ones:

Graph with twos above bars

Query I use:

increase(signups_count[4m])

Scrape interval is set to the recommended maximum of 2 minutes.

If I query the actual data stored:

curl -gs 'localhost:9090/api/v1/query?query=(signups_count[1h])'

"values":[
     [1515721365.194, "579"],
     [1515721485.194, "579"],
     [1515721605.194, "580"],
     [1515721725.194, "580"],
     [1515721845.194, "580"],
     [1515721965.194, "580"],
     [1515722085.194, "580"],
     [1515722205.194, "581"],
     [1515722325.194, "581"],
     [1515722445.194, "581"],
     [1515722565.194, "581"]
],

I see that there were just two increases. And indeed if I query for these times I see an expected result:

curl -gs 'localhost:9090/api/v1/query_range?step=4m&query=increase(signups_count[4m])&start=1515721965.194&end=1515722565.194'

"values": [
     [1515721965.194, "0"],
     [1515722205.194, "1"],
     [1515722445.194, "0"]
],

But Grafana (and Prometheus in the GUI) tends to set a different step in queries, with which I get a very unexpected result for a person unfamiliar with internal workings of Prometheus.

curl -gs 'localhost:9090/api/v1/query_range?step=15&query=increase(signups_count[4m])&start=1515721965.194&end=1515722565.194'

... skip ...
 [1515722190.194, "0"],
 [1515722205.194, "1"],
 [1515722220.194, "2"],
 [1515722235.194, "2"],
... skip ...

Knowing that increase() is just a syntactic sugar for a specific use-case of the rate() function, I guess this is how it is supposed to work given the circumstances.

How to avoid such situations? How do I make Prometheus/Grafana show me ones for ones, and twos for twos, most of the time? Other than by increasing the scrape interval (this will be my last resort).

I understand that Prometheus isn't an exact sort of tool, so it is fine with me if I would have a good number not at all times, but most of the time.

What else am I missing here?

Answer

brian-brazil picture brian-brazil · Jan 12, 2018

This is known as aliasing and is a fundamental problem in signal processing. You can improve this a bit by increasing your sample rate, a 4m range is a bit short with a 2m range. Try a 10m range.

Here for example the query executed at 1515722220 only sees the [email protected] and [email protected] samples. That's an increase of 1 over 2 minutes, which extrapolated over 4 minutes is an increase of 2 - which is as expected.

Any metrics-based monitoring system will have similar artifacts, if you want 100% accuracy you need logs.