I have a question about calculating response times with Prometheus summary metrics.
I created a summary metric that contains not only the service name but also the complete path and the HTTP method.
Now I am trying to calculate the average response time for the complete service. I read the article about "rate then sum", and either I do not understand how the calculation is done, or the calculation is, IMHO, not correct.
As far as I can tell from what I read, this should be the correct way to calculate the response time per second:
sum by(service_id) (
rate(request_duration_sum{status_code=~"2.*"}[5m])
/
rate(request_duration_count{status_code=~"2.*"}[5m])
)
As I understand it, this creates the "duration per second" value (rate of sum / rate of count) for each subset and then sums those values per service_id.
That looks absolutely wrong to me - but maybe it does not work the way I understand it.
Another way to get an equal-looking result is this:
sum without (path,host) (
rate(request_duration_sum{status_code=~"2.*"}[5m])
/
rate(request_duration_count{status_code=~"2.*"}[5m])
)
If I ignored everything I had read, I would try it the following way:
rate(sum by(service_id) request_duration_sum{status_code=~"2.*"}[5m])
/
rate(sum by(service_id) request_duration_count{status_code=~"2.*"}[5m])
But this does not work at all (instant vector vs. range vector, and so on).
All of these examples aggregate incorrectly, because they average averages. You want:
sum without (path,host) (
rate(request_duration_sum{status_code=~"2.*"}[5m])
)
/
sum without (path,host) (
rate(request_duration_count{status_code=~"2.*"}[5m])
)
This returns the average latency per status_code, plus any other remaining labels.
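To see why averaging per-path averages goes wrong, here is a small numeric sketch (plain Python, with made-up request counts and durations) comparing the two aggregation orders:

```python
# Hypothetical per-path data for one service over a scrape window:
# path A: 100 requests taking 50 s in total  -> average 0.5 s
# path B:   1 request  taking  2 s in total  -> average 2.0 s
paths = [
    {"count": 100, "duration_sum": 50.0},
    {"count": 1,   "duration_sum": 2.0},
]

# "Sum of averages": divide per path first, then sum (the queries above).
# The rarely-hit path B dominates regardless of its tiny traffic share.
sum_of_avgs = sum(p["duration_sum"] / p["count"] for p in paths)

# "Sum, then divide": aggregate numerator and denominator separately,
# which is what the corrected query does.
total_sum = sum(p["duration_sum"] for p in paths)
total_count = sum(p["count"] for p in paths)
true_avg = total_sum / total_count

print(sum_of_avgs)  # 2.5 -- not a meaningful latency for the service
print(true_avg)     # ~0.515 s -- the real average over all 101 requests
```

The numbers are invented, but the effect is general: dividing before summing gives every path equal weight, while summing the _sum and _count series first weights each path by its request volume.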