Hello,
we have HAProxy set up with ~200 backend servers. I noticed that haproxy_server_response_time_average_seconds is acceptable (around 400 ms), but haproxy_server_total_time_average_seconds is close to 2 minutes for each server, and I would like to debug this. What is the fundamental difference between response_time and total_time? Since I don't know what each one is made up of, I can't tell why the difference is this big. I would also appreciate any ideas for debugging the issue. Thanks
According to the Prometheus field descriptions, haproxy_server_response_time_average_seconds is the average response time over the last 1024 successful connections (response time only), whereas haproxy_server_total_time_average_seconds is the average total time over the last 1024 successful connections, i.e. the total duration of the stream (queue, connect and response times added together, plus any additional time not tracked by those sub-counters).
If the total time is higher than expected, check the individual timers that are available (queue time, connect time) to see where the time is being spent.
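As a rough sketch of that check, the snippet below groups the four average timers per server from Prometheus text output so they can be compared side by side. It assumes the exporter also exposes haproxy_server_queue_time_average_seconds and haproxy_server_connect_time_average_seconds with proxy and server labels; the sample text, the be_app and srv1 names and all values are made up for illustration.

```python
import re
from collections import defaultdict

# Made-up sample of Prometheus text output; real values come from your exporter.
SAMPLE = """\
haproxy_server_queue_time_average_seconds{proxy="be_app",server="srv1"} 0.002
haproxy_server_connect_time_average_seconds{proxy="be_app",server="srv1"} 0.001
haproxy_server_response_time_average_seconds{proxy="be_app",server="srv1"} 0.4
haproxy_server_total_time_average_seconds{proxy="be_app",server="srv1"} 118.0
"""

# Matches the four per-server average timers (metric names are assumed, see above).
LINE = re.compile(
    r'^haproxy_server_(queue|connect|response|total)_time_average_seconds'
    r'\{(?P<labels>[^}]*)\}\s+(?P<value>\S+)$'
)
LABEL = re.compile(r'(\w+)="([^"]*)"')


def timers_by_server(text):
    """Return {(proxy, server): {"queue": s, "connect": s, "response": s, "total": s}}."""
    out = defaultdict(dict)
    for line in text.splitlines():
        m = LINE.match(line.strip())
        if not m:
            continue  # comments, blank lines and unrelated metrics
        labels = dict(LABEL.findall(m.group("labels")))
        key = (labels.get("proxy"), labels.get("server"))
        out[key][m.group(1)] = float(m.group("value"))
    return dict(out)


for (proxy, server), t in timers_by_server(SAMPLE).items():
    print(proxy, server, t)
```

If queue and connect time turn out to be small as well, the gap has to come from time that none of these three timers track.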
Also, the response time doesn't cover the complete duration of the response: it reflects only the time spent waiting for the first response byte from the server. So if the server sends a large amount of data, the total time can be expected to be much larger than the response time, because most of the time may be spent between the first byte of the response and its actual end.
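To put numbers on that, here is a small sketch using the figures from the question. The helper name untracked_seconds is made up for this example, and since the question gives no queue or connect time, both are assumed to be zero. Subtracting averages only gives a rough estimate, not an exact per-request breakdown.

```python
def untracked_seconds(total, response, queue=0.0, connect=0.0):
    """Average time not explained by the queue, connect and response timers."""
    return total - (queue + connect + response)


# Figures from the question: ~400 ms response time, ~2 minutes total time.
# Queue and connect times are not given there, so they are assumed to be zero.
rest = untracked_seconds(total=120.0, response=0.4)
print(f"roughly {rest:.1f} s of {120.0:.0f} s falls outside the tracked timers")
```

With these inputs nearly all of the total time falls after the first response byte, which would be consistent with large responses or long-lived streams, though the metrics alone don't say which.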