Hi, I’m hoping to get some more insight on how retries, error logs and HTTP 5xx error metrics interact.
What I’m seeing is HTTP 5xx responses reported in the haproxy_backend_http_responses_total metrics. Some of these error responses don’t correspond to errors reported in the logs, and some of them seem like they should be basically impossible given the aggressive retry config on the backends (e.g. simple GET requests, not huge POST requests which could exceed buffer limits, against a backend which proxies an S3 bucket with multiple retries and
My main question is: do the frontend (or backend) *_http_response_* metrics actually always correspond to the final HTTP responses sent to active clients, or will they include records of other kinds of responses like:
- Error responses from a backend which are then retried before anything is sent to the client
- Internal errors generated because a client disconnected part-way through a request (so the response was never fully sent to the client)
I’m really trying to answer the question “Are we serving up 5xx responses to our actual users?”, so if there’s another stat which can answer that more accurately, then I’d be keen to learn about it!
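To cross-check the metric against the logs, something like the rough sketch below could split logged 5xx responses by termination state. It assumes the default `option httplog` line format, which may not match our actual config. The names `parse` and `summarize_5xx` and the sample lines are made up for illustration. Treating a leading `C` or `c` termination flag as "client side" is my reading of the docs, not confirmed behaviour.

```python
import re
from collections import Counter

# Pattern for HAProxy's default HTTP log format ("option httplog").
# Assumption: a custom log-format would need a different pattern.
LINE_RE = re.compile(
    r'(?P<client>\S+) \[(?P<date>[^\]]+)\] (?P<frontend>\S+) '
    r'(?P<backend>[^/\s]+)/(?P<server>\S+) '
    r'(?P<timers>[\d/+-]+) (?P<status>-1|\d{3}) (?P<bytes>\+?\d+) '
    r'\S+ \S+ (?P<term>\S{4}) '
    r'\d+/\d+/\d+/\d+/(?P<retries>\+?\d+) '
)


def parse(line):
    """Return status, termination state and retry count, or None if no match."""
    m = LINE_RE.search(line)
    if m is None:
        return None
    return {
        "backend": m["backend"],
        "status": int(m["status"]),
        "term": m["term"],
        # The retry counter is prefixed with '+' when the request was redispatched.
        "retries": int(m["retries"].lstrip("+")),
    }


def summarize_5xx(lines):
    """Count logged 5xx responses, split by whether the client side ended the session."""
    counts = Counter()
    for line in lines:
        rec = parse(line)
        if rec is None or not 500 <= rec["status"] <= 599:
            continue
        # First termination flag: 'C' = client abort, 'c' = client timeout.
        side = "client_side" if rec["term"][0] in "Cc" else "other"
        counts[side] += 1
    return counts


# Synthetic sample lines, not taken from our real logs.
SAMPLE = [
    'haproxy[1]: 10.0.1.2:33317 [06/Feb/2009:12:14:14.655] http-in static/srv1 '
    '10/0/30/69/109 200 2750 - - ---- 1/1/1/1/0 0/0 "GET /index.html HTTP/1.1"',
    'haproxy[1]: 10.0.1.2:33318 [06/Feb/2009:12:14:15.000] http-in static/srv1 '
    '10/0/30/-1/50 503 212 - - SC-- 1/1/1/1/3 0/0 "GET /a HTTP/1.1"',
    'haproxy[1]: 10.0.1.2:33319 [06/Feb/2009:12:14:16.000] http-in static/srv1 '
    '10/0/30/-1/50 502 0 - - CH-- 1/1/1/1/0 0/0 "GET /b HTTP/1.1"',
]

print(summarize_5xx(SAMPLE))
```

If the "other" bucket here stayed well below what the metric reports, that would suggest the metric is counting responses that never reached a live client.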
Running HAProxy 2.4.3
Metrics extracted with haproxy-exporter 0.7.1 (not with the new built-in Prometheus endpoint - apologies if the metric names don’t exactly line up)
We log via syslog at the default level with:
When we implemented a more aggressive retry config in HAProxy, our metrics for haproxy_backend_connection_errors_total dropped from a few per hour to (near) zero. I would naively have expected to still see errors here, even if the request was eventually retried successfully, but I’m guessing that this metric only records connection errors which affected the final response: is that right?
Thanks for any insights you can provide!