Hello, I am trying to dive deep into routing failures due to timeout connect (under the hood it appears to be timeout queue) but I’m only seeing queue times reach the max of the first request, even though there is a retry/redispatch. Can you clarify the expected time values in the following configuration/scenario:
timeout connect 600ms
timeout client 11s
timeout server 900ms
balance roundrobin
option http-keep-alive
retries 1
option redispatch 1
retry-on empty-response 0rtt-rejected conn-failure response-timeout 503 500
http-reuse always
We are seeing some failures with the following log format:
Nov 14 12:14:41 localhost haproxy[27714]: <client_ip> [14/Nov/2024:12:14:40.518] frontend_srv~ backend_srv/srv_name 0 601 -1 -1 +1200 503 +217 - - sC-- 2368/2368/2363/6/+1 0/0 "GET /ws HTTP/1.1" <backend_source_ip> <backend_source_port> <server_ip> <port> <stickiness_id> - - - - <request_id> - -
Based on the log format : HAProxy version 2.8.12-12 - Configuration Manual
This suggests that the request had a queue time of 600ms and a total active request time of 1200ms and failed due to timeout (this makes sense in theory if we had 2 failures, each timing out at 600ms).
But shouldn’t the total queue time also be 1200ms? Or does queue time only track the time of the request that’s actually logged in the failure? If it’s total queue time, where else could the extra time be spent?
When we see timeout failures, it always follows this structure, where the queue_time is 600ms (equal to our timeout connect
) but the active_req_time
is 1200ms.
Queue time normally only reflects actual time spent in queue, so retry time should not be accounted for.
But there is a known side effect due to the use of option redispatch
which you seem to be using.
In this case, queue time may end up growing up to timeout connect
* retries
even if the queue was not involved at all on the connection. Thus I would say that if redispatch occurred (can be checked by the presence of the ‘+’ leading character before %rc (retries) logformat tag), it would be wiser to ingore queue time.
In your log %rc equals “+1”, which means 1 retry + redispatch, thus queue time is not accurate (due to the redispatch side effect)
Hi, thanks for the info.
Can you provide more info on this side effect:
But there is a known side effect due to the use of option redispatch which you seem to be using.
In this case, queue time may end up growing up to timeout connect * retries even if the queue was not involved at all on the connection.
Do you know what specifically is happening? Is there any documentation on this side effect? More importantly, if we observe queue time growing to timeout connect on retries and it’s hitting this side effect, has the queue time actually grown on any of the requests or is it totally an incorrect value? I ask because we need to know the actual reason for this failure, and we were assuming it was a load problem in our HAProxy instances but if the queue_time value is erroneous then we would need a new reason.
I guess asked a different way: what could cause this log value based on our connection configuration? The server_conn_time
and server_resp_time
values were also -1, so I don’t think it can be due to the response time of our backends, since this indicates that the request never actually got sent to backend right? It has to be a queueing time in haproxy, otherwise what else could be the reason?
It is not documented, it is only observed: when redispatch occured on the request, the Queue time will not not reflect actual Queue time… but time spent from accept to redispatch…
Aside from that (ie: when no redispatch occured), queue time counts the time spent in queue as it is meant to.
For redispatch to occur, it means the server that haproxy chose to sent the request was unable to handle the request in time, and thus haproxy (after X retries attempts, in your case 1) tried to redispatch the request on another server in the hope that this new server will be able to handle the request.
Unless maxconn proxy setting is used and it reached, haproxy won’t queue the requests.
Since server_conn_time is -1 I suspect there is an issue between haproxy and the server: perhaps a network issue, or the server is not able to accept new connections anymore…
Try temporarily disabling redispatch option to further diagnose the issue if you don’t expect servers to fail
Hi, thanks for the info!
Just to confirm, if there is a redisaptch, we can’t trust the queue_time
being the time spent in the queue, but we can confirm that the request did in fact timeout right? Like the cause of the failure is due to our timeout connect
setting being hit on each request? But we just don’t know that the time spent was actually in the queue. Is that correct?
Can you help me understand this request log in the case of this limitation?
Nov 15 11:16:24 localhost haproxy[32136]: <client_ip> [15/Nov/2024:11:16:23.725] frontend_srv~ backend_srv/srv_name 0 599 4 44 +647 101 +153 - - ---- 1322/1322/1321/4/+1 0/0 "GET /ws HTTP/1.1" <backend_source_ip> <backend_source_port> <server_ip> <port> <stickiness_id>0 - - - - <request_id> - -
In this case, it appears the redispatch was successful (status code 101 with retries +1)
According to log, queue_time was 600ms and totaactive_req_time was 647. So the first try failed, due to a timeout, not spending that time in the queue but for some other reason. The new request succeeded in normal time, so the total time was 647?
BTW, a bit unrelated, but is there any way to see the backend server info of all previous attempts in a redispatch case?
Not necessarily a timeout, because a retry can occur because of connection failure (which can be caused by timeout but also by a network error for instance), and a delay of MIN(timeout connect, 1s) is applied between each retry attempt (documented here: HAProxy version 3.1-dev12-14 - Configuration Manual)
Is this delay going to happen in our configuration though? I had thought no, since we never retry on the same server and the documentation had that as the reason for the delay. But now that I’m rereading it isn’t super clear whether there will be delay of min(“timeout connect”, one second) if retries are enabled but redispatch on first retry
Oh yes indeed nice catch! In your case you are correct no delay should be applied because retries is set to 1 and redispatch option is set with default interval. So it means that the first retry is directly performed on another server with no delay (as the delay is only applied for retries occurring on the same server)
So if Queue time shows 600ms after re-dispatch, it could possibly mean that the connection attempt to initial server timed out (as timeout connect is set to 600ms in your case), and only then haproxy attempted to redispatch the request on another server.