Haproxy retry and redispatch not working as expected

jaykb77 · January 15, 2025, 11:04am

We have haproxy-2.2.6 and are using a basic config for our backend hec-backend, which load-balances a REST API based service.

backend hec-backend
    mode http
    option httpchk
    http-check send meth GET  uri /services/collector/health
    server hec_192.168.10.5 192.168.10.5:8088 maxconn 16 weight 10  check port 8088
    server hec_192.168.10.6 192.168.10.6:8088 maxconn 16 weight 10  check port 8088
    server hec_192.168.10.7 192.168.10.7:8088 maxconn 16 weight 10  check port 8088

Recently we have been seeing intermittent 503 errors from some of the backend machines due to high load. The permanent solution is to upgrade hardware, but that can take some time, so we are trying to see if we can redispatch such requests away from the high-load machines to mitigate the issue in the meantime.

We tried adding the following to this backend:

    retries 3
    retry-on all-retryable-errors 502 503 504
    option redispatch 1

The intention was to redispatch the request to another backend machine whenever we see a failure, primarily 5xx errors. When testing this with curl at a low request rate, it worked as expected, but at a high request rate in an actual scenario, most of the requests still end up with 503. It is not clear why haproxy is not redispatching these requests.

Please share any thoughts on why this may be happening.

Example of error (while still using redispatch)

<30>Jan  3 11:14:16 splunk_haproxy[2638905]: 192.168.114.15:52148 [03/Jan/2025:11:14:16.907] http-in~ hec-backend/hec_192.168.10.5 0/0/0/0/0 503 66 - - ---- 4/2/0/0/0 0/0 "POST /services/collector/event HTTP/1.1"

Example of a successful redispatch (I think +1 denotes a redispatch)

<30>Jan  3 11:19:18 splunk_haproxy[2638905]: 192.168.114.15:57354 [03/Jan/2025:11:19:18.410] http-in~ hec-backend/hec_198.18.10.6 0/0/0/1/1 200 237 - - ---- 2/2/0/0/+1 0/0 "POST /services/collector/event HTTP/1.1"
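
To compare log lines like the two above at volume, here is a minimal Python sketch that pulls out the status, bytes read, and retry counter. It assumes the counters field follows HAProxy's default HTTP log format order (`%ac/%fc/%bc/%sc/%rc`), where the last value is the retry count and a leading `+` marks a redispatch. The name `parse_counters` and the regex are illustrative only, not part of HAProxy.

```python
import re

# Matches "<status> <bytes> <CC> <CS> <tsc> ac/fc/bc/sc/rc" in an HTTP log line.
# Assumption: field order of HAProxy's default HTTP log format.
LINE_RE = re.compile(
    r" (?P<status>\d{3}) (?P<bytes>\d+) \S+ \S+ \S{4} "
    r"\d+/\d+/\d+/\d+/(?P<rc>\+?\d+) "
)

def parse_counters(line):
    """Return status, bytes read, retries and redispatch flag, or None."""
    m = LINE_RE.search(line)
    if m is None:
        return None
    rc = m.group("rc")
    return {
        "status": int(m.group("status")),
        "bytes_read": int(m.group("bytes")),
        "retries": int(rc.lstrip("+")),      # number after the optional "+"
        "redispatched": rc.startswith("+"),  # "+" prefix means a redispatch happened
    }

# The two sample lines from this post, copied verbatim.
FAILED = '<30>Jan  3 11:14:16 splunk_haproxy[2638905]: 192.168.114.15:52148 [03/Jan/2025:11:14:16.907] http-in~ hec-backend/hec_192.168.10.5 0/0/0/0/0 503 66 - - ---- 4/2/0/0/0 0/0 "POST /services/collector/event HTTP/1.1"'
REDISPATCHED = '<30>Jan  3 11:19:18 splunk_haproxy[2638905]: 192.168.114.15:57354 [03/Jan/2025:11:19:18.410] http-in~ hec-backend/hec_198.18.10.6 0/0/0/1/1 200 237 - - ---- 2/2/0/0/+1 0/0 "POST /services/collector/event HTTP/1.1"'

print(parse_counters(FAILED))
print(parse_counters(REDISPATCHED))
```

On the failing line this reports zero retries and no redispatch, which suggests the 503 was passed through on the first attempt rather than after the retries were used up.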

adarragon · January 16, 2025, 4:03pm

Could it be that multiple servers fail under load, so even if redispatch works, all 3 retries failed and haproxy eventually gave up?

adarragon · January 16, 2025, 4:05pm

Hmm, I think in that case you should see failing requests with /+x, where x would be the number of failed redispatches, and that doesn’t seem to be the case here.

Since the “non-working” redispatch is related to a POST request, I suspect that maybe you are hitting the same limitation as described here?

Relevant doc:
https://docs.haproxy.org/dev/configuration.html#retry-on

jaykb77 · January 17, 2025, 2:42pm

Thanks for your response. I tried increasing tune.bufsize up to 6 times the default value, and still no luck.
Interestingly, the %B field (bytes read) in the logs is always 66 for the failing requests.

jaykb77 · January 17, 2025, 2:48pm

Pasting config

global
    maxconn 256
    log 127.0.0.1 local0
    tune.ssl.default-dh-param 2048
defaults
    log fd@1 daemon debug
    mode http
    log-format %ci:%cp\ [%tr]\ %ft\ %b/%s\ %TR/%Tw/%Tc/%Tr/%Ta\ %ST\ %B\ %CC\ %CS\ %tsc\ %ac/%fc/%bc/%sc/%rc\ %sq/%bq\ %hr\ %hs\ %{+Q}r
    timeout connect 5000ms
    timeout client 50000ms
    timeout server 50000ms

    timeout queue 60000ms
    timeout http-request 15000ms
    timeout http-keep-alive 15000ms
    option redispatch
    option forwardfor
    option http-server-close

frontend http-in
    bind *:80
    bind *:443 ssl crt /a/file.pem
    redirect scheme https if !{ ssl_fc }
    default_backend splunk_servers
    use_backend hec-backend if { path_beg -i /services/collector/event }

And I was simulating a bad backend by running netcat as below on one backend server

while true; do echo -e "HTTP/1.1 503 Service Unavailable\r\nContent-Length: 0\r\n\r\n" | nc -l 8088; done
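
A possible alternative to the netcat loop, as a minimal sketch using only Python's standard library: a stub backend that answers every GET and POST with an empty 503. The names `Always503` and `make_server` are made up for this example, and port 8088 is taken from the backend config above. It is not a drop-in equivalent of the loop, because it reads the whole request before replying and accepts connections concurrently with the test client rather than one `nc -l` at a time.

```python
from http.server import BaseHTTPRequestHandler, HTTPServer

class Always503(BaseHTTPRequestHandler):
    """Answers every GET and POST with an empty 503 response."""

    protocol_version = "HTTP/1.1"  # match the status line sent by the netcat loop

    def _reply(self):
        # Drain the request body first so the client is not cut off mid-upload.
        length = int(self.headers.get("Content-Length", 0))
        if length:
            self.rfile.read(length)
        self.send_response(503)  # reason phrase defaults to "Service Unavailable"
        self.send_header("Content-Length", "0")
        self.end_headers()

    do_GET = _reply
    do_POST = _reply

    def log_message(self, fmt, *args):
        pass  # suppress per-request logging on stderr

def make_server(port=8088):
    """Build the stub server; call .serve_forever() on the result to run it."""
    return HTTPServer(("", port), Always503)
```

To run it on the test backend, call `make_server().serve_forever()` and stop it with Ctrl-C.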

jaykb77 · January 17, 2025, 2:50pm

For clarity, I had removed the health check config below to run this test, as the health check won't let me simulate the scenario with netcat:

    option httpchk
    http-check send meth GET  uri /services/collector/health

Sybren · March 13, 2025, 10:52pm

Why not use queueing?
It might add a few ms to a request, but it should be good enough for peak connections.
