HAProxy retry and redispatch not working as expected

We are running haproxy-2.2.6 with a basic config for our backend hec-backend, which load-balances a REST API based service.

backend hec-backend
    mode http
    option httpchk
    http-check send meth GET  uri /services/collector/health
    server hec_192.168.10.5 192.168.10.5:8088 maxconn 16 weight 10  check port 8088
    server hec_192.168.10.6 192.168.10.6:8088 maxconn 16 weight 10  check port 8088
    server hec_192.168.10.7 192.168.10.7:8088 maxconn 16 weight 10  check port 8088

Recently we have been seeing intermittent 503 errors from some of the backend machines due to high load. The permanent solution is to upgrade the hardware, but since that can take some time, we are trying to see whether we can redispatch such requests away from the heavily loaded machines to mitigate the issue in the meantime.

We tried adding something like the below to this backend:

    retries 3
    retry-on all-retryable-errors 502 503 504
    option redispatch 1
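
For reference, this is roughly how the backend looks with those directives in place. It is a sketch assembled from the snippets in this post, not a tested config. The comments reflect my reading of the retry-on and option redispatch documentation and should be checked against the docs for the version in use.

```haproxy
backend hec-backend
    mode http
    option httpchk
    http-check send meth GET uri /services/collector/health
    # Allow up to 3 retries per request
    retries 3
    # As I read the retry-on docs, all-retryable-errors already covers
    # 500, 502, 503 and 504, so the explicit codes are redundant but harmless
    retry-on all-retryable-errors 502 503 504
    # Interval of 1: pick a different server on every retry
    option redispatch 1
    server hec_192.168.10.5 192.168.10.5:8088 maxconn 16 weight 10 check port 8088
    server hec_192.168.10.6 192.168.10.6:8088 maxconn 16 weight 10 check port 8088
    server hec_192.168.10.7 192.168.10.7:8088 maxconn 16 weight 10 check port 8088
```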

The intention was to redispatch a request to another backend machine whenever we see a failure, primarily 5xx errors. When testing this with curl at a low request rate, it worked as expected, but at a high request rate in an actual scenario, most of the requests still end up with 503. It is not clear why haproxy is not redispatching those requests.

Please share any thoughts on why this may be happening.

Example of error (while still using redispatch)

<30>Jan  3 11:14:16 splunk_haproxy[2638905]: 192.168.114.15:52148 [03/Jan/2025:11:14:16.907] http-in~ hec-backend/hec_192.168.10.5 0/0/0/0/0 503 66 - - ---- 4/2/0/0/0 0/0 "POST /services/collector/event HTTP/1.1"

Example of a successful redispatch (I think the +1 denotes a redispatch)

<30>Jan  3 11:19:18 splunk_haproxy[2638905]: 192.168.114.15:57354 [03/Jan/2025:11:19:18.410] http-in~ hec-backend/hec_198.18.10.6 0/0/0/1/1 200 237 - - ---- 2/2/0/0/+1 0/0 "POST /services/collector/event HTTP/1.1"

Could it be that multiple servers fail under load, so that even if redispatch works, all 3 retries fail and haproxy eventually gives up?

Hmm, I think in that case you should see failing requests with /+x, where +x would be the number of failed redispatches, and that doesn’t seem to be the case here.
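
One way to check this over a whole log rather than single lines is to tally requests by status code and retries counter. The function below is a minimal sketch. The name `summarize_retries` is made up for illustration, and it assumes the log-format and syslog prefix shown in this thread, so that with whitespace splitting field 10 is %ST and field 15 is %ac/%fc/%bc/%sc/%rc.

```shell
# Tally HAProxy log lines by status code and by the retries counter (%rc).
# Assumption: the log-format and syslog prefix from this thread, i.e.
# field 10 = %ST and field 15 = %ac/%fc/%bc/%sc/%rc after whitespace splitting.
summarize_retries() {
    awk '
        {
            status = $10
            n = split($15, c, "/")
            if (n != 5) next                 # skip lines with a different layout
            rc = c[5]                        # e.g. "0" or "+1"
            redispatched = "no"
            if (substr(rc, 1, 1) == "+") {   # a leading "+" marks a redispatch
                redispatched = "yes"
                rc = substr(rc, 2)
            }
            count[status " retries=" rc " redispatched=" redispatched]++
        }
        END {
            for (k in count) print count[k], k
        }
    ' "$@" | sort -k2
}
```

It reads the files given as arguments, or stdin if there are none, for example `summarize_retries haproxy.log` (the file name is a placeholder). If the 503 lines all show retries=0, haproxy never attempted a retry for them; if they show retries=3, the retries were attempted and exhausted.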

Since the “non-working” redispatch is related to a POST request, I suspect that maybe you are hitting the same limitation as described here?

Relevant doc:
https://docs.haproxy.org/dev/configuration.html#retry-on

Thanks for your response. I tried increasing tune.bufsize up to 6 times the default value, still with no luck.
Interestingly, the %B field of the logs (bytes read) is always 66 for the failing requests.
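
For context, the retry-on documentation notes that a request which does not fit in a single buffer is never retried, which is why tune.bufsize matters here. The setting goes in the global section. The fragment below is only a sketch of the largest value described above, assuming the stock default of 16384 bytes (6 x 16384 = 98304). A custom build could have a different default.

```haproxy
global
    # Assumed default is 16384 bytes; 6 x 16384 = 98304
    tune.bufsize 98304
```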

Pasting config

global
    maxconn 256
    log 127.0.0.1 local0
    tune.ssl.default-dh-param 2048
defaults
    log fd@1 daemon debug
    mode http
    log-format %ci:%cp\ [%tr]\ %ft\ %b/%s\ %TR/%Tw/%Tc/%Tr/%Ta\ %ST\ %B\ %CC\ %CS\ %tsc\ %ac/%fc/%bc/%sc/%rc\ %sq/%bq\ %hr\ %hs\ %{+Q}r
    timeout connect 5000ms
    timeout client 50000ms
    timeout server 50000ms

    timeout queue 60000ms
    timeout http-request 15000ms
    timeout http-keep-alive 15000ms
    option redispatch
    option forwardfor
    option http-server-close

frontend http-in
    bind *:80
    bind *:443 ssl crt /a/file.pem
    redirect scheme https if !{ ssl_fc }
    default_backend splunk_servers
    use_backend hec-backend if { path_beg -i /services/collector/event }

I simulated a bad backend by running netcat as below on one of the backend servers:

while true; do echo -e "HTTP/1.1 503 Service Unavailable\r\nContent-Length: 0\r\n\r\n" | nc -l 8088; done

To be clear, I removed the following health check config to run this test, as the health check won't let me simulate the scenario with netcat:

    option httpchk
    http-check send meth GET  uri /services/collector/health