Can't switch to backup server

When I made the health check fail while requests were flowing to the primary servers (e.g. by stopping the primary server’s service on port 8080),
I expected the requests to fail over to the servers with the backup option set, but all requests returned a 503 error.

The log says “Running on backup”, but traffic never actually switches to the backup servers.

2024-12-16 17:34:00.023  Health check for server be_default/server2 failed, reason: Layer4 connection problem, info: "Connection refused", check duration: 1045ms, status: 9/10 UP.
2024-12-16 17:34:02.381  Health check for server be_default/server1 failed, reason: Layer4 connection problem, info: "Connection refused", check duration: 0ms, status: 9/10 UP.
2024-12-16 17:34:04.054  Health check for server be_default/server2 failed, reason: Layer4 connection problem, info: "Connection refused", check duration: 1029ms, status: 8/10 UP.
2024-12-16 17:34:05.383  Health check for server be_default/server1 failed, reason: Layer4 connection problem, info: "Connection refused", check duration: 0ms, status: 8/10 UP.
2024-12-16 17:34:07.055  Health check for server be_default/server2 failed, reason: Layer4 connection problem, info: "Connection refused", check duration: 0ms, status: 7/10 UP.
2024-12-16 17:34:09.431  Health check for server be_default/server1 failed, reason: Layer4 connection problem, info: "Connection refused", check duration: 1047ms, status: 7/10 UP.
2024-12-16 17:34:10.057  Health check for server be_default/server2 failed, reason: Layer4 connection problem, info: "Connection refused", check duration: 0ms, status: 6/10 UP.
2024-12-16 17:34:13.463  Health check for server be_default/server1 failed, reason: Layer4 connection problem, info: "Connection refused", check duration: 1031ms, status: 6/10 UP.
2024-12-16 17:34:14.102  Health check for server be_default/server2 failed, reason: Layer4 connection problem, info: "Connection refused", check duration: 1045ms, status: 5/10 UP.
2024-12-16 17:34:17.102  Health check for server be_default/server2 failed, reason: Layer4 connection problem, info: "Connection refused", check duration: 0ms, status: 4/10 UP.
2024-12-16 17:34:17.495  Health check for server be_default/server1 failed, reason: Layer4 connection problem, info: "Connection refused", check duration: 1031ms, status: 5/10 UP.
2024-12-16 17:34:20.104  Health check for server be_default/server2 failed, reason: Layer4 connection problem, info: "Connection refused", check duration: 0ms, status: 3/10 UP.
2024-12-16 17:34:20.495  Health check for server be_default/server1 failed, reason: Layer4 connection problem, info: "Connection refused", check duration: 0ms, status: 4/10 UP.
2024-12-16 17:34:23.105  Health check for server be_default/server2 failed, reason: Layer4 connection problem, info: "Connection refused", check duration: 0ms, status: 2/10 UP.
2024-12-16 17:34:23.496  Health check for server be_default/server1 failed, reason: Layer4 connection problem, info: "Connection refused", check duration: 0ms, status: 3/10 UP.
2024-12-16 17:34:26.105  Health check for server be_default/server2 failed, reason: Layer4 connection problem, info: "Connection refused", check duration: 0ms, status: 1/10 UP.
2024-12-16 17:34:26.498  Health check for server be_default/server1 failed, reason: Layer4 connection problem, info: "Connection refused", check duration: 0ms, status: 2/10 UP.
2024-12-16 17:34:29.107  Health check for server be_default/server2 failed, reason: Layer4 connection problem, info: "Connection refused", check duration: 0ms, status: 0/5 DOWN.
2024-12-16 17:34:29.107  Server be_default/server2 is DOWN. 1 active and 2 backup servers left. 36 sessions active, 0 requeued, 0 remaining in queue.
2024-12-16 17:34:29.498  Health check for server be_default/server1 failed, reason: Layer4 connection problem, info: "Connection refused", check duration: 0ms, status: 1/10 UP.
2024-12-16 17:34:33.559  Health check for server be_default/server1 failed, reason: Layer4 connection problem, info: "Connection refused", check duration: 1060ms, status: 0/5 DOWN.
2024-12-16 17:34:33.559  Server be_default/server1 is DOWN. 0 active and 2 backup servers left. Running on backup. 37 sessions active, 0 requeued, 0 remaining in queue.

Even after the above log entries appear, all requests still result in a 503 error.

2024-12-16 17:34:33.856  172.16.0.1:40000 [16/Dec/2024:17:34:27.223] fe_all be_default/server2 0/1599/-1/-1/6633 503 217 - - SC-- 1429/1429/1428/10/3 0/44 "POST /app/app HTTP/1.1"
2024-12-16 17:34:33.985  172.16.0.1:47000 [16/Dec/2024:17:34:32.385] fe_all be_default/<NOSRV> 0/1599/-1/-1/1599 503 217 - - sQ-- 1419/1419/1417/0/0 0/20 "POST /app/app HTTP/1.1" 

Is my configuration incorrect?

Supplementary information

  • If I stop the haproxy service while the primary server’s service is stopped, and then start haproxy again after some time, requests flow to the backup servers.
  • If I stop the primary server’s service while no requests are being sent, and then send requests afterwards, they flow to the backup servers.

version: 2.6.13
config: as below

###
### /etc/haproxy/haproxy.cfg
###

# Basic config mapping a listening IP:port to another host's IP:port with
# support for HTTP/1 and 2.

global
    chroot                  /var/lib/haproxy
    log                     127.0.0.1 daemon
    pidfile                 /var/run/haproxy.pid
    user                    haproxy
    group                   haproxy
    quiet
    nbthread                1
    stats socket            /var/lib/haproxy/stats.socket user app group appgrp level admin
    stats timeout           2m


defaults
    mode                    http
    log                     global
    option                  log-health-checks
    option                  httplog
    option                  httpclose
    retries                 3
    timeout http-request    10s
    timeout connect         3s
    timeout client          1m
    timeout http-keep-alive 10s
    timeout check           3s

frontend fe_all
    bind *:8080
    default_backend         be_default
    maxconn                 1000000

backend be_default
    balance                 roundrobin
    option                  httpchk GET /app/healthCheck
    option                  allbackups
    timeout queue           1600ms
    timeout server          1600ms
    server                  server1 server1.net:8080 maxconn 100 check inter 3s fall 10 rise 5
    server                  server2 server2.net:8080 maxconn 100 check inter 3s fall 10 rise 5
    server                  server3 server3.net:8080 maxconn 100 check inter 3s fall 10 rise 5 backup
    server                  server4 server4.net:8080 maxconn 100 check inter 3s fall 10 rise 5 backup

#--EOF--

Are the failing requests ones that were in flight at the moment the primary server was stopped?

When you shut down the service, haproxy does not immediately realize the server is DOWN: with fall set to 10, it takes 10 consecutive failed health checks before haproxy considers the server DOWN.

Thus haproxy may still route requests to servers that are still considered UP. If a request fails, haproxy will retry it on the same server up to 3 times (the default), unless “retries” or “option redispatch” are configured to change this behavior. (i.e. perhaps the server is only temporarily unavailable? By default haproxy tries to stick to the same server for consistency.)

In your case, I invite you to take a look at those 2 options and configure them to adjust haproxy behavior:
https://docs.haproxy.org/dev/configuration.html#4-option%20redispatch
https://docs.haproxy.org/dev/configuration.html#4.2-retries
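
For example, a minimal sketch in your backend (the values are illustrative, not a recommendation for your workload):

backend be_default
    # retry a failed connection up to 3 times
    retries                 3
    # allow a failed connection to be redispatched to another server
    # instead of insisting on the same one
    option                  redispatch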

Also, with the “observe” server keyword you can tell haproxy to adjust the health of a server based on failing production requests, instead of having to wait for all health checks to fail before considering the server DOWN. This can help haproxy react faster.
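
For instance, a sketch on one server line (the observe layer7, error-limit 10 and on-error mark-down values here are illustrative, not tuned for your setup):

    # count layer7 errors on live traffic; after 10 consecutive errors, mark the server DOWN
    server server1 server1.net:8080 maxconn 100 check inter 3s fall 10 rise 5 observe layer7 error-limit 10 on-error mark-down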

Also, if it’s planned maintenance, you can explicitly disable the server in haproxy using the runtime API, so that haproxy immediately stops routing requests to that server.
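
For example, over the stats socket already declared in your global section (socket path taken from your config; socat is one common client):

    # drain the server before maintenance...
    echo "disable server be_default/server1" | socat stdio /var/lib/haproxy/stats.socket
    # ...and re-enable it afterwards
    echo "enable server be_default/server1" | socat stdio /var/lib/haproxy/stats.socket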

Thank you for your reply.

Are the failing requests ones that were in flight at the moment the primary server was stopped?

No, the problem is not the requests that failed at the moment the primary server stopped, but the requests that continued to fail even after the primary server stopped.

After the primary server stopped, the health checks failed 10 times and the primary server was judged DOWN, yet requests continued to fail and never recovered.

I’m sorry for the misunderstanding.

The failing requests after the server is considered DOWN by haproxy seem to be POST requests. If they were initiated before the primary server was stopped, then perhaps they are not retried because haproxy is unable to retry requests that don’t fit in the global buffer:

Relevant doc:
https://docs.haproxy.org/dev/configuration.html#retry-on
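
A sketch of what enabling layer-7 retries could look like (the event list is illustrative; note that a request with a body can only be replayed while the body still fits in the buffer):

backend be_default
    retries                 3
    option                  redispatch
    # retry on connection failures and on empty or timed-out responses
    retry-on                conn-failure empty-response response-timeout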

Can you check if problematic requests are mostly related to POST requests?

I’ll check. Please wait.
However, as I recall, requests executed after the primary server was stopped also failed, so I wonder whether those are also not retried because they are not stored in the global buffer…

I tried sending a GET request, but it received a 503 error as well.
I also confirmed that requests made after the primary server was stopped, as mentioned above, returned 503 errors.

Can we get a peek at the haproxy logs for the failing GET requests?

The log for a failed GET request is as follows:
2025-01-21 17:36:44.806 172.16.0.1:40000 [21/Jan/2025:17:36:41.806] fe_all be_get/<NOSRV> 0/3000/-1/-1/3000 503 217 - - sQ-- 3722/3722/2411/0/0 0/1 "GET /app/get HTTP/1.1"
The backend settings for GET requests are separate; I had omitted them from the earlier description.
The full configuration is shown below.

###
### /etc/haproxy/haproxy.cfg
###

# Basic config mapping a listening IP:port to another host's IP:port with
# support for HTTP/1 and 2.

global
    chroot                  /var/lib/haproxy
    log                     127.0.0.1 daemon
    pidfile                 /var/run/haproxy.pid
    user                    haproxy
    group                   haproxy
    quiet
    nbthread                1
    stats socket            /var/lib/haproxy/stats.socket user app group appgrp level admin
    stats timeout           2m


defaults
    mode                    http
    log                     global
    option                  log-health-checks
    option                  httplog
    option                  httpclose
    retries                 3
    timeout http-request    10s
    timeout connect         3s
    timeout client          1m
    timeout http-keep-alive 10s
    timeout check           3s

frontend fe_all
    bind *:8080
    default_backend         be_default
    maxconn                 1000000
    acl                     acl_get path_beg /app/get
    use_backend             be_get if acl_get

backend be_default
    balance                 roundrobin
    option                  httpchk GET /app/healthCheck
    option                  allbackups
    timeout queue           1600ms
    timeout server          1600ms
    server                  server1 server1.net:8080 maxconn 100 check inter 3s fall 10 rise 5
    server                  server2 server2.net:8080 maxconn 100 check inter 3s fall 10 rise 5
    server                  server3 server3.net:8080 maxconn 100 check inter 3s fall 10 rise 5 backup
    server                  server4 server4.net:8080 maxconn 100 check inter 3s fall 10 rise 5 backup

backend be_get
    balance                 roundrobin
    option                  httpchk GET /app/healthCheck
    option                  allbackups
    timeout queue           3000ms
    timeout server          3000ms
    server                  server1 server1.net:8080 maxconn 200 check inter 3s fall 10 rise 5
    server                  server2 server2.net:8080 maxconn 200 check inter 3s fall 10 rise 5
    server                  server3 server3.net:8080 maxconn 200 check inter 3s fall 10 rise 5 backup
    server                  server4 server4.net:8080 maxconn 200 check inter 3s fall 10 rise 5 backup

#--EOF--

Here are some additional findings:

  • If the request rate is low, requests flow to the backup servers when I stop the primary servers.
  • If I make timeout queue short enough, requests flow to the backup servers when I stop the primary servers (see the sketch after this list).
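
For reference, “short enough” here means something like the following (100ms is an illustrative value, not the exact threshold):

backend be_default
    # fail queued requests fast instead of letting them wait out the full queue timeout
    timeout queue           100ms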

Did you try with “option redispatch” set on the backend?

Also, you may want to try the maxqueue server keyword, because by default the server queue is unlimited. Since you set maxconn to 200, requests may stack up while the server is going DOWN, until it is really considered DOWN by haproxy.
See: https://docs.haproxy.org/dev/configuration.html#5.2-maxqueue
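
A sketch of one server line with a bounded queue (maxqueue 10 is an illustrative value):

    # cap the per-server queue so excess requests are redispatched instead of piling up
    server server1 server1.net:8080 maxconn 200 maxqueue 10 check inter 3s fall 10 rise 5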

  • option redispatch
    I tried this, but it did not solve the issue. I tried retries 1 and 3.

  • maxqueue
    I tried this, but it did not solve the issue.
    I tried sizes of (be_default, be_get)=(5, 10), (100, 200).