Can't switch to backup server

When I made the health check fail while requests were flowing to the primary servers (e.g. by stopping the primary server’s service on port 8080),
I expected the requests to fail over to the servers with the backup option set, but all requests returned a 503 error.

The log says “Running on backup”, but traffic never actually switches to the backup servers.

2024-12-16 17:34:00.023  Health check for server be_default/server2 failed, reason: Layer4 connection problem, info: "Connection refused", check duration: 1045ms, status: 9/10 UP.
2024-12-16 17:34:02.381  Health check for server be_default/server1 failed, reason: Layer4 connection problem, info: "Connection refused", check duration: 0ms, status: 9/10 UP.
2024-12-16 17:34:04.054  Health check for server be_default/server2 failed, reason: Layer4 connection problem, info: "Connection refused", check duration: 1029ms, status: 8/10 UP.
2024-12-16 17:34:05.383  Health check for server be_default/server1 failed, reason: Layer4 connection problem, info: "Connection refused", check duration: 0ms, status: 8/10 UP.
2024-12-16 17:34:07.055  Health check for server be_default/server2 failed, reason: Layer4 connection problem, info: "Connection refused", check duration: 0ms, status: 7/10 UP.
2024-12-16 17:34:09.431  Health check for server be_default/server1 failed, reason: Layer4 connection problem, info: "Connection refused", check duration: 1047ms, status: 7/10 UP.
2024-12-16 17:34:10.057  Health check for server be_default/server2 failed, reason: Layer4 connection problem, info: "Connection refused", check duration: 0ms, status: 6/10 UP.
2024-12-16 17:34:13.463  Health check for server be_default/server1 failed, reason: Layer4 connection problem, info: "Connection refused", check duration: 1031ms, status: 6/10 UP.
2024-12-16 17:34:14.102  Health check for server be_default/server2 failed, reason: Layer4 connection problem, info: "Connection refused", check duration: 1045ms, status: 5/10 UP.
2024-12-16 17:34:17.102  Health check for server be_default/server2 failed, reason: Layer4 connection problem, info: "Connection refused", check duration: 0ms, status: 4/10 UP.
2024-12-16 17:34:17.495  Health check for server be_default/server1 failed, reason: Layer4 connection problem, info: "Connection refused", check duration: 1031ms, status: 5/10 UP.
2024-12-16 17:34:20.104  Health check for server be_default/server2 failed, reason: Layer4 connection problem, info: "Connection refused", check duration: 0ms, status: 3/10 UP.
2024-12-16 17:34:20.495  Health check for server be_default/server1 failed, reason: Layer4 connection problem, info: "Connection refused", check duration: 0ms, status: 4/10 UP.
2024-12-16 17:34:23.105  Health check for server be_default/server2 failed, reason: Layer4 connection problem, info: "Connection refused", check duration: 0ms, status: 2/10 UP.
2024-12-16 17:34:23.496  Health check for server be_default/server1 failed, reason: Layer4 connection problem, info: "Connection refused", check duration: 0ms, status: 3/10 UP.
2024-12-16 17:34:26.105  Health check for server be_default/server2 failed, reason: Layer4 connection problem, info: "Connection refused", check duration: 0ms, status: 1/10 UP.
2024-12-16 17:34:26.498  Health check for server be_default/server1 failed, reason: Layer4 connection problem, info: "Connection refused", check duration: 0ms, status: 2/10 UP.
2024-12-16 17:34:29.107  Health check for server be_default/server2 failed, reason: Layer4 connection problem, info: "Connection refused", check duration: 0ms, status: 0/5 DOWN.
2024-12-16 17:34:29.107  Server be_default/server2 is DOWN. 1 active and 2 backup servers left. 36 sessions active, 0 requeued, 0 remaining in queue.
2024-12-16 17:34:29.498  Health check for server be_default/server1 failed, reason: Layer4 connection problem, info: "Connection refused", check duration: 0ms, status: 1/10 UP.
2024-12-16 17:34:33.559  Health check for server be_default/server1 failed, reason: Layer4 connection problem, info: "Connection refused", check duration: 1060ms, status: 0/5 DOWN.
2024-12-16 17:34:33.559  Server be_default/server1 is DOWN. 0 active and 2 backup servers left. Running on backup. 37 sessions active, 0 requeued, 0 remaining in queue.

Even after the above log entries appear, all requests still result in a 503 error.

2024-12-16 17:34:33.856  172.16.0.1:40000 [16/Dec/2024:17:34:27.223] fe_all be_default/server2 0/1599/-1/-1/6633 503 217 - - SC-- 1429/1429/1428/10/3 0/44 "POST /app/app HTTP/1.1"
2024-12-16 17:34:33.985  172.16.0.1:47000 [16/Dec/2024:17:34:32.385] fe_all be_default/<NOSRV> 0/1599/-1/-1/1599 503 217 - - sQ-- 1419/1419/1417/0/0 0/20 "POST /app/app HTTP/1.1" 

Is my configuration incorrect?

Supplementary information

  • If I stop the haproxy service while the primary server’s service is stopped, and then start haproxy again after some time, requests flow to the backup servers.
  • If I stop the primary server’s service while no requests are being sent, and then send requests afterwards, they flow to the backup servers.

version: 2.6.13
config: as below

###
### /etc/haproxy/haproxy.cfg
###

# Basic config mapping a listening IP:port to another host's IP:port with
# support for HTTP/1 and 2.

global
    chroot                  /var/lib/haproxy
    log                     127.0.0.1 daemon
    pidfile                 /var/run/haproxy.pid
    user                    haproxy
    group                   haproxy
    quiet
    nbthread                1
    stats socket            /var/lib/haproxy/stats.socket user app group appgrp level admin
    stats timeout           2m


defaults
    mode                    http
    log                     global
    option                  log-health-checks
    option                  httplog
    option                  httpclose
    retries                 3
    timeout http-request    10s
    timeout connect         3s
    timeout client          1m
    timeout http-keep-alive 10s
    timeout check           3s

frontend fe_all
    bind *:8080
    default_backend         be_default
    maxconn                 1000000

backend be_default
    balance                 roundrobin
    option                  httpchk GET /app/healthCheck
    option                  allbackups
    timeout queue           1600ms
    timeout server          1600ms
    server                  server1 server1.net:8080 maxconn 100 check inter 3s fall 10 rise 5
    server                  server2 server2.net:8080 maxconn 100 check inter 3s fall 10 rise 5
    server                  server3 server3.net:8080 maxconn 100 check inter 3s fall 10 rise 5 backup
    server                  server4 server4.net:8080 maxconn 100 check inter 3s fall 10 rise 5 backup

#--EOF--

Are the failing requests ones that were in flight at the moment the primary server was stopped?

When you shut down the service, haproxy does not immediately realize the server is DOWN: with fall set to 10, it takes 10 consecutive failed health checks before haproxy considers the server DOWN.

Thus haproxy may still route requests to servers that are still considered UP. If a request fails, haproxy will retry it on the same server up to 3 times (the default), unless “retries” or “option redispatch” are configured to change this behavior. (i.e. perhaps the server is only temporarily unavailable? By default haproxy tries to stick to the same server for consistency.)

In your case, I invite you to take a look at those 2 options and configure them to adjust haproxy behavior:
https://docs.haproxy.org/dev/configuration.html#4-option%20redispatch
https://docs.haproxy.org/dev/configuration.html#4.2-retries
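
For example, a minimal sketch in your backend (the values are illustrative, not a recommendation for your workload):

backend be_default
    # retry a failed connection up to 3 times
    retries                 3
    # allow a failed connection to be redispatched to another server
    # instead of insisting on the same one
    option                  redispatch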

Also, with the “observe” server keyword you can tell haproxy to adjust the health of a server based on failing production requests, instead of having to wait for all health checks to fail before considering the server DOWN. This can help haproxy react faster.
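
For instance, a sketch on one server line (the observe layer7, error-limit 10 and on-error mark-down values here are illustrative, not tuned for your setup):

    # count layer7 errors on live traffic; after 10 consecutive errors, mark the server DOWN
    server server1 server1.net:8080 maxconn 100 check inter 3s fall 10 rise 5 observe layer7 error-limit 10 on-error mark-down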

Also, if it’s planned maintenance, you can explicitly disable the server in haproxy using the runtime API, so that haproxy immediately stops routing requests to that server.
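
For example, over the stats socket already declared in your global section (socket path taken from your config; socat is one common client):

    # drain the server before maintenance...
    echo "disable server be_default/server1" | socat stdio /var/lib/haproxy/stats.socket
    # ...and re-enable it afterwards
    echo "enable server be_default/server1" | socat stdio /var/lib/haproxy/stats.socket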

Thank you for your reply.

Are the failing requests ones that were in flight at the moment the primary server was stopped?

No, the problem is not the requests that failed at the moment the primary server stopped, but the requests that continued to fail even after the primary server stopped.

After the primary server stopped, the health checks failed 10 times and the primary server was judged DOWN, yet requests continued to fail and never recovered.

I’m sorry for the misunderstanding.

The failing requests after the server is considered DOWN by haproxy seem to be POST requests. If they were initiated before the primary server was stopped, then perhaps they are not retried because haproxy is unable to retry requests that don’t fit in the global buffer:

Relevant doc:
https://docs.haproxy.org/dev/configuration.html#retry-on
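
A sketch of what enabling layer-7 retries could look like (the event list is illustrative; note that a request with a body can only be replayed while the body still fits in the buffer):

backend be_default
    retries                 3
    option                  redispatch
    # retry on connection failures and on empty or timed-out responses
    retry-on                conn-failure empty-response response-timeout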

Can you check if problematic requests are mostly related to POST requests?

I’ll check. Please wait.
However, as I recall, requests executed after the primary server was stopped also failed, so I wonder whether those are also not retried because they are not stored in the global buffer…

I tried sending a GET request, but it received a 503 error as well.
I also confirmed that requests made after the primary server was stopped, as mentioned above, returned 503 errors.

Can we get a peek at the haproxy logs for the failing GET requests?

The log for a failed GET request is as follows:
2025-01-21 17:36:44.806 172.16.0.1:40000 [21/Jan/2025:17:36:41.806] fe_all be_get/<NOSRV> 0/3000/-1/-1/3000 503 217 - - sQ-- 3722/3722/2411/0/0 0/1 "GET /app/get HTTP/1.1"
The backend settings for GET requests are separate; I had omitted them from the earlier description.
The full configuration is shown below.

###
### /etc/haproxy/haproxy.cfg
###

# Basic config mapping a listening IP:port to another host's IP:port with
# support for HTTP/1 and 2.

global
    chroot                  /var/lib/haproxy
    log                     127.0.0.1 daemon
    pidfile                 /var/run/haproxy.pid
    user                    haproxy
    group                   haproxy
    quiet
    nbthread                1
    stats socket            /var/lib/haproxy/stats.socket user app group appgrp level admin
    stats timeout           2m


defaults
    mode                    http
    log                     global
    option                  log-health-checks
    option                  httplog
    option                  httpclose
    retries                 3
    timeout http-request    10s
    timeout connect         3s
    timeout client          1m
    timeout http-keep-alive 10s
    timeout check           3s

frontend fe_all
    bind *:8080
    default_backend         be_default
    maxconn                 1000000
    acl                     acl_get path_beg /app/get
    use_backend             be_get if acl_get

backend be_default
    balance                 roundrobin
    option                  httpchk GET /app/healthCheck
    option                  allbackups
    timeout queue           1600ms
    timeout server          1600ms
    server                  server1 server1.net:8080 maxconn 100 check inter 3s fall 10 rise 5
    server                  server2 server2.net:8080 maxconn 100 check inter 3s fall 10 rise 5
    server                  server3 server3.net:8080 maxconn 100 check inter 3s fall 10 rise 5 backup
    server                  server4 server4.net:8080 maxconn 100 check inter 3s fall 10 rise 5 backup

backend be_get
    balance                 roundrobin
    option                  httpchk GET /app/healthCheck
    option                  allbackups
    timeout queue           3000ms
    timeout server          3000ms
    server                  server1 server1.net:8080 maxconn 200 check inter 3s fall 10 rise 5
    server                  server2 server2.net:8080 maxconn 200 check inter 3s fall 10 rise 5
    server                  server3 server3.net:8080 maxconn 200 check inter 3s fall 10 rise 5 backup
    server                  server4 server4.net:8080 maxconn 200 check inter 3s fall 10 rise 5 backup

#--EOF--

Here are some additional findings:

  • If the request rate is low, requests flow to the backup servers when I stop the primary servers.
  • If I make timeout queue short enough, requests flow to the backup servers when I stop the primary servers (see the sketch after this list).
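
For reference, “short enough” here means something like the following (100ms is an illustrative value, not the exact threshold):

backend be_default
    # fail queued requests fast instead of letting them wait out the full queue timeout
    timeout queue           100ms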

Did you try with “option redispatch” set on the backend?

Also, you may want to try the maxqueue server keyword, because by default the server queue is unlimited. Since you set maxconn to 200, requests may stack up while the server is going DOWN, until it is really considered DOWN by haproxy.
See: https://docs.haproxy.org/dev/configuration.html#5.2-maxqueue
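
A sketch of one server line with a bounded queue (maxqueue 10 is an illustrative value):

    # cap the per-server queue so excess requests are redispatched instead of piling up
    server server1 server1.net:8080 maxconn 200 maxqueue 10 check inter 3s fall 10 rise 5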

  • option redispatch
    I tried this, but it did not solve the issue. I tried retries 1 and 3.

  • maxqueue
    I tried this, but it did not solve the issue.
    I tried sizes of (be_default, be_get)=(5, 10), (100, 200).