Healthchecks broken? 1.8.8


#1

Hello,

We are having trouble with HAProxy 1.8.8: the backend health checks do not seem to work correctly.

As soon as one backend server is shut down (in this case data-3-*), one of its backends often stays in the yellow state forever (it even shows a negative check count). The service on all 10.0.1.2 backends is the same application, with multiple listeners (ports 80, 7081, 7082, …).

In this example, HAProxy is stuck and does not take the data-3-8 backend offline.

stats:

Port ranges on the system are:

net.ipv4.ip_local_port_range = 10000    65534
net.ipv4.ip_local_reserved_ports = 1-10000

haproxy.cfg:

global
            daemon
            log-send-hostname
            log log.domain.net local0 info alert
            log log.domain.net local1 notice alert
            # SSL
            ca-base /etc/ssl/certs
            crt-base /etc/ssl/private
            ssl-default-bind-ciphers ECDH+AESGCM:DH+AESGCM:ECDHE-RSA-AES256-GCM-SHA384:ECDH+AESGCM:DH+AESGCM:ECDH+AES256:DH+AES256:ECDH+AES128:DH+AES:ECDH+3DES:DH+3DES:RSA+AESGCM:RSA+AES:RSA+3DES:!aNULL:!MD5:!DSS
            ssl-default-bind-options no-sslv3 no-tlsv10 no-tlsv11
            
            ssl-default-server-ciphers ECDH+AESGCM:DH+AESGCM:ECDHE-RSA-AES256-GCM-SHA384:ECDH+AESGCM:DH+AESGCM:ECDH+AES256:DH+AES256:ECDH+AES128:DH+AES:ECDH+3DES:DH+3DES:RSA+AESGCM:RSA+AES:RSA+3DES:!aNULL:!MD5:!DSS
            ssl-default-server-options no-sslv3 no-tlsv10 no-tlsv11
            
            stats socket /p/admin.sock mode 777 level admin expose-fd listeners
            
            maxconn 700000
            tune.ssl.default-dh-param 2048
            tune.ssl.cachesize 400000
            tune.ssl.lifetime 1200
            tune.bufsize 32768
            ulimit-n 900000
            
            nbproc 1
            nbthread 36
            cpu-map 1/1-6 0-5
            cpu-map 1/7-12 6-11
            cpu-map 1/13-18 12-17
            cpu-map 1/19-24 24-29
            cpu-map 1/25-30 30-35
            cpu-map 1/31-36 36-41
            
            ssl-engine rdrand  # intel engine available in openssl
            ssl-mode-async
    defaults
            log global                
            fullconn 4000000
            maxconn 700000
            mode    http
            option srvtcpka
            option tcpka
            option http-keep-alive
            option prefer-last-server
            option log-separate-errors
            option dontlognull
            max-keep-alive-queue 16000
            timeout connect 5s
            timeout client  300s
            timeout server  300s
            timeout queue   5s
            timeout tunnel  1h
            default-server inter 5s fall 3 rise 2 maxconn 40000

    frontend www-web
            bind WEB_IP:80 #allow-0rtt tfo alpn h2,http/1.1
            mode http
            timeout client 300s
            
            redirect scheme https
            
    frontend wwws-web
            bind WEB_IP:443 ssl crt /etc/haproxy/server.pem allow-0rtt tfo alpn h2,http/1.1
            bind *:444 ssl crt /etc/haproxy/server.pem allow-0rtt tfo alpn h2,http/1.1
            bind *:445 ssl crt /etc/haproxy/server.pem allow-0rtt tfo alpn h2,http/1.1
            bind *:446 ssl crt /etc/haproxy/server.pem allow-0rtt tfo alpn h2,http/1.1
            bind *:447 ssl crt /etc/haproxy/server.pem allow-0rtt tfo alpn h2,http/1.1
            bind *:448 ssl crt /etc/haproxy/server.pem allow-0rtt tfo alpn h2,http/1.1
            bind *:449 ssl crt /etc/haproxy/server.pem allow-0rtt tfo alpn h2,http/1.1
            bind *:450 ssl crt /etc/haproxy/server.pem allow-0rtt tfo alpn h2,http/1.1
            bind *:451 ssl crt /etc/haproxy/server.pem allow-0rtt tfo alpn h2,http/1.1
            bind *:452 ssl crt /etc/haproxy/server.pem allow-0rtt tfo alpn h2,http/1.1
            bind *:453 ssl crt /etc/haproxy/server.pem allow-0rtt tfo alpn h2,http/1.1
            bind WEB_IP:944 ssl crt /etc/haproxy/server.pem allow-0rtt tfo alpn h2,http/1.1
            timeout client 1h
            
            acl host_api hdr(host) -i api.domain.com
            acl host_dyn hdr(host) -i dyn.domain.com
            acl host_ws hdr(host) -i ws.domain.com
            acl host_push hdr(host) -i push.domain.com
                    
            acl is_websocket_star hdr(Upgrade) -i WebSocket
            
            acl host_lbstats1 hdr(host) -i lbstats1.domain.com
            use_backend websocket-rest-push-backend if host_push
            
            #wss://*.domain.com
            use_backend websocket-rest-backend if is_websocket_star
            
            use_backend www-rest-backend if host_dyn
            use_backend www-rest-backend if host_api
            use_backend www-rest-backend if host_ws
            default_backend www-web-backend
    backend www-rest-backend
            balance roundrobin
            mode http
            option forwardfor
            option allbackups
            
            reqadd x-forwarded-proto:\ https
            option httpchk get /isOnline?type=rest
            
            server data2 10.0.1.1:80 check verify none maxconn 1000
            server data3 10.0.1.2:80 check verify none maxconn 1000
            
    backend websocket-rest-push-backend
            balance leastconn
            mode http
            option forwardfor
            option http-server-close
            option forceclose
            option allbackups
            no option httpclose
            option httpchk get /isOnline?type=ws
            fullconn 4000000
            
            server data2-1 10.0.1.1:80 check verify none
            server data2-2 10.0.1.1:7081 check verify none
            server data2-3 10.0.1.1:7082 check verify none
            server data2-4 10.0.1.1:7083 check verify none
            server data2-5 10.0.1.1:7084 check verify none
            server data2-6 10.0.1.1:7085 check verify none
            server data2-7 10.0.1.1:7086 check verify none
            server data2-8 10.0.1.1:7087 check verify none
            server data2-9 10.0.1.1:7088 check verify none
            server data2-10 10.0.1.1:7089 check verify none
            
            server data3-1 10.0.1.2:80 check verify none
            server data3-2 10.0.1.2:7081 check verify none
            server data3-3 10.0.1.2:7082 check verify none
            server data3-4 10.0.1.2:7083 check verify none
            server data3-5 10.0.1.2:7084 check verify none
            server data3-6 10.0.1.2:7085 check verify none
            server data3-7 10.0.1.2:7086 check verify none
            server data3-8 10.0.1.2:7087 check verify none
            server data3-9 10.0.1.2:7088 check verify none
            server data3-10 10.0.1.2:7089 check verify none
            
    backend websocket-rest-backend
            balance leastconn
            mode http
            option forwardfor
            option http-server-close
            option forceclose
            option allbackups
            no option httpclose
            option httpchk get /isOnline?type=ws
            server data1-1 10.0.1.6:80 check backup verify none
            
            server data2-1 10.0.1.1:80 check verify none
            server data2-2 10.0.1.1:7081 check verify none
            server data2-3 10.0.1.1:7082 check verify none
            server data2-4 10.0.1.1:7083 check verify none
            server data2-5 10.0.1.1:7084 check verify none
            server data2-6 10.0.1.1:7085 check verify none
            server data2-7 10.0.1.1:7086 check verify none
            server data2-8 10.0.1.1:7087 check verify none
            server data2-9 10.0.1.1:7088 check verify none
            server data2-10 10.0.1.1:7089 check verify none
            
            server data3-1 10.0.1.2:80 check verify none
            server data3-2 10.0.1.2:7081 check verify none
            server data3-3 10.0.1.2:7082 check verify none
            server data3-4 10.0.1.2:7083 check verify none
            server data3-5 10.0.1.2:7084 check verify none
            server data3-6 10.0.1.2:7085 check verify none
            server data3-7 10.0.1.2:7086 check verify none
            server data3-8 10.0.1.2:7087 check verify none
            server data3-9 10.0.1.2:7088 check verify none
            server data3-10 10.0.1.2:7089 check verify none
            
    listen stats1
        bind STATS_IP:8001 ssl crt /etc/haproxy/server.pem
        mode http
        stats enable
        stats refresh 7s
        stats realm Haproxy\ Statistics
        stats uri /haproxy?stats
        stats auth xxx:yyy
        default_backend www-web-backend

#2

I assume you only have this issue now, what release did you run previously that didn’t have this issue?


#3

Hey,

Yes, this issue is new. Before, we ran 1.7.9 in multi-process mode without any problems. We can easily reproduce it in our environment; it happens basically every time. If you need any more information about our setup, please let me know.


#4

Could you try 1.8.8 in multi-process mode instead of threads? I would like to confirm that this is caused by nbthread.


#5

We evaluated 1.8.8 today in multi-process mode (nbproc 36) and were unable to reproduce the health check problem. So, as far as we can tell, it is caused by nbthread.
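
For reference, a minimal sketch of the process settings for such a multi-process test, replacing the `nbthread` and per-thread `cpu-map` lines of the original `global` section. Only `nbproc 36` comes from this thread; the `cpu-map` line is an assumption that mirrors the CPU ranges of the original configuration and can be dropped or adjusted:

```
global
    # multi-process mode: 36 single-threaded processes, no nbthread
    nbproc 36
    # assumption: pin each process to one CPU, reusing the CPU ranges
    # from the original per-thread mapping (0-17 and 24-41)
    cpu-map auto:1-36 0-17 24-41
```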


#6

Hi @scratchy,

I sent a patch to fix a bug in checks: https://www.mail-archive.com/haproxy@formilux.org/msg29859.html

It is not related to threads, but the symptoms are quite similar. Maybe the bug is just easier to hit with many threads. Could you test whether this patch fixes your problem? If not, I will look more carefully to understand what happens when many threads are running.


#7

Well, in fact, the previous patch will not fix your bug. I re-checked and finally reproduced it. It really is related to threads: it only happens with more than 32 threads.

Here is the patch: https://www.mail-archive.com/haproxy@formilux.org/msg29864.html
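
Since the problem only shows up with more than 32 threads, a possible stopgap until a release containing the patch is available would be to stay in threaded mode but cap `nbthread` at 32. This is a minimal sketch, not something tested in this thread; the `cpu-map` line is an assumption and should follow the real CPU topology:

```
global
    nbproc 1
    # stay at or below 32 threads, where the problem was not observed
    nbthread 32
    # assumption: pin the 32 threads one-to-one on CPUs 0-31
    cpu-map auto:1/1-32 0-31
```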


#8

Good Catch! :slight_smile:

Thanks a lot. We will wait for 1.8.9 before switching back to threads.