We use HAProxy to balance load across hundreds of servers, but we have run into an issue that is new and unexpected for us (it reproduces on every HAProxy version we tried). When the number of servers in a backend exceeds 100, the weights of all active servers are set to zero (they are marked as SOFT STOPPED) and the backend goes down. The only way to bring the backend UP again is to reduce the server count to 100 or fewer. We may need to run hundreds of servers under one backend and distribute load across them. How can we raise this limit?
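For reference, this is how we confirm the zero weights via the runtime API's "show servers state" (a sketch; it assumes the admin socket at /var/run/haproxy.sock and a backend named "host", and the two data rows in the here-doc are illustrative stand-ins, not captured output — column positions follow the state-file format, where srv_name is the 4th field and srv_uweight the 8th):

```shell
# Sketch: pull per-server weights from the runtime API. Live usage would be:
#   echo "show servers state host" | socat stdio /var/run/haproxy.sock | parse_weights
# The dump starts with a format-version line ("1") and a "#" header line,
# which the awk filter skips.
parse_weights() {
    awk '!/^(1$|#)/ && NF { printf "%s weight=%s\n", $4, $8 }'
}
# Illustrative stand-in for real output (not captured from our cluster):
parse_weights <<'EOF'
1
# be_id be_name srv_id srv_name srv_addr srv_op_state srv_admin_state srv_uweight srv_iweight
3 host 1 srv1 10.1.0.5 2 0 0 1
3 host 2 srv2 10.1.0.6 2 0 0 1
EOF
# prints: srv1 weight=0 / srv2 weight=0
```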
Share the output of haproxy -vv, the configuration, and explain how you add/remove servers here (are you adjusting the configuration and reloading HAProxy, are you using DNS discovery, etc.).
Okay, so the output of haproxy -vv:
HA-Proxy version 2.2.1 2020/07/23 - https://haproxy.org/
Status: long-term supported branch - will stop receiving fixes around Q2 2025.
Known bugs: http://www.haproxy.org/bugs/bugs-2.2.1.html
Running on: Linux 4.9.0-11-amd64 #1 SMP Debian 4.9.189-3+deb9u2 (2019-11-11) x86_64
Build options :
TARGET = linux-glibc
CPU = generic
CC = gcc
CFLAGS = -O2 -g -Wall -Wextra -Wdeclaration-after-statement -fwrapv -Wno-unused-label -Wno-sign-compare -Wno-unused-parameter -Wno-clobbered -Wno-missing-field-initializers -Wno-stringop-overflow -Wno-cast-function-type -Wtype-limits -Wshift-negative-value -Wshift-overflow=2 -Wduplicated-cond -Wnull-dereference
OPTIONS = USE_PCRE2=1 USE_PCRE2_JIT=1 USE_GETADDRINFO=1 USE_OPENSSL=1 USE_LUA=1 USE_ZLIB=1
Feature list : +EPOLL -KQUEUE +NETFILTER -PCRE -PCRE_JIT +PCRE2 +PCRE2_JIT +POLL -PRIVATE_CACHE +THREAD -PTHREAD_PSHARED +BACKTRACE -STATIC_PCRE -STATIC_PCRE2 +TPROXY +LINUX_TPROXY +LINUX_SPLICE +LIBCRYPT +CRYPT_H +GETADDRINFO +OPENSSL +LUA +FUTEX +ACCEPT4 +ZLIB -SLZ +CPU_AFFINITY +TFO +NS +DL +RT -DEVICEATLAS -51DEGREES -WURFL -SYSTEMD -OBSOLETE_LINKER +PRCTL +THREAD_DUMP -EVPORTS
Default settings :
bufsize = 16384, maxrewrite = 1024, maxpollevents = 200
Built with multi-threading support (MAX_THREADS=64, default=4).
Built with OpenSSL version : OpenSSL 1.1.1d 10 Sep 2019
Running on OpenSSL version : OpenSSL 1.1.1d 10 Sep 2019
OpenSSL library supports TLS extensions : yes
OpenSSL library supports SNI : yes
OpenSSL library supports : TLSv1.0 TLSv1.1 TLSv1.2 TLSv1.3
Built with Lua version : Lua 5.3.3
Built with network namespace support.
Built with zlib version : 1.2.11
Running on zlib version : 1.2.11
Compression algorithms supported : identity("identity"), deflate("deflate"), raw-deflate("deflate"), gzip("gzip")
Built with transparent proxy support using: IP_TRANSPARENT IPV6_TRANSPARENT IP_FREEBIND
Built with PCRE2 version : 10.32 2018-09-10
PCRE2 library supports JIT : yes
Encrypted password support via crypt(3): yes
Built with gcc compiler version 8.3.0
Built with the Prometheus exporter as a service
Available polling systems :
epoll : pref=300, test result OK
poll : pref=200, test result OK
select : pref=150, test result OK
Total: 3 (3 usable), will use epoll.
Available multiplexer protocols :
(protocols marked as <default> cannot be specified using 'proto' keyword)
fcgi : mode=HTTP side=BE mux=FCGI
<default> : mode=HTTP side=FE|BE mux=H1
h2 : mode=HTTP side=FE|BE mux=H2
<default> : mode=TCP side=FE|BE mux=PASS
Available services :
prometheus-exporter
Available filters :
[SPOE] spoe
[COMP] compression
[TRACE] trace
[CACHE] cache
[FCGI] fcgi-app
Here is an example configuration:
global
maxconn 100000
stats socket /var/run/haproxy.sock mode 660 level admin
stats timeout 30s
debug
defaults
log global
mode http
option httplog
option dontlognull
option forwardfor
timeout connect 5000
timeout client 50000000
timeout server 50000000
resolvers localdns
nameserver dns1 <ip>:53
accepted_payload_size 8192
hold timeout 600s
hold refused 600s
frontend http-in
bind *:8080 accept-proxy
mode http
redirect scheme https code 301 if !{ ssl_fc }
frontend https-in
bind *:8433 accept-proxy
mode http
# Define hosts
acl host_acl hdr(host) -i <address>
# Figure out which one to use
use_backend host if host_acl
backend host
http-request set-header X-Real-IP %[src]
http-request set-header X-Forwarded-For %[src]
http-request set-header X-Forwarded-Proto %[src]
http-request set-header Connection "upgrade"
http-request set-header Host %[src]
balance leastconn
stick-table type string len 80 size 1m expire 8h
option tcp-check
stick on url_param(mrid)
server-template srv 100 _http._tcp.<pod>.<ns>.svc.cluster.local resolvers localdns check inter 1000
As you can see, we are using DNS discovery.
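When discovery misbehaves, we also look at the raw SRV answer that HAProxy's resolver would receive. A sketch (the resolver IP 10.96.0.10 is a typical kube-dns ClusterIP and an assumption here; <pod>/<ns> are the placeholders from the config above):

```shell
# Query the same SRV name HAProxy resolves and check the reported answer
# size and flags. Compare UDP (subject to payload-size limits) with TCP
# (no such limit); dig reports the size on its ";; MSG SIZE  rcvd:" line.
dig @10.96.0.10 +notcp +bufsize=8192 SRV _http._tcp.<pod>.<ns>.svc.cluster.local \
    | grep -E 'flags|MSG SIZE'
dig @10.96.0.10 +tcp SRV _http._tcp.<pod>.<ns>.svc.cluster.local \
    | grep 'MSG SIZE'
```

A tc (truncated) flag in the UDP answer, or a TCP answer size well above 8192 bytes, would mean the full server list does not fit in one UDP response at the configured accepted_payload_size.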
Any logging happening at the same time? Could it be related to the DNS response hitting the accepted_payload_size threshold?
No valuable logs. Will try to debug further today. Servers are being updated even if we have 500 of them, but with 0 weight. So the problem is definitely not the DNS payload size; moreover, 8k is the maximum here.
What usually happens when a new server (1-100) is added:
[WARNING] 212/080533 (7) : Server host/srv48 is UP, reason: Layer4 check passed, check duration: 0ms. 48 active and 0 backup servers online. 0 sessions requeued, 0 total in queue.
Here is what happened when I added server number 101:
[ALERT] 212/080658 (7) : backend 'host' has no server available!
[WARNING] 212/080658 (7) : host/srv101 changed its IP from to <IP> by DNS additional record.
[WARNING] 212/080700 (7) : Server host/srv101 is UP, reason: Layer4 check passed, check duration: 0ms. 0 active and 0 backup servers online. 0 sessions requeued, 0 total in queue.
After that, the worker dies within a minute:
[ALERT] 212/080908 (1) : Current worker #1 (7) exited with code 139 (Segmentation fault)
8192 bytes, not servers, and with these numbers it's not completely unlikely that we hit it. Crossing the threshold doesn't necessarily mean that no servers are updated, especially if we hit a bug.
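A rough back-of-envelope supports this. The per-record byte counts below are assumptions (compressed owner names, a ~50-byte SRV target, one additional A record per server), not measurements:

```shell
# Estimate the SRV response size for n servers. All byte counts are rough
# assumptions: ~70 bytes per SRV record (compressed owner name + fixed RR
# fields + priority/weight/port + target name) and ~16 bytes per additional
# A record, plus ~60 bytes of header and question section.
srv_rr=70
a_rr=16
header=60
for n in 100 101 500; do
    echo "servers=$n est_bytes=$(( header + n * (srv_rr + a_rr) ))"
done
# With these assumptions the 8192-byte accepted_payload_size ceiling is
# crossed somewhere in the mid-90s of servers, i.e. right around the
# observed ~100-server limit.
```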
An HAProxy worker is crashing here, so this is clearly a bug.
Could you file a bug?
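For the report, a core dump and backtrace from the crashing worker would be the most useful attachment. A sketch, assuming a Debian-style install (the binary path, core location, and <pid> placeholder are assumptions; note this build has +BACKTRACE, so a trace may also land in the log at crash time):

```shell
# Allow core dumps and give them a predictable location (requires root;
# with systemd, set LimitCORE=infinity on the haproxy unit instead).
ulimit -c unlimited
echo '/tmp/core.%e.%p' > /proc/sys/kernel/core_pattern

# After the next segfault, extract a full backtrace for the bug report.
gdb -batch -ex 'thread apply all bt full' /usr/sbin/haproxy /tmp/core.haproxy.<pid>
```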