Hi, we’ve observed an increase in backend connect time on all backends under the same frontend as soon as one starts to queue.
Just wondering if this is an expected behaviour or our configuration would benefit from some meaningful tuning?
For this example, bck1 starts to queue up to 10k+ requests on traffic spikes which is not an issue here but we can see at the time that other backends like bck2 (no queue, same load as before) starts to have larger backend connect times which then has a repercussion of processing less traffic (waiting for requests). Latency can spike from nominal 10ms up to 3s…
Some examples of the current configuration:
We are using [HAProxy version 1.8.12-8a200c7, released 2018/06/27]. Running on 1 process and thread.
pid = 14332 (process #1, nbproc = 1, nbthread = 1)
system limits: memmax = unlimited; ulimit-n = 200039
maxsock = 200039; maxconn = 100000; maxpipes = 0
Total number of connections established are way under the set limits (25k/req at max) and queue gets to (10-15k)
defaults
mode http
timeout client 10s
timeout server 1800s
timeout connect 10s
timeout queue 60s
timeout check 1s
load-server-state-from-file global
option forwardfor
no option forceclose
no option tcp-smart-accept
option http-no-delay
log global
log 127.0.0.1 len 5120 local0
option dontlognull
balance leastconn
frontend api
bind :80
bind :443 ssl crt ./
maxconn 100000
option http-buffer-request
…
backend bck1
fullconn 30000
option httpchk HEAD /rmi/status
http-check disable-on-404
server : check slowstart 20s maxconn 20
backend bck2
fullconn 30000
option httpchk HEAD /rmi/status
http-check disable-on-404
server : check slowstart 20s maxconn 500
server : check slowstart 20s maxconn 500
server : check slowstart 20s maxconn 500
Example of metrics from stress tests (no such huge spike as on production traffic but notable)
CPU busy is 100% on queue load on the haproxy servers itself and memory load jumps by ~400MB but there is still room to spare on the instances. Do note, CPU usage on production did not get above 30% on any node
Here are the metrics from the production traffic where you can clearly see in red when the queuing occurs, backend connect latency spikes on all backends. So is this an expected behavior or we can further tune the configuration to try and mitigate this? All of this traffic is http/https, haproxies are behind keepalived instances running least connection algorithm. Haven’t seen any difference using round-robin algo.
Thanks and any help appreciated!
EDIT:
adding verbose version output (all are haproxies are built on the same hardware with same build flags)
HA-Proxy version 1.8.12-8a200c7 2018/06/27
Copyright 2000-2018 Willy Tarreau willy@haproxy.orgBuild options :
TARGET = linux2628
CPU = generic
CC = gcc
CFLAGS = -O2 -g -fno-strict-aliasing -Wdeclaration-after-statement -fwrapv -fno-strict-overflow -Wno-unused-label
OPTIONS = USE_LINUX_TPROXY=1 USE_ZLIB=1 USE_REGPARM=1 USE_OPENSSL=1 USE_SYSTEMD=1 USE_PCRE=1Default settings :
maxconn = 2000, bufsize = 16384, maxrewrite = 1024, maxpollevents = 200Built with OpenSSL version : OpenSSL 1.0.2k-fips 26 Jan 2017
Running on OpenSSL version : OpenSSL 1.0.2k-fips 26 Jan 2017
OpenSSL library supports TLS extensions : yes
OpenSSL library supports SNI : yes
OpenSSL library supports : SSLv3 TLSv1.0 TLSv1.1 TLSv1.2
Built with transparent proxy support using: IP_TRANSPARENT IPV6_TRANSPARENT IP_FREEBIND
Encrypted password support via crypt(3): yes
Built with multi-threading support.
Built with PCRE version : 8.32 2012-11-30
Running on PCRE version : 8.32 2012-11-30
PCRE library supports JIT : no (USE_PCRE_JIT not set)
Built with zlib version : 1.2.7
Running on zlib version : 1.2.7
Compression algorithms supported : identity(“identity”), deflate(“deflate”), raw-deflate(“deflate”), gzip(“gzip”)
Built with network namespace support.Available polling systems :
epoll : pref=300, test result OK
poll : pref=200, test result OK
select : pref=150, test result OK
Total: 3 (3 usable), will use epoll.Available filters :
[SPOE] spoe
[COMP] compression
[TRACE] trace