Hi Willy & team - first off thank you for your amazing software - it’s been a life-saver.
Environment: We have a small cluster of HAProxy servers with approximately 15k SSL certificates loaded. As certificates are added and removed, HAProxy is gracefully reloaded with the FINISH signal. This happens approximately 100 times a day and has worked perfectly across 1.6/1.7 and now 1.8.
Issue: we’ve recently enabled HTTP/2 on these HAProxy servers. HTTP/2 support has been great, but since enabling it we’ve seen a small fraction of HAProxy processes never exit despite receiving a FINISH signal. The leftover processes accumulate and slowly lead to memory exhaustion on the HAProxy servers.
Upon examination of the wedged processes, they always have 1 or more external sockets in CLOSE_WAIT:
tcp 26686 0 62.22.188.41:443 69.123.177.216:59710 CLOSE_WAIT 2335/haproxy
udp 0 0 0.0.0.0:49277 0.0.0.0:* 2335/haproxy
Another one (different server):
tcp 85 0 62.22.188.40:443 49.204.95.150:53001 CLOSE_WAIT 12032/haproxy
tcp 43 0 62.22.188.40:443 43.248.55.131:56715 CLOSE_WAIT 12032/haproxy
udp 0 0 0.0.0.0:30855 0.0.0.0:* 12032/haproxy
We are using ‘nbproc’ - and the wedged processes seem to often be the ‘head’ process (hence udp binding), but this is not always the case:
tcp 841 0 62.22.188.41:443 132.170.15.255:42642 CLOSE_WAIT 18760/haproxy
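For anyone wanting to track this over time, the netstat lines above can be grouped by PID to see how many CLOSE_WAIT sockets each old process is holding. This is only a minimal sketch in Python, assuming the `netstat -anp`-style column layout shown above (proto, Recv-Q, Send-Q, local, remote, state, PID/program). The function name `close_wait_by_pid` is made up for illustration and is not part of any tool.

```python
from collections import defaultdict


def close_wait_by_pid(netstat_lines, program="haproxy"):
    """Group CLOSE_WAIT TCP sockets by owning PID.

    Assumes netstat -anp style columns:
    proto recv-q send-q local remote state pid/program
    Returns {pid: [(remote_address, recv_q), ...]}.
    """
    stuck = defaultdict(list)
    for line in netstat_lines:
        fields = line.split()
        # UDP lines have no state column (6 fields), so they are skipped,
        # as are header lines and anything that is not TCP.
        if len(fields) != 7 or not fields[0].startswith("tcp"):
            continue
        _proto, recv_q, _send_q, _local, remote, state, owner = fields
        pid, _, name = owner.partition("/")
        if state == "CLOSE_WAIT" and name == program and pid.isdigit():
            stuck[int(pid)].append((remote, int(recv_q)))
    return dict(stuck)


if __name__ == "__main__":
    # Sample lines copied from the output above.
    sample = [
        "tcp 85 0 62.22.188.40:443 49.204.95.150:53001 CLOSE_WAIT 12032/haproxy",
        "tcp 43 0 62.22.188.40:443 43.248.55.131:56715 CLOSE_WAIT 12032/haproxy",
        "udp 0 0 0.0.0.0:30855 0.0.0.0:* 12032/haproxy",
    ]
    for pid, sockets in close_wait_by_pid(sample).items():
        print(pid, sockets)
```

The second tuple element is the Recv-Q value, i.e. bytes received by the kernel that the process has not read yet, which is non-zero on every stuck socket shown here.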
Running strace on the wedged processes just shows a slow epoll_wait loop that never returns any events:
epoll_wait(3, [], 200, 0) = 0
epoll_wait(3, [], 200, 34) = 0
epoll_wait(3, [], 200, 0) = 0
epoll_wait(3, [], 200, 51) = 0
epoll_wait(3, [], 200, 0) = 0
epoll_wait(3, [], 200, 60) = 0
epoll_wait(3, [], 200, 0) = 0
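To check a longer strace capture for the same pattern (every call coming back with zero events), the lines can be summarised. This is a rough sketch, assuming the exact `epoll_wait(fd, [...], maxevents, timeout) = ret` line format shown above. `summarize_epoll_wait` is a hypothetical helper name.

```python
import re

# Matches lines such as: epoll_wait(3, [], 200, 34) = 0
EPOLL_RE = re.compile(
    r"epoll_wait\((\d+),\s*(\[.*\]),\s*(\d+),\s*(-?\d+)\)\s*=\s*(-?\d+)"
)


def summarize_epoll_wait(strace_lines):
    """Count epoll_wait calls, how many returned events, and the summed timeouts."""
    calls = 0
    with_events = 0
    total_timeout_ms = 0
    for line in strace_lines:
        match = EPOLL_RE.search(line)
        if not match:
            continue  # not an epoll_wait line
        calls += 1
        timeout_ms = int(match.group(4))
        returned = int(match.group(5))
        if returned > 0:
            with_events += 1
        if timeout_ms > 0:
            total_timeout_ms += timeout_ms
    return {
        "calls": calls,
        "with_events": with_events,
        "total_timeout_ms": total_timeout_ms,
    }


if __name__ == "__main__":
    # Sample lines copied from the trace above.
    trace = [
        "epoll_wait(3, [], 200, 0) = 0",
        "epoll_wait(3, [], 200, 34) = 0",
        "epoll_wait(3, [], 200, 0) = 0",
        "epoll_wait(3, [], 200, 51) = 0",
    ]
    print(summarize_epoll_wait(trace))
```

If `with_events` stays at 0 over a long capture, the process is only waking up on timers, and nothing is ever reported on the CLOSE_WAIT sockets.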
Our configuration is very straightforward:
global
user haproxy
group haproxy
daemon
maxconn 21000
tune.ssl.default-dh-param 2048
tune.ssl.cachesize 1000000
tune.maxrewrite 16384
tune.bufsize 49152
nbproc 4
cpu-map 1 0
cpu-map 2 1
cpu-map 3 2
cpu-map 4 3
defaults
mode http
retries 5
option redispatch
maxconn 20000
timeout connect 30s
timeout client 30s
timeout server 14400s
timeout http-keep-alive 5s
option httplog
option dontlog-normal
option http-ignore-probes
log _ipaddr_ local3
option httpchk GET /admStatus/si/2
http-check expect status 200
option forwardfor
option http-keep-alive
The frontend just has a bind, set-header, and a default_backend.
Build options:
HA-Proxy version 1.8.8 2018/04/19
Copyright 2000-2018 Willy Tarreau <willy@haproxy.org>
Build options :
TARGET = linux2628
CPU = generic
CC = x86_64-pc-linux-gnu-gcc
CFLAGS = -O2 -march=native -pipe -fno-strict-aliasing
OPTIONS = USE_LIBCRYPT=1 USE_GETADDRINFO=1 USE_ZLIB=1 USE_THREAD=1 USE_OPENSSL=1 USE_PCRE=1 USE_TFO=1
Default settings :
maxconn = 2000, bufsize = 16384, maxrewrite = 1024, maxpollevents = 200
Built with OpenSSL version : OpenSSL 1.0.2o 27 Mar 2018
Running on OpenSSL version : OpenSSL 1.0.2o 27 Mar 2018
OpenSSL library supports TLS extensions : yes
OpenSSL library supports SNI : yes
OpenSSL library supports : SSLv3 TLSv1.0 TLSv1.1 TLSv1.2
Built with transparent proxy support using: IP_TRANSPARENT IPV6_TRANSPARENT IP_FREEBIND
Encrypted password support via crypt(3): yes
Built with multi-threading support.
Built with PCRE version : 8.40 2017-01-11
Running on PCRE version : 8.41 2017-07-05
PCRE library supports JIT : no (USE_PCRE_JIT not set)
Built with zlib version : 1.2.11
Running on zlib version : 1.2.11
Compression algorithms supported : identity("identity"), deflate("deflate"), raw-deflate("deflate"), gzip("gzip")
Built with network namespace support.
Available polling systems :
epoll : pref=300, test result OK
poll : pref=200, test result OK
select : pref=150, test result OK
Total: 3 (3 usable), will use epoll.
Available filters :
[SPOE] spoe
[COMP] compression
[TRACE] trace
This is running on a vanilla Linux 4.9.6 kernel.
I confirmed that disabling HTTP/2 in both 1.8.7 and 1.8.8 makes the issue go away. Is there anything else I should look at, or could this be a bug? Thanks much!