Reloads of haproxy 1.8.3 under SystemD cause "finishing" children to stack-up

rh20i · January 11, 2018, 9:12am

Hi everyone

I work for a large-ish UK web hosting company and we’re slowly introducing HTTP/2 now that it’s in haproxy 1.8.

We’re using haproxy 1.8.3 under SystemD on CentOS 7.4 (3.10.0-693.el7.x86_64) KVM virtual machines, built with the following make line:

%{__make} CPU="generic" TARGET=linux2628 USE_ZLIB=1 USE_REGPARM=1 USE_OPENSSL=1 USE_SYSTEMD=1 USE_PCRE=1 USE_PCRE_JIT=1

HTTP/2 is enabled for both of our frontend’s IPv4 and IPv6 “bind” lines:

frontend httpexternal
    bind *:80
    bind *:443 ssl crt /etc/haproxy/stackcerts/www.stackssl.com.pem ssl crt /etc/haproxy/certs/ alpn h2,http/1.1
    bind :::80
    bind :::443 ssl crt /etc/haproxy/stackcerts/www.stackssl.com.pem ssl crt /etc/haproxy/certs/ alpn h2,http/1.1

We have observed that after haproxy reloads the “finishing” PID never actually finishes, despite lsof showing that it has no active TCP connections.

If we disable HTTP/2 completely (IPv4 and IPv6) the problem goes away, and the finishing PIDs do indeed go away when they have no more connections.

After a lot of lsof’ing, we noticed that a child we expected to have finished had this line in its lsof output:

haproxy 26670 haproxy  429u     sock      0,7      0t0 31631114 protocol: TCPv6

but no actual IPv6 connections. We see the same behavior for IPv4 “finishing” processes that never go away. The socket has a high-numbered file descriptor, which makes me think it came from a connection that was being used to serve HTTP requests.

The same behaviour occurs if we’re in single-threaded or multi-threaded mode (modified through config, not recompiling).
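For anyone trying to reproduce this, a minimal sketch of how the threading mode can be switched from the configuration in 1.8. It assumes the nbthread directive in the global section is what was changed here (the post does not say), and the value 4 is purely illustrative:

```
global
    # Assumption: threading toggled via nbthread.
    # 1 runs a single thread; any higher value enables multi-threading.
    nbthread 4
```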

We think http://git.haproxy.org/?p=haproxy-1.8.git;a=commit;h=4dbce456a223de3d06873828185ba789d5043def might be related in some way.

I hope this report helps.

Please let me know if you require any further information.

lukastribus · January 11, 2018, 6:43pm

A few questions:

Can you share your configuration regarding the timeouts?
Are you using systemd in notify mode (starting with -Ws and Type=notify, as per contrib/systemd/haproxy.service.in)?
Do you use hard-stop-after? You should set that in any case to something that matches your expectations, otherwise a small client or attacker can keep your old processes hanging around forever (just to be clear, that doesn’t mean this couldn’t be a bug).
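As a rough illustration of the last point, hard-stop-after goes in the global section and takes a time value. The 30m below is only a placeholder, not a value recommended anywhere in this thread; pick whatever matches the longest transfer you are willing to let an old process finish:

```
global
    # Illustrative value only: after a reload, an old process is
    # forcibly stopped once this delay has elapsed, even if it still
    # holds connections.
    hard-stop-after 30m
```

Without this setting an old process waits indefinitely for its remaining connections to close.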

Also there are two important bugs you will hit in a hosting environment with 1.8.3:

H2 issue with IE11/Edge browsers: http://git.haproxy.org/?p=haproxy-1.8.git;a=commit;h=646d23d1b502bc07a4a846f2ca7d332506b3087e
SSL cache failure: http://git.haproxy.org/?p=haproxy-1.8.git;a=commit;h=52a80823e8c2d04635cc95e5d0ca9440a53441cf

Both of those bugs are important in a hosting environment imho, and your issue could be related to the H2 issue with empty data frames from IE11/Edge clients.

I suggest you upgrade to the latest 1.8 git tree or, if you prefer a tarball, to yesterday’s 1.8 snapshot:
http://www.haproxy.org/download/1.8/src/snapshot/haproxy-ss-20180110.tar.gz

cheers,
lukas

rh20i · January 16, 2018, 8:04am

Hi Lukas

Timeouts (in global)

    timeout connect         10s
    timeout queue           30s
    timeout client          1m
    timeout server          5m
    timeout check           10s
    timeout http-request    10s

Invocation: I used the values from contrib/systemd/haproxy.service.in, but it looks like I missed Type=notify. Everything else is correct.

We are not using hard-stop-after.

I will retry with the latest snapshot, Type=notify and hard-stop-after enabled, and get back to you.

Thank you very much.

rh20i · January 16, 2018, 9:46am

I forgot to mention this: we sometimes see a negative “current conns” value in the general process information section:

pid = 13087 (process #1, nbproc = 1)
uptime = 0d 0h13m59s
system limits: memmax = unlimited; ulimit-n = 20137
maxsock = 20137; maxconn = 10000; maxpipes = 0
current conns = -1384; current pipes = 0/0; conn rate = 109/sec
Running tasks: 1/1150; idle = 43 %

That’s 1 process and no threading.

…

And sometimes it’s fine:

pid = 15130 (process #1, nbproc = 1)
uptime = 0d 0h02m15s
system limits: memmax = unlimited; ulimit-n = 20137
maxsock = 20137; maxconn = 10000; maxpipes = 0
current conns = 736; current pipes = 0/0; conn rate = 101/sec
Running tasks: 1/1039; idle = 48 %

rh20i · January 17, 2018, 8:43am

Hi Lukas

I can confirm that adding Type=notify and using the recommended haproxy version did not solve the problem.

Adding hard-stop-after did fix it, but we’d prefer not to use that: we’re happy to allow customers to run download sites and we wouldn’t want to ultimately impose a maximum time limit on how long a download can take. That said, we don’t believe it’s download sites causing this issue because the processes that remain don’t have any open connections to the Internet.

Any other ideas?

Thank you for your help so far.
