Reloads of haproxy 1.8.3 under SystemD cause "finishing" children to stack-up


#1

Hi everyone

I work for a large-ish UK web hosting company and we’re slowly introducing HTTP/2 now that it’s in haproxy 1.8.

We’re using haproxy 1.8.3 under SystemD on CentOS 7.4 (3.10.0-693.el7.x86_64) KVM virtual machines, built with the following make line:

%{__make} CPU="generic" TARGET=linux2628 USE_ZLIB=1 USE_REGPARM=1 USE_OPENSSL=1 USE_SYSTEMD=1 USE_PCRE=1 USE_PCRE_JIT=1

HTTP/2 is enabled on both the IPv4 and IPv6 HTTPS “bind” lines of our frontend:

frontend httpexternal
    bind *:80
    bind *:443 ssl crt /etc/haproxy/stackcerts/www.stackssl.com.pem ssl crt /etc/haproxy/certs/ alpn h2,http/1.1
    bind :::80
    bind :::443 ssl crt /etc/haproxy/stackcerts/www.stackssl.com.pem ssl crt /etc/haproxy/certs/ alpn h2,http/1.1

We have observed that after a haproxy reload the “finishing” PID never actually finishes, despite lsof showing that it has no active TCP connections.

If we disable HTTP/2 completely (IPv4 and IPv6) the problem goes away, and the finishing PIDs do indeed go away when they have no more connections.

After lots of lsof’ing, we noticed that a child we expected to have finished had this line in its lsof output:

haproxy 26670 haproxy  429u     sock      0,7      0t0 31631114 protocol: TCPv6

but no actual IPv6 connections. We see the same behaviour with IPv4 for “finishing” processes that never go away. The socket has a high-numbered file descriptor, which makes me think it came from a connection that was being used to serve HTTP requests.
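
For anyone wanting to repeat this check across several old processes, here is a rough Python sketch that picks out lsof lines of the shape shown above (TYPE "sock" with a NAME of "protocol: ..."). It assumes the default lsof column layout (COMMAND PID USER FD TYPE DEVICE SIZE/OFF NODE NAME); the names find_protocol_only_sockets and ProtocolOnlySocket are made up for illustration and are not part of lsof or haproxy.

```python
import re
from typing import List, NamedTuple


class ProtocolOnlySocket(NamedTuple):
    """One lsof row whose TYPE is 'sock' and whose NAME is 'protocol: X'."""
    pid: int
    fd: int
    protocol: str


def find_protocol_only_sockets(lsof_output: str) -> List[ProtocolOnlySocket]:
    """Return the 'sock' rows from lsof text output.

    Assumes the default nine-column lsof layout; rows that do not fit
    (including the header row) are skipped.
    """
    found = []
    for line in lsof_output.splitlines():
        # Split into at most 9 fields so NAME keeps its internal spaces.
        fields = line.split(None, 8)
        if len(fields) < 9 or fields[4] != "sock":
            continue
        name = fields[8]
        if not name.startswith("protocol:"):
            continue
        # FD looks like '429u': a number followed by an access-mode letter.
        fd_match = re.match(r"\d+", fields[3])
        if fd_match is None or not fields[1].isdigit():
            continue
        protocol = name.split(":", 1)[1].strip()
        found.append(ProtocolOnlySocket(int(fields[1]), int(fd_match.group()), protocol))
    return found
```

Feeding it the output of something like `lsof -p <pid>` for each "finishing" PID gives a list of the descriptors that are still open without a visible connection, which makes it easier to compare processes before and after a reload.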

The same behaviour occurs if we’re in single-threaded or multi-threaded mode (modified through config, not recompiling).

We think http://git.haproxy.org/?p=haproxy-1.8.git;a=commit;h=4dbce456a223de3d06873828185ba789d5043def might be related in some way.

I hope this report helps.

Please let me know if you require any further information.


#2

A few questions:

  • can you share your configuration regarding the timeouts?
  • are you using systemd in notify mode (starting with -Ws and Type=notify, as per contrib/systemd/haproxy.service.in)?
  • do you use hard-stop-after? You should set that in any case to something that matches your expectations, otherwise a small client or attacker can keep your old processes hanging around forever (just to be clear, that doesn’t mean this couldn’t be a bug)

Also there are two important bugs you will hit in a hosting environment with 1.8.3:

Both of those bugs are important in a hosting environment imho, and your issue could be related to the H2 issue with empty data frames from IE11/Edge clients.

I suggest you upgrade to the latest 1.8 git tree or, if you prefer a tarball, to yesterday’s 1.8 snapshot:
http://www.haproxy.org/download/1.8/src/snapshot/haproxy-ss-20180110.tar.gz

cheers,
lukas


#3

Hi Lukas

Timeouts (in global)

    timeout connect         10s
    timeout queue           30s
    timeout client          1m
    timeout server          5m
    timeout check           10s
    timeout http-request    10s

Invocation: I used the values from contrib/systemd/haproxy.service.in, but it looks like I missed Type=notify. Everything else matches.

We are not using hard-stop-after.

I will re-try with latest snapshot, Type=notify and hard-stop-after enabled and get back to you.

Thank you very much.


#4

I forgot to mention this - we sometimes see a negative “current conns” value in the general process information section:

pid = 13087 (process #1, nbproc = 1)
uptime = 0d 0h13m59s
system limits: memmax = unlimited; ulimit-n = 20137
maxsock = 20137; maxconn = 10000; maxpipes = 0
current conns = -1384; current pipes = 0/0; conn rate = 109/sec
Running tasks: 1/1150; idle = 43 %

That’s 1 process and no threading.

And sometimes it’s fine:

pid = 15130 (process #1, nbproc = 1)
uptime = 0d 0h02m15s
system limits: memmax = unlimited; ulimit-n = 20137
maxsock = 20137; maxconn = 10000; maxpipes = 0
current conns = 736; current pipes = 0/0; conn rate = 101/sec
Running tasks: 1/1039; idle = 48 %
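
To catch the negative counter without watching the page by hand, a small Python sketch along these lines could be used. It assumes the process information block has already been captured as plain text in exactly the format pasted above; parse_process_info and counter_is_plausible are hypothetical helper names, not anything haproxy provides.

```python
import re
from typing import Dict

# Only the fields needed for the check; values may carry a leading minus sign.
_FIELD = re.compile(r"\b(pid|maxconn|current conns)\s*=\s*(-?\d+)")


def parse_process_info(text: str) -> Dict[str, int]:
    """Extract pid, maxconn and current conns from the pasted text block."""
    return {key: int(value) for key, value in _FIELD.findall(text)}


def counter_is_plausible(info: Dict[str, int]) -> bool:
    """A connection counter should never be below zero."""
    return info["current conns"] >= 0
```

Running both pasted blocks through parse_process_info and then counter_is_plausible separates the bad sample (pid 13087) from the good one (pid 15130), so the check could be logged per PID across reloads.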

#5

Hi Lukas

I can confirm that adding Type=notify and using the recommended haproxy version did not solve the problem.

Adding hard-stop-after did fix it, but we’d prefer not to use that: we’re happy to allow customers to run download sites and we wouldn’t want to ultimately impose a maximum time limit on how long a download can take. That said, we don’t believe it’s download sites causing this issue because the processes that remain don’t have any open connections to the Internet.

Any other ideas?

Thank you for your help so far.