HAProxy community

2.0.1 CPU usage at near 100% after upgrade from 1.5

I’m on a 2-core machine with 4 GB of memory.
I have 11 different configs, each running as its own systemd service to isolate services.
CPU never went above 30% on 1.5 (the default version available in the CentOS 7 repo).
I built a 2.0.1 RPM and updated the systemd files, with no changes to the configs. Now the CPU spikes on start and stays there.

Should I be configuring things differently for 2.0.1, or is this just a bug and I need to install another version/patch?

uname -a
Linux proxy0 3.10.0-957.1.3.el7.x86_64 #1 SMP Thu Nov 29 14:49:43 UTC 2018 x86_64 x86_64 x86_64 GNU/Linux

HA-Proxy version 2.0.1 2019/06/26 - https://haproxy.org/
Build options :
TARGET = linux-glibc
CPU = generic
CC = gcc
CFLAGS = -O2 -g -fno-strict-aliasing -Wdeclaration-after-statement -fwrapv -Wno-unused-label -Wno-sign-compare -Wno-unused-parameter -Wno-old-style-declaration -Wno-ignored-qualifiers -Wno-clobbered -Wno-missing-field-initializers -Wtype-limits
OPTIONS = USE_PCRE=1 USE_PCRE_JIT=1 USE_THREAD=1 USE_REGPARM=1 USE_LINUX_TPROXY=1 USE_OPENSSL=1 USE_ZLIB=1 USE_TFO=1 USE_NS=1 USE_SYSTEMD=1

Feature list : +EPOLL -KQUEUE -MY_EPOLL -MY_SPLICE +NETFILTER +PCRE +PCRE_JIT -PCRE2 -PCRE2_JIT +POLL -PRIVATE_CACHE +THREAD -PTHREAD_PSHARED +REGPARM -STATIC_PCRE -STATIC_PCRE2 +TPROXY +LINUX_TPROXY +LINUX_SPLICE +LIBCRYPT +CRYPT_H -VSYSCALL +GETADDRINFO +OPENSSL -LUA +FUTEX +ACCEPT4 -MY_ACCEPT4 +ZLIB -SLZ +CPU_AFFINITY +TFO +NS +DL +RT -DEVICEATLAS -51DEGREES -WURFL +SYSTEMD -OBSOLETE_LINKER +PRCTL +THREAD_DUMP -EVPORTS

Default settings :
bufsize = 16384, maxrewrite = 1024, maxpollevents = 200

Built with multi-threading support (MAX_THREADS=64, default=2).
Built with OpenSSL version : OpenSSL 1.0.2k-fips 26 Jan 2017
Running on OpenSSL version : OpenSSL 1.0.2k-fips 26 Jan 2017
OpenSSL library supports TLS extensions : yes
OpenSSL library supports SNI : yes
OpenSSL library supports : SSLv3 TLSv1.0 TLSv1.1 TLSv1.2
Built with network namespace support.
Built with transparent proxy support using: IP_TRANSPARENT IPV6_TRANSPARENT IP_FREEBIND
Built with zlib version : 1.2.7
Running on zlib version : 1.2.7
Compression algorithms supported : identity("identity"), deflate("deflate"), raw-deflate("deflate"), gzip("gzip")
Built with PCRE version : 8.32 2012-11-30
Running on PCRE version : 8.32 2012-11-30
PCRE library supports JIT : yes
Encrypted password support via crypt(3): yes

Available polling systems :
epoll : pref=300, test result OK
poll : pref=200, test result OK
select : pref=150, test result OK
Total: 3 (3 usable), will use epoll.

Available multiplexer protocols :
(protocols marked as <default> cannot be specified using 'proto' keyword)
h2 : mode=HTX side=FE|BE mux=H2
h2 : mode=HTTP side=FE mux=H2
<default> : mode=HTX side=FE|BE mux=H1
<default> : mode=TCP|HTTP side=FE|BE mux=PASS

Available services : none

Available filters :
[SPOE] spoe
[COMP] compression
[CACHE] cache
[TRACE] trace

Please upgrade to 2.0.2, it fixes a ton of issues, including CPU related stuff.

I updated… no change in the CPU usage, although when I stop one of the services it drops to less than 10%, then climbs back to 99-100%.

Here is the config. Perhaps I need to alter it?

global
        daemon
        user haproxy
        group haproxy

defaults
        mode http
        maxconn 10000
        timeout connect 5000
        timeout client 50000
        timeout server 50000

listen stats
        bind 10.0.0.4:1978
        stats enable
        stats realm Haproxy\ Statistics\ RabbitMQ
        stats uri /
        stats refresh 5s

# RabbitMQ
listen rabbit
        bind 10.0.0.4:5672 v4v6
        balance roundrobin
        mode tcp
        option tcp-check

        server rabbit-1  10.0.0.1:5672    check inter 2000 rise 2 fall 3 send-proxy
        server rabbit-2  10.0.0.2:5672    check inter 2000 rise 2 fall 3 send-proxy
        server rabbit-3  10.0.0.3:5672    check inter 2000 rise 2 fall 3 send-proxy

This is most likely a bug, there is also a similar report on the mailing list:

https://www.mail-archive.com/haproxy@formilux.org/msg34558.html

Could you attach strace -tt -p<PID> to a process occupying 100% CPU and provide a few seconds of its output (it will be large)? Are you able to reproduce this with nbthread 1 in your configuration?
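For anyone who wants to collect the same data, a minimal sketch of a time-limited capture follows. The function name, the default duration, and the output path are hypothetical. The -f flag (follow all threads) is an addition to the command requested above, on the assumption that the build runs more than one thread.

```shell
# Capture a few seconds of syscalls from one busy haproxy process.
# Usage: capture_strace <pid> [seconds] [outfile]
capture_strace() {
    pid=$1
    secs=${2:-5}                              # assumed default: 5 seconds
    out=${3:-/tmp/haproxy-strace.$pid.txt}    # hypothetical output path
    # -tt : timestamps with microseconds (as requested above)
    # -f  : also attach to the other threads of the process (assumption)
    # -o  : write the trace to a file instead of the terminal
    # timeout ends the capture after $secs seconds
    timeout "$secs" strace -f -tt -p "$pid" -o "$out"
    echo "$out"
}
```

Pass the PID that top shows at 100%, for example `capture_strace 1234 5`. Attaching usually requires root or the same user as the haproxy process.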

CC’ing @willy

Thank you. I did the strace without the nbthread change. The nbthread change did drop the CPU usage a lot, though.
Either way, I am not able to pull up the stats page.


Excellent, thank you. So it shows that some data are announced as available, but not read. It could be a problem of a buffer full condition that is not properly handled. What is surprising is that in TCP mode the path between the fd and the upper stream is the shortest possible (we don’t even use muxes) so the reason for this must be a bit gross. Now we need to find a way to reproduce this.
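One rough way to spot this kind of pattern in a capture is to count syscalls by name. A process that is woken up repeatedly without reading would typically show far more polling calls (epoll_wait on this build, since it uses epoll) than read or recv calls. The sketch below is only an illustration; the function name is hypothetical and it assumes the plain `strace -tt` line format, optionally with a leading PID column.

```shell
# Count syscalls by name in an `strace -tt` capture read from stdin.
syscall_histogram() {
    awk '
    {
        # When strace writes a PID column first, the syscall is in field 3.
        f = ($1 ~ /^[0-9]+$/) ? 3 : 2
        # Keep only lines where that field looks like "name(";
        # signal lines and "<... resumed>" lines are skipped.
        if (match($f, /^[a-z_0-9]+\(/)) {
            name = substr($f, 1, RLENGTH - 1)
            count[name]++
        }
    }
    END { for (n in count) print count[n], n }
    ' | sort -rn
}
```

Usage: `syscall_histogram < /tmp/haproxy-strace.txt | head` (the file name is a placeholder for wherever the capture was saved).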
It would be interesting to know if this also happens without checks so that we can tell whether it’s checks or regular traffic which is causing this.
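To test without checks and without editing the production file by hand, one option is to generate a copy of the config with the health-check keywords removed from the server lines. This is a sketch under assumptions: the function name is hypothetical, it only handles the check, inter, rise and fall keywords seen in the config above, and it normalizes the spacing of server lines.

```shell
# Print a config (stdin) with health checks removed from "server" lines.
strip_checks() {
    awk '
    $1 == "server" {
        out = "        " $1
        for (i = 2; i <= NF; i++) {
            if ($i == "check") continue
            # inter/rise/fall each take one value: drop keyword and value
            if ($i == "inter" || $i == "rise" || $i == "fall") { i++; continue }
            out = out " " $i
        }
        print out
        next
    }
    { print }   # every other line is passed through unchanged
    '
}
```

Usage: `strip_checks < rabbit.cfg > rabbit-nocheck.cfg` (file names are placeholders), then validate the result with `haproxy -c -f rabbit-nocheck.cfg` before starting a test instance with it.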

Thank you for looking into this.

I just reverted to 1.8 and it’s working, which made me realize I was only getting 504 connections; now it’s 1528… it was blocking all those connections.

Hello,

I am having a similar problem, as documented on the mailing list here: https://www.mail-archive.com/haproxy@formilux.org/msg34605.html

I have tried removing all checks and agents and still experience this problem with 2.0.3. Here is an example server line:

server s1 10.0.2.1:8080 weight 100 source 10.0.1.10

I will try to simplify things even further (by removing every other listen/frontend but this one entry) and report back. Please let me know if there is anything else you would prefer I try.

Ok, so I simplified everything. Removed all other services, just one simple HTTP load balancer between the front end application and the back end application. I removed threading from the picture and kept it to only a single process. I tried to make it as basic as possible. 2.0.3 still suffers from maxing out the cpu and dropping requests, when haproxy 1.6 does not.

Here are the configs:

global
        log /dev/log    local0 notice
        chroot /var/lib/haproxy

        stats socket /run/haproxy/haproxy_20.sock mode 664 level admin
        stats timeout 30s

        user haproxy
        group haproxy
        daemon

        nbproc 1
        nbthread 1

        maxconn 500000


defaults
        log     global
        mode    http

        option  dontlognull
        option  dontlog-normal
        option  redispatch

        option  tcp-smart-accept
        option  tcp-smart-connect

        timeout connect 2s
        timeout client  50s
        timeout server  50s
        timeout client-fin 1s
        timeout server-fin 1s

        maxconn 150000

        errorfile 400 /etc/haproxy/errors/400.http
        errorfile 403 /etc/haproxy/errors/403.http
        errorfile 408 /etc/haproxy/errors/408.http
        errorfile 500 /etc/haproxy/errors/500.http
        errorfile 502 /etc/haproxy/errors/502.http
        errorfile 503 /etc/haproxy/errors/503.http
        errorfile 504 /etc/haproxy/errors/504.http

listen back
        bind    10.0.0.249:8080    defer-accept
        bind    10.0.0.251:8080    defer-accept
        bind    10.0.0.252:8080    defer-accept
        bind    10.0.0.253:8080    defer-accept
        bind    10.0.0.254:8080    defer-accept
        mode    http

        maxconn 65000
        fullconn 65000

        balance leastconn
        http-reuse safe

        server  s1     10.0.6.11:8080  weight 100 source 10.0.1.100
        server  s2     10.0.6.12:8080  weight 100 source 10.0.1.101
        server  s3     10.0.6.13:8080  weight 100 source 10.0.1.102
        server  s4     10.0.6.14:8080  weight 100 source 10.0.1.103
        server  s5     10.0.6.15:8080  weight 100 source 10.0.1.100
        server  s6     10.0.6.16:8080  weight 100 source 10.0.1.101
        server  s7     10.0.6.17:8080  weight 100 source 10.0.1.102
        server  s8     10.0.6.18:8080  weight 100 source 10.0.1.103
        server  s9     10.0.6.19:8080  weight 100 source 10.0.1.100
        server  s10    10.0.6.20:8080  weight 100 source 10.0.1.101
        server  s11    10.0.6.21:8080  weight 100 source 10.0.1.102
        server  s12    10.0.6.22:8080  weight 100 source 10.0.1.103
        server  s13    10.0.6.23:8080  weight 100 source 10.0.1.100
        server  s14    10.0.6.24:8080  weight 100 source 10.0.1.101

Given the very low fd number I strongly suspect it’s a listener that is looping like this. Now why is it looping like this? I still have no idea. I’d be fine with it reaching a limit or something, but then it should disable polling, which is not done here. I’ll have another look at the accept() code to see if anything could cause one FD not to be properly disabled once a limit is reached.

Thanks!


If there is a setting I need to increase, I can easily do so. This device only serves as a load balancer, so we can allocate whatever resources are necessary to the processes.

Were there any dramatic code changes between 1.6 -> 2.0 in the area you have a concern?

There were many changes between 1.6 and 2.0 in these areas. Threads, layered connections with muxes, accept-queues, idle connections etc. are all possible candidates to justify a change of behaviour. But whatever the reason, if an FD is waking up your process all the time without being handled, it is a bug that needs to be addressed. At the very least it should be disabled for the time needed for the issue to go away. That’s what we need to figure out.

I’m also seeing this/a similar issue with 2.0.3.

@rbrooker and @ngaugler are you able to reproduce this issue if you set “no option http-use-htx” in the defaults? It was changed to be the new default in “2.0-dev3”.
https://cbonte.github.io/haproxy-dconv/2.0/configuration.html#4.2-option%20http-use-htx

I quickly tested with and without ‘no option http-use-htx’ and saw 100% utilization with both. Reverting to 1.6 immediately fixed the issue. For the simplified version I am using no threading; code changes may have been necessary to support threading, but this problem affects both threaded and non-threaded setups.

If there is anything else you need me to try please let me know. It’s very easy to reproduce. I am really quite surprised that everyone else doesn’t have this problem… other than volume of traffic I am not sure what I am doing differently.

We were using 1.9.8 without issues before upgrading. Have you seen the same problem with any version in the 1.9.X branch?

When the problem happens it would be nice to see the socket states:
$ ss -atn | cut -f1 -d' ' | sort | uniq -c
If you see some CLOSE-WAIT sockets, please run ss -atn | grep CLOSE-WAIT and check whether they are from the client to haproxy or from haproxy to the server.
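The client-side versus server-side split can be sketched as a small filter over the ss output. It rests on an assumption that is not from the thread: a socket whose local port equals the frontend port is a client-to-haproxy connection, and every other one is haproxy-to-server. The function name is hypothetical, and the state is matched with either a hyphen or an underscore since tools differ in how they spell it.

```shell
# Split CLOSE-WAIT sockets from `ss -atn` output (stdin) by side.
# Usage: ss -atn | close_wait_by_side <frontend-port>
close_wait_by_side() {
    awk -v port="$1" '
    $1 == "CLOSE-WAIT" || $1 == "CLOSE_WAIT" {
        # Field 4 is the local address:port; the port is the last
        # colon-separated piece, which also works for IPv6 addresses.
        n = split($4, a, ":")
        if (a[n] == port) client++; else server++
    }
    END { printf "client-side %d\nserver-side %d\n", client, server }
    '
}
```

For the listen section bound to port 8080 above, that would be `ss -atn | close_wait_by_side 8080`.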

Also, how do you start your haproxy process? Are you using the master-worker system? Does it fail immediately upon first startup, or only after some time, after processing some traffic, or after a reload? It would be interesting to know what socket fd 5 corresponds to; this can be done using "ss -anp | grep -w 'fd=5'" (assuming the fd is still 5).
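As a complement to the ss command, on Linux the kernel also exposes each descriptor under /proc, which gives a quick cross-check of what a given fd is. This is a minimal sketch; the function name is hypothetical, and reading another process's fd table generally needs root or the same user.

```shell
# Print what a file descriptor of a running process points to (Linux /proc).
# Usage: fd_target <pid> <fd>
fd_target() {
    # For a socket this prints something of the form socket:[<inode>];
    # for a regular file it prints the file path.
    readlink "/proc/$1/fd/$2"
}
```

For example, `fd_target <PID> 5` with the PID of the busy haproxy process. If the result is a socket, the ss output above is still needed to see which address it is bound or connected to.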

I’m a bit bothered by this one because the only way not to accept a connection in the listeners code is to reach a configured limit, and with your config it will not happen for a while, so it must be something different.