Performance problem with nbthread

Hi all,

I need some advice regarding multithreading configuration.

We are using haproxy 2.0.12 on a CentOS 8 virtual machine (VMware) with 16 GB of RAM, 4 vCPUs (1 core each) and a 1 Gbps NIC. This setup has only one frontend (http mode, SSL only) and two backends (http). Currently the traffic is quite small: we have seen at most 400 concurrent connections, with a peak request rate of 42 connections/sec and a peak network bandwidth of 10 Mbit/s. But in the future we will need to handle up to ~5000 concurrent connections, maybe 10000.

And here begins the problem: with the current setup, haproxy consumes up to 35% of CPU power when nbproc 1 and nbthread 4 are set. As soon as I comment out the nbthread line and switch to nbproc 4, the CPU load disappears completely: haproxy uses at most 2% of the power of all 4 CPUs.

I would leave it “as is” with nbproc, but that causes other problems: “independent” stick tables per process, a dedicated stats page for each process, etc. So I definitely need to use multithreading.
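
Just to illustrate the pain point: as far as I understand, with nbproc each process needs its own admin socket and keeps its own copy of the stick tables, so the global section would need something roughly like this (socket paths are only examples):

global
    nbproc 4
    # one admin socket per process, each one only sees that process' stats
    stats socket /var/run/haproxy-1.sock process 1 mode 600 level admin
    stats socket /var/run/haproxy-2.sock process 2 mode 600 level admin
    stats socket /var/run/haproxy-3.sock process 3 mode 600 level admin
    stats socket /var/run/haproxy-4.sock process 4 mode 600 level admin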

Here is our config:

global
    maxconn         10000
    stats socket    /var/run/haproxy.stat mode 600 level admin
    log             127.0.0.1:514 local2
    chroot          /var/empty
    pidfile         /var/run/haproxy.pid
    user            haproxy
    group           haproxy
    ssl-default-bind-options no-tlsv13
    ssl-default-bind-ciphers 'HIGH:!aNULL:!MD5'
    tune.ssl.default-dh-param 4096
    tune.ssl.cachesize 1000000
    tune.ssl.lifetime 600
    tune.ssl.maxrecord 1460
    nbproc 1
    nbthread 4
    daemon

defaults
    option contstats
    retries 3 

frontend WEB
    bind            192.168.0.25:80
    bind            192.168.0.25:443 ssl crt /Certs/domain1.pem crt /Certs/domain2.pem
    mode            http
    timeout         http-request 5s 
    timeout         client 30s
    log             global
    option          httplog
    option          dontlognull
    option          forwardfor
    monitor-uri     /healthcheck
    maxconn         8000
    http-request capture req.hdr(Host) len 20

    %%%Some ACLs are defined here%%%

    http-response set-header Strict-Transport-Security "max-age=63072000; includeSubdomains; preload"
    http-response set-header X-Frame-Options "SAMEORIGIN"
    http-response set-header X-XSS-Protection "1; mode=block"
    http-response set-header X-Content-Type-Options "nosniff"
    http-response set-header X-Permitted-Cross-Domain-Policies "none"
    http-response set-header X-Robots-Tag "all"
    http-response set-header X-Download-Options "noopen"

    # Do not allow more than 10 concurrent tcp connections per IP, or 15 connections in 3 seconds
    tcp-request content reject if { src_conn_rate(Abuse) ge 15 }
    tcp-request content reject if { src_conn_cur(Abuse) ge 10 }
    tcp-request connection track-sc1 src table Abuse

    # Redirect HTTP to HTTPS
    redirect        scheme https code 301 if !{ ssl_fc } 
    default_backend Web-Pool


backend Web-Pool
    mode            http
    balance         roundrobin
    retries         2
    option redispatch
    timeout connect 5s
    timeout server  30s
    timeout queue   30s
    option forwardfor
    option httpchk  HEAD /
    http-check      expect status 200
    cookie          DYNSRV insert indirect nocache
    fullconn        4000 
    http-request set-header X-Client-IP %[src]
    server          httpd01 192.168.0.30:80 check weight 1 inter 2000 rise 2 fall 2 minconn 0 maxconn 0 on-marked-down shutdown-sessions
    server          httpd02 192.168.0.31:80 check weight 2 inter 2000 rise 2 fall 2 minconn 0 maxconn 0 on-marked-down shutdown-sessions

backend Abuse
    stick-table type ip size 1m expire 30m store conn_rate(3s),conn_cur,gpc0,http_req_rate(10s),http_err_rate(20s)

With multi-process config, I use the following settings:
nbproc 4
cpu-map 1 0
cpu-map 2 1
cpu-map 3 2
cpu-map 4 3

I believe something is just wrong in my configuration… Could anybody help me to find the cause of this problem?

Thank you.

Make sure those 4 vCPUs are cores dedicated to this VM (and not preempted for other things), and that they are on the same NUMA node.

Problems that arise from preemption will definitely be worse with multithreading than with multiple processes.

Try binding the threads to the respective cores in nbthread mode:
cpu-map auto:1/1-4 0-3
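
In your global section that would look roughly like this, keeping the values you already have:

global
    nbproc 1
    nbthread 4
    # pin threads 1-4 of process 1 to CPUs 0-3
    cpu-map auto:1/1-4 0-3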

Also provide the output of haproxy -vv please.

I’m not sure you actually need parallel processing - unless you get DDoSed with an SSL handshake attack. Consider just using nbproc 1, nbthread 1 as a workaround. Of course this does not scale.

Hi!

Thank you, I will try the settings you suggested.

Basically I do. We have some Apache servers that often suffer from different kinds of DDoS attacks, and the idea is to put everything behind haproxy.
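
For example, the Abuse table already stores http_req_rate(10s), so something like the following could later be added to the frontend for HTTP-level rate limiting (the threshold is only a placeholder):

    # hypothetical rule: deny clients whose tracked HTTP request rate
    # in the Abuse table is too high
    http-request deny if { sc1_http_req_rate gt 100 }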

Here is the output of haproxy -vv:

HA-Proxy version 2.0.12 2019/12/21 - https://haproxy.org/
Build options :
  TARGET  = linux-glibc
  CPU     = generic
  CC      = gcc
  CFLAGS  = -O2 -g -fno-strict-aliasing -Wdeclaration-after-statement -fwrapv -Wno-unused-label -Wno-sign-compare -Wno-unused-parameter -Wno-old-style-declaration -Wno-ignored-qualifiers -Wno-clobbered -Wno-missing-field-initializers -Wno-implicit-fallthrough -Wno-stringop-overflow -Wno-cast-function-type -Wtype-limits -Wshift-negative-value -Wshift-overflow=2 -Wduplicated-cond -Wnull-dereference
  OPTIONS = USE_PCRE=1 USE_PCRE_JIT=1 USE_THREAD=1 USE_REGPARM=1 USE_LINUX_TPROXY=1 USE_OPENSSL=1 USE_ZLIB=1 USE_TFO=1 USE_NS=1 USE_SYSTEMD=1

Feature list : +EPOLL -KQUEUE -MY_EPOLL -MY_SPLICE +NETFILTER +PCRE +PCRE_JIT -PCRE2 -PCRE2_JIT +POLL -PRIVATE_CACHE +THREAD -PTHREAD_PSHARED +REGPARM -STATIC_PCRE -STATIC_PCRE2 +TPROXY +LINUX_TPROXY +LINUX_SPLICE +LIBCRYPT +CRYPT_H -VSYSCALL +GETADDRINFO +OPENSSL -LUA +FUTEX +ACCEPT4 -MY_ACCEPT4 +ZLIB -SLZ +CPU_AFFINITY +TFO +NS +DL +RT -DEVICEATLAS -51DEGREES -WURFL +SYSTEMD -OBSOLETE_LINKER +PRCTL +THREAD_DUMP -EVPORTS

Default settings :
  bufsize = 16384, maxrewrite = 1024, maxpollevents = 200

Built with multi-threading support (MAX_THREADS=64, default=4).
Built with OpenSSL version : OpenSSL 1.1.1 FIPS  11 Sep 2018
Running on OpenSSL version : OpenSSL 1.1.1 FIPS  11 Sep 2018
OpenSSL library supports TLS extensions : yes
OpenSSL library supports SNI : yes
OpenSSL library supports : TLSv1.0 TLSv1.1 TLSv1.2 TLSv1.3
Built with network namespace support.
Built with transparent proxy support using: IP_TRANSPARENT IPV6_TRANSPARENT IP_FREEBIND
Built with zlib version : 1.2.11
Running on zlib version : 1.2.11
Compression algorithms supported : identity("identity"), deflate("deflate"), raw-deflate("deflate"), gzip("gzip")
Built with PCRE version : 8.42 2018-03-20
Running on PCRE version : 8.42 2018-03-20
PCRE library supports JIT : yes
Encrypted password support via crypt(3): yes

Available polling systems :
      epoll : pref=300,  test result OK
       poll : pref=200,  test result OK
     select : pref=150,  test result OK
Total: 3 (3 usable), will use epoll.

Available multiplexer protocols :
(protocols marked as <default> cannot be specified using 'proto' keyword)
              h2 : mode=HTX        side=FE|BE     mux=H2
              h2 : mode=HTTP       side=FE        mux=H2
       <default> : mode=HTX        side=FE|BE     mux=H1
       <default> : mode=TCP|HTTP   side=FE|BE     mux=PASS

Available services : none

Available filters :
        [SPOE] spoe
        [COMP] compression
        [CACHE] cache
        [TRACE] trace

Hi!

Unfortunately the cpu-map setting did not help. CPU usage is still very high with quite low traffic (max connection rate: 640/s, max session rate: 520/s, max request rate: 330/s, max bandwidth: 15 Mbit/s):
(screenshot: CPU usage graph)

And the problem still disappears immediately when switching to multi-processing mode instead of multi-threading.

Any ideas?

Like I said, confirm that the vCPUs are dedicated to this VM and on the same NUMA node.

I have just checked: the haproxy VM uses only one NUMA node. CPU affinity is not available in our setup, because we do not have DRS enabled.

Unless you can guarantee that the cores are dedicated to this VM, you will not see reliable performance.

Sorry, but I do not understand.
First of all, haproxy shows pretty good performance in multi-process mode. Only multithreading mode causes performance problems.
Second, what is the difference between haproxy and other reverse proxies that can work on VMware without dedicated CPUs?

I still have the impression that something is wrong with my configuration…

Like I said:

There is plenty of synchronization going on with nbthread, which is not the case in multi-process mode.

Threading is not comparable to multi-process mode, it’s a completely different beast.

That’s why the first thing to do when troubleshooting a CPU bottleneck on a virtualized platform is to dedicate the cores (and use the same NUMA node).

You can go through the CPU usage section of the management guide:

https://cbonte.github.io/haproxy-dconv/2.0/management.html#7

But I really suggest using dedicated cores here.

Is this still an “issue” in newer versions of HAProxy?
Is it possible to have a “classic” monitoring setup with the Prometheus node exporter and scrape all statistics at once?
We are using nbthread, but for some use cases using multiple CPU cores would greatly improve performance.

That’s what nbthread is for: it DOES use multiple CPU cores.

Sorry, I meant using nbproc, as in multi-process mode.

Multiprocess mode will always require separate stats sockets, etc. This is not going away. In fact this is the main pain point that multi-threading addresses.
