2.2.3 multiprocess -> 3.0.8 multithreaded TLS performance degradation

Hi all, I’m trying to port existing haproxy instances from 2.2.3 using nbproc to 3.0.8 with nbthread, and I’m hitting a (literal) TLS performance wall.

The hardware and underlying OS is identical, I’m just trying to update to a newer version and use a supported config:

  • AMD EPYC 9654P 96-Core Processor (hyperthreaded, no NUMA) 3.7GHz, 768GB RAM
  • Mellanox MT28908 ConnectX-6 NIC, 100Gb DAC, FEC enabled
  • Ubuntu 22.04 (Linux version 5.15.0-60-generic)

The 2.2.3 config is simply using nbproc set to 64 with no other CPU config:

global
    user haproxy
    group haproxy
    daemon
    maxconn 450000
    nbproc 64
    server-state-file /tmp/haproxybackendstate
    set-dumpable
    ssl-dh-param-file /etc/ssl/ttd/dh_params
    ssl-mode-async
    stats maxconn 200
    tune.bufsize 32768
    tune.comp.maxlevel 2
    tune.ssl.cachesize 1000000
    tune.ssl.default-dh-param 2048
defaults
    load-server-state-from-file global
    maxconn 450000
    mode tcp
    retries 3
    timeout check 11s
    timeout client 16s
    timeout connect 10s
    timeout server 16s

and the 3.0.8 config I’ve ended up with for testing effectively tries to mirror that:

global
    user haproxy
    group haproxy
    daemon
    nbthread 64
    thread-groups 1
    cpu-map auto:1/1-64 0-63
    maxconn 450000
    server-state-file /tmp/haproxybackendstate
    set-dumpable
    ssl-dh-param-file /etc/ssl/ttd/dh_params
    ssl-mode-async
    stats maxconn 200
    stats socket /var/run/haproxystats.sock mode 600 level admin expose-fd listeners
    tune.bufsize 32768
    tune.comp.maxlevel 2
    tune.ssl.cachesize 1000000
    tune.ssl.default-dh-param 2048
    tune.listener.default-shards by-thread # https://docs.haproxy.org/3.0/configuration.html#3.2-tune.listener.default-shards
    tune.listener.multi-queue fair # https://docs.haproxy.org/3.0/configuration.html#3.2-tune.listener.multi-queue

defaults
    load-server-state-from-file global
    maxconn 450000
    mode tcp
    retries 3
    timeout check 11s
    timeout client 16s
    timeout connect 10s
    timeout server 16s

the front/backend configs are the same for both:

frontend myservice443
    bind x.x.x.x:443 ssl crt /etc/ssl/mycompany/star_global_pub_priv
    default_backend myservice443
    mode http
    option forwardfor if-none

frontend myservice80
    bind x.x.x.x:80
    default_backend myservice80
    mode http
    option forwardfor if-none

backend myservice443
    balance random
    http-check expect status 200
    http-check send-state
    mode http
    option httpchk GET /service/health?from=lb
    server  backendpod1  172.18.95.118:443 ca-file /etc/ssl/mycompany/star_global_cert_ca check fall 4 inter 5s rise 3 slowstart 15s ssl verify none weight   100
    server  backendpod2  172.18.89.81:443 ca-file /etc/ssl/mycompany/star_global_cert_ca check fall 4 inter 5s rise 3 slowstart 15s ssl verify none weight   100
    server  backendpod3  172.18.85.74:443 ca-file /etc/ssl/mycompany/star_global_cert_ca check fall 4 inter 5s rise 3 slowstart 15s ssl verify none weight   100
    server  backendpod4  172.18.93.160:443 ca-file /etc/ssl/mycompany/star_global_cert_ca check fall 4 inter 5s rise 3 slowstart 15s ssl verify none weight   100

backend myservice80
    balance random
    http-check expect status 200
    http-check send-state
    mode http
    option httpchk GET /service/health?from=lb
    server  backendpod1  172.18.95.118:80 check fall 4 inter 5s rise 3 slowstart 15s weight   100
    server  backendpod2  172.18.89.81:80 check fall 4 inter 5s rise 3 slowstart 15s weight   100
    server  backendpod3 172.18.85.74:80 check fall 4 inter 5s rise 3 slowstart 15s weight   100
    server  backendpod4  172.18.93.160:80 check fall 4 inter 5s rise 3 slowstart 15s weight   100

(although I’ve also tried 3 thread groups with 64 processors in each)

and finally, these are the kernel tuning parameters for both systems:

linux_sysctl:
  net.ipv4.conf.all.proxy_arp: 0
  net.ipv4.tcp_window_scaling: 1
  net.ipv4.tcp_fin_timeout: 10
  net.ipv4.ip_forward: 1
  net.ipv4.conf.all.rp_filter: 2
  net.ipv4.conf.default.rp_filter: 2
  fs.file-max: 5000000
  fs.nr_open: 5000000
  net.ipv4.tcp_max_syn_backlog: 3240000
  net.core.somaxconn: 100000
  net.core.netdev_max_backlog: 100000
  net.ipv4.ip_local_port_range: "{{ 1024 + ansible_processor_vcpus }} 65535"
  net.netfilter.nf_conntrack_buckets: 425440
  net.netfilter.nf_conntrack_max: 10035200
  net.netfilter.nf_conntrack_tcp_timeout_close_wait: 20
  net.netfilter.nf_conntrack_tcp_timeout_fin_wait: 20
  net.netfilter.nf_conntrack_tcp_timeout_time_wait: 20
  net.ipv4.tcp_max_orphans: 5000000
  net.ipv4.conf.all.arp_ignore: 1
  net.ipv4.conf.default.arp_ignore: 1
  net.ipv4.conf.all.arp_announce: 2
  net.ipv4.conf.default.arp_announce: 2

The behavior is that within seconds of turning on traffic to the instance (via rack-level routing change):

  • 4xx and 5xx errors start climbing
  • CPU utilization climbs until it hits 100% on all assigned processors
  • If I disable the TLS frontend, the non-TLS proxy works fine with no performance issues
  • If I enable the TLS frontend, neither proxy is healthy (I don’t see the 4xx and 5xx errors spike as high on the non-TLS service, but it never achieves healthy traffic levels)

I have what is essentially an ideal production testing environment, where these SLBs are peering via BGP to the top-of-rack switch and I have effectively equal amounts of traffic going to them; in this rack there are two SLBs total, one running the legacy 2.2.3 version with nbproc and the other running 3.0.8 with the nbthread config. The cert being used is a 2048-bit wildcard.

Any initial thoughts? My initial investigation seems to point to receive queues being overloaded. I’ve tried things like setting maxconn to very high values (e.g. 2M), and the same with tune.ssl.cachesize (e.g. 64M), since those are shared across all threads.
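
For context, these are roughly the checks I’ve been using to look at queue pressure and drops so far (standard tools, nothing haproxy-specific; the interface name below is only illustrative for our NIC):

    # listen-queue overflows/drops since boot (kernel SNMP counters)
    nstat -az TcpExtListenOverflows TcpExtListenDrops

    # accept backlog (Recv-Q) on the bind ports
    ss -ltn '( sport = :443 or sport = :80 )'

    # NIC-level RX drops/discards (interface name is illustrative)
    ip -s link show dev ens1f0np0
    ethtool -S ens1f0np0 | grep -iE 'drop|discard|miss'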

Sorry, forgot some important details on current traffic levels and system stats.

TLS VIP (sustained levels):

  • Tx - ~25Mbps
  • Rx - ~300Mbps
  • Sessions - ~5k
  • Session rate - ~ 200/s
  • Request rate - 7k/s
  • Connection rate - ~ 175/s

Non-TLS VIP:

  • Tx - ~90Mbps
  • Rx - ~800Mbps
  • Sessions - ~3k
  • Session rate - ~ 30/s
  • Request rate - 25k/s
  • Connection rate - ~ 40/s

Total TCP connection rate for the healthy proxies is ~700/s, total connections ~11k. CPU average across 64 cores is < 2% utilization, memory < 10GB, disk I/O < 1%, total Tx/Rx throughput on 100Gbps NIC is < 2Gbps. The healthy systems have statistically insignificant numbers of proxy errors (i.e. 4xx/5xx), zero network transport errors.

When under production traffic pressure, the 3.0.8 proxy reports more than 2k 4xx errors per second, Tx/Rx traffic spikes to 15Gbps/8Gbps, and the request rate jumps to > 75k/s. Could this be entirely due to failed TLS handshakes and/or aggressive retries against an overloaded receive queue?

I don’t think there is a way to use 64 threads with acceptable performance while using TLS on both front and backend.

The reason is that multi-threaded performance in OpenSSL 3 is still terrible after all these years:

Multi-process mode with OpenSSL 1.1.1 was really the best of all worlds performance-wise. However, this is not a choice we have today.

I would suggest you try with 8 or 16 threads and compare the results with your 64-thread performance. If 16 threads behave better than 64, you know that we are looking at thread contention.
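
For example, keep everything else in the 3.0.8 config identical and only reduce the thread count (sketch; adjust the CPU numbers to your layout):

    global
        nbthread 16
        thread-groups 1
        cpu-map 1/1-16 0-15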

Think about upgrading your certificate from RSA to ECC. That should make a huge difference for TLS <= v1.2 handshakes, while for TLSv1.3 it shouldn’t change anything.
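
If you want to test that, generating an ECDSA P-256 key and CSR is straightforward (file names and subject below are placeholders; your CA issues the actual certificate):

    # ECDSA P-256 private key
    openssl ecparam -name prime256v1 -genkey -noout -out wildcard_ecc.key

    # CSR to submit to the CA
    openssl req -new -key wildcard_ecc.key \
        -subj "/CN=*.example.com" -out wildcard_ecc.csr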

Using a different SSL library will ultimately help here, but this requires compiling the library and haproxy on your own:

AWS-LC and wolfssl would be the ones to look at, but do read the entire article.
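
Very roughly, the haproxy side of an AWS-LC build looks like this (prefix paths are placeholders; the INSTALL file in the haproxy sources has the authoritative flags):

    # after building/installing AWS-LC into /opt/aws-lc (a cmake project):
    make -j"$(nproc)" TARGET=linux-glibc \
        USE_OPENSSL_AWSLC=1 \
        SSL_INC=/opt/aws-lc/include SSL_LIB=/opt/aws-lc/lib \
        USE_LUA=1 USE_PROMEX=1
    sudo make install-bin

Depending on how AWS-LC itself is built, you may need to link it statically or point the runtime linker at its libraries.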

I would also suggest enabling http-reuse to make sure backend connections are reused as efficiently as possible.
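
i.e. something like this on the http-mode backends (a sketch; "safe" is the conservative setting, "aggressive"/"always" reuse more):

    backend myservice443
        mode http
        balance random
        # share idle server-side (TLS) connections between requests
        http-reuse safe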


So I did a couple of quick tests, and now I’m more confused. I did build using AWS-LC, which resulted in the CPUs no longer being saturated and helped me get a little further in troubleshooting. I took a break from looking at log files and ran a simple experiment: I took the binary from the vbernat 2.2.3 package and just dropped it in as a replacement, with this config:

global
    user haproxy
    group haproxy
    daemon
    nbthread 64
#    thread-groups 1
#    cpu-map auto:1/1-64 0-63
    maxconn 2250000
    server-state-file /tmp/haproxybackendstate
    set-dumpable
    ssl-dh-param-file /etc/ssl/ttd/dh_params
#    ssl-mode-async
#    maxsslrate 2000
    stats maxconn 200
    stats socket /var/run/haproxystats.sock mode 600 level admin expose-fd listeners
    tune.bufsize 32768
    tune.comp.maxlevel 2
#    tune.maxaccept 20
    tune.ssl.cachesize 64000000
    tune.ssl.default-dh-param 2048
 #   tune.listener.default-shards by-thread # https://docs.haproxy.org/3.0/configuration.html#3.2-tune.listener.default-shards
 #   tune.listener.multi-queue fair # https://docs.haproxy.org/3.0/configuration.html#3.2-tune.listener.multi-queue
    #tune.applet.zero-copy-forwarding off

defaults
    load-server-state-from-file global
    maxconn 3000000
    mode tcp
    retries 3
    timeout check 11s
    timeout client 16s
    timeout connect 10s
    timeout server 16s

and there are zero performance issues, it runs essentially indistinguishable from the current nbproc config.

Then I copied my 3.0.9 binary into /usr/sbin and restarted haproxy, and instantly saw the saturation issue. This time I took a packet capture listening on “any”, and it looked surprisingly normal at first, just as if it were simply a lot of traffic. Keep in mind I see 10x network utilization at the host (ip/ethtool) level, and I now noticed there are a lot of drops on the RX side of the host interface.

Then I noticed these lines interleaved throughout:

440586	1.167270	6c:62:81:00:00:01		IPv4	2972	Bogus IPv4 version (15, must be 4)

Any idea what this traffic is? It only appears in the pcap when I’m running v3.0.9, and my capture filter is very simple, e.g. `tcpdump -i any -n 'host x.x.x.x and tcp port 443'`.

EDIT/UPDATE: I’ve confirmed that the traffic is legitimate (seen at the upstream switch interface), so I’ll continue to troubleshoot it there. Thanks for your help so far!

I should add that I changed no other config at all other than swapping binaries. The haproxy boxes in each rack are peered to the top of rack switch via BGP and advertise a single anycast address for this service, and are load balanced using ECMP. I checked the route table on the switch when advertising and not-advertising the address and the route table looks correct (one route to each SLB with the same cost); when swapping the binaries only, the route table looks the same.

After replacing the binary, check the -vv output and the libraries used:

$(which haproxy) -vv
ldd $(which haproxy)

Specifically regarding the SSL version, but other information can be useful as well.

From which package exactly? There is no 2.2 package available for Ubuntu 22.04:

What I am finding very strange is that the only thing I am doing to reproduce the behavior now is swapping the binary and reloading.

It’s possible I packaged this for us to get LTS HAProxy with nbproc on Ubuntu 22.04… :smiley:

dpkg-query -s haproxy
Package: haproxy
Status: install ok installed
Priority: optional
Section: net
Installed-Size: 3672
Maintainer: Debian HAProxy Maintainers <team+haproxy@tracker.debian.org>
Architecture: amd64
Version: 2.2.30-1ppa1~jammyubuntu1

But here’s the version/compilation info from both haproxy builds, as requested above:

From the 2.2.30 machine:

aaron.finney@hostname:~$ haproxy -vv
HA-Proxy version 2.2.30-1ppa1~jammyubuntu1 2023/08/01 - https://haproxy.org/
Status: long-term supported branch - will stop receiving fixes around Q2 2025.
Known bugs: http://www.haproxy.org/bugs/bugs-2.2.30.html
Running on: Linux 5.15.0-60-generic #66-Ubuntu SMP Fri Jan 20 14:29:49 UTC 2023 x86_64
Build options :
  TARGET  = linux-glibc
  CPU     = generic
  CC      = gcc
  CFLAGS  = -O2 -g -O2 -ffile-prefix-map=/home/aaron.finney/haproxy-testing/haproxy-2.2.30=. -flto=auto -ffat-lto-objects -flto=auto -ffat-lto-objects -fstack-protector-strong -Wformat -Werror=format-security -Wdate-time -D_FORTIFY_SOURCE=2 -Wall -Wextra -Wdeclaration-after-statement -fwrapv -Wno-address-of-packed-member -Wno-unused-label -Wno-sign-compare -Wno-unused-parameter -Wno-clobbered -Wno-missing-field-initializers -Wno-stringop-overflow -Wno-cast-function-type -Wtype-limits -Wshift-negative-value -Wshift-overflow=2 -Wduplicated-cond -Wnull-dereference
  OPTIONS = USE_PCRE2=1 USE_PCRE2_JIT=1 USE_OPENSSL=1 USE_LUA=1 USE_ZLIB=1 USE_SYSTEMD=1
  DEBUG   =

Feature list : -51DEGREES +ACCEPT4 +BACKTRACE -CLOSEFROM +CPU_AFFINITY +CRYPT_H -DEVICEATLAS +DL +EPOLL -EVPORTS +FUTEX +GETADDRINFO -KQUEUE +LIBCRYPT +LINUX_SPLICE +LINUX_TPROXY +LUA +NETFILTER +NS -OBSOLETE_LINKER +OPENSSL -PCRE +PCRE2 +PCRE2_JIT -PCRE_JIT +POLL +PRCTL -PRIVATE_CACHE -PTHREAD_PSHARED +RT -SLZ -STATIC_PCRE -STATIC_PCRE2 +SYSTEMD +TFO +THREAD +THREAD_DUMP +TPROXY -WURFL +ZLIB

Default settings :
  bufsize = 16384, maxrewrite = 1024, maxpollevents = 200

Built with multi-threading support (MAX_THREADS=64, default=64).
Built with OpenSSL version : OpenSSL 3.0.2 15 Mar 2022
Running on OpenSSL version : OpenSSL 3.0.2 15 Mar 2022
OpenSSL library supports TLS extensions : yes
OpenSSL library supports SNI : yes
OpenSSL library supports : TLSv1.0 TLSv1.1 TLSv1.2 TLSv1.3
Built with Lua version : Lua 5.3.6
Built with network namespace support.
Built with zlib version : 1.2.11
Running on zlib version : 1.2.11
Compression algorithms supported : identity("identity"), deflate("deflate"), raw-deflate("deflate"), gzip("gzip")
Built with transparent proxy support using: IP_TRANSPARENT IPV6_TRANSPARENT IP_FREEBIND
Built with PCRE2 version : 10.39 2021-10-29
PCRE2 library supports JIT : yes
Encrypted password support via crypt(3): yes
Built with gcc compiler version 11.4.0
Built with the Prometheus exporter as a service

Available polling systems :
      epoll : pref=300,  test result OK
       poll : pref=200,  test result OK
     select : pref=150,  test result OK
Total: 3 (3 usable), will use epoll.

Available multiplexer protocols :
(protocols marked as <default> cannot be specified using 'proto' keyword)
            fcgi : mode=HTTP       side=BE        mux=FCGI
       <default> : mode=HTTP       side=FE|BE     mux=H1
              h2 : mode=HTTP       side=FE|BE     mux=H2
       <default> : mode=TCP        side=FE|BE     mux=PASS

Available services : prometheus-exporter
Available filters :
	[SPOE] spoe
	[COMP] compression
	[TRACE] trace
	[CACHE] cache
	[FCGI] fcgi-app

aaron.finney@hostname:~$ ldd $(which haproxy)
	linux-vdso.so.1 (0x00007ffc98740000)
	libcrypt.so.1 => /lib/x86_64-linux-gnu/libcrypt.so.1 (0x00007f6ccce1d000)
	libz.so.1 => /lib/x86_64-linux-gnu/libz.so.1 (0x00007f6ccce01000)
	libssl.so.3 => /lib/x86_64-linux-gnu/libssl.so.3 (0x00007f6cccd5d000)
	libcrypto.so.3 => /lib/x86_64-linux-gnu/libcrypto.so.3 (0x00007f6ccc919000)
	liblua5.3.so.0 => /lib/x86_64-linux-gnu/liblua5.3.so.0 (0x00007f6ccc8dc000)
	libsystemd.so.0 => /lib/x86_64-linux-gnu/libsystemd.so.0 (0x00007f6ccc813000)
	libpcre2-8.so.0 => /lib/x86_64-linux-gnu/libpcre2-8.so.0 (0x00007f6ccc77c000)
	libgcc_s.so.1 => /lib/x86_64-linux-gnu/libgcc_s.so.1 (0x00007f6ccc75c000)
	libc.so.6 => /lib/x86_64-linux-gnu/libc.so.6 (0x00007f6ccc533000)
	libm.so.6 => /lib/x86_64-linux-gnu/libm.so.6 (0x00007f6ccc44c000)
	liblzma.so.5 => /lib/x86_64-linux-gnu/liblzma.so.5 (0x00007f6ccc421000)
	libzstd.so.1 => /lib/x86_64-linux-gnu/libzstd.so.1 (0x00007f6ccc350000)
	liblz4.so.1 => /lib/x86_64-linux-gnu/liblz4.so.1 (0x00007f6ccc330000)
	libcap.so.2 => /lib/x86_64-linux-gnu/libcap.so.2 (0x00007f6ccc325000)
	libgcrypt.so.20 => /lib/x86_64-linux-gnu/libgcrypt.so.20 (0x00007f6ccc1e7000)
	/lib64/ld-linux-x86-64.so.2 (0x00007f6ccd1ff000)
	libgpg-error.so.0 => /lib/x86_64-linux-gnu/libgpg-error.so.0 (0x00007f6ccc1c1000)
aaron.finney@hostname:~$

and from the machine running the 3.0.9 build:

aaron.finney@hostname:~$ haproxy -vv
HAProxy version 3.0.9-7f0031e 2025/03/20 - https://haproxy.org/
Status: long-term supported branch - will stop receiving fixes around Q2 2029.
Known bugs: http://www.haproxy.org/bugs/bugs-3.0.9.html
Running on: Linux 5.15.0-60-generic #66-Ubuntu SMP Fri Jan 20 14:29:49 UTC 2023 x86_64
Build options :
  TARGET  = linux-glibc
  CC      = cc
  CFLAGS  = -O2 -g -fwrapv
  OPTIONS = USE_PTHREAD_EMULATION=1 USE_OPENSSL_AWSLC=1 USE_LUA=1 USE_PROMEX=1
  DEBUG   =

Feature list : -51DEGREES +ACCEPT4 +BACKTRACE -CLOSEFROM +CPU_AFFINITY +CRYPT_H -DEVICEATLAS +DL -ENGINE +EPOLL -EVPORTS +GETADDRINFO -KQUEUE -LIBATOMIC +LIBCRYPT +LINUX_CAP +LINUX_SPLICE +LINUX_TPROXY +LUA +MATH -MEMORY_PROFILING +NETFILTER +NS -OBSOLETE_LINKER +OPENSSL +OPENSSL_AWSLC -OPENSSL_WOLFSSL -OT -PCRE -PCRE2 -PCRE2_JIT -PCRE_JIT +POLL +PRCTL -PROCCTL +PROMEX +PTHREAD_EMULATION -QUIC -QUIC_OPENSSL_COMPAT +RT +SHM_OPEN +SLZ +SSL -STATIC_PCRE -STATIC_PCRE2 +SYSTEMD +TFO +THREAD +THREAD_DUMP +TPROXY -WURFL -ZLIB

Default settings :
  bufsize = 16384, maxrewrite = 1024, maxpollevents = 200

Built with multi-threading support (MAX_TGROUPS=16, MAX_THREADS=256, default=192).
Built with OpenSSL version : OpenSSL 1.1.1 (compatible; AWS-LC 1.49.1)
Running on OpenSSL version : AWS-LC 1.49.1
OpenSSL library supports TLS extensions : yes
OpenSSL library supports SNI : yes
OpenSSL library supports : TLSv1.0 TLSv1.1 TLSv1.2 TLSv1.3
Built with Lua version : Lua 5.4.7
Built with the Prometheus exporter as a service
Built with network namespace support.
Built with libslz for stateless compression.
Compression algorithms supported : identity("identity"), deflate("deflate"), raw-deflate("deflate"), gzip("gzip")
Built with transparent proxy support using: IP_TRANSPARENT IPV6_TRANSPARENT IP_FREEBIND
Built without PCRE or PCRE2 support (using libc's regex instead)
Encrypted password support via crypt(3): yes
Built with gcc compiler version 11.4.0

Available polling systems :
      epoll : pref=300,  test result OK
       poll : pref=200,  test result OK
     select : pref=150,  test result OK
Total: 3 (3 usable), will use epoll.

Available multiplexer protocols :
(protocols marked as <default> cannot be specified using 'proto' keyword)
         h2 : mode=HTTP  side=FE|BE  mux=H2    flags=HTX|HOL_RISK|NO_UPG
  <default> : mode=HTTP  side=FE|BE  mux=H1    flags=HTX
         h1 : mode=HTTP  side=FE|BE  mux=H1    flags=HTX|NO_UPG
       fcgi : mode=HTTP  side=BE     mux=FCGI  flags=HTX|HOL_RISK|NO_UPG
  <default> : mode=TCP   side=FE|BE  mux=PASS  flags=
       none : mode=TCP   side=FE|BE  mux=PASS  flags=NO_UPG

Available services : prometheus-exporter
Available filters :
	[BWLIM] bwlim-in
	[BWLIM] bwlim-out
	[CACHE] cache
	[COMP] compression
	[FCGI] fcgi-app
	[SPOE] spoe
	[TRACE] trace

aaron.finney@hostname:~$ ldd $(which haproxy)
	linux-vdso.so.1 (0x00007ffd2459f000)
	libcrypt.so.1 => /lib/x86_64-linux-gnu/libcrypt.so.1 (0x00007f6df5941000)
	libm.so.6 => /lib/x86_64-linux-gnu/libm.so.6 (0x00007f6df585a000)
	libc.so.6 => /lib/x86_64-linux-gnu/libc.so.6 (0x00007f6df5631000)
	/lib64/ld-linux-x86-64.so.2 (0x00007f6df6350000)
aaron.finney@hostname:~$

I doubt 2.2.30 on OpenSSL 3 would have acceptable multithreading performance in the configuration you are running.

I cannot explain what you are seeing. There must be some wrong assumptions somewhere along the line, but I can’t put my finger on it.

When you say you are just replacing the binary, do you mean that literally, or do you mean you replace a package? I’m asking because you are showing dpkg-query -s haproxy, which suggests you do the latter, not the former.

Are you positive all binaries are always in the same path?
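
A quick way to double-check what systemd actually starts and what is on disk (plain standard commands, adapt as needed):

    # path the unit launches
    systemctl show -p ExecStart haproxy

    # path, checksum and version of the binary currently installed
    command -v haproxy
    sha256sum /usr/sbin/haproxy
    /usr/sbin/haproxy -v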

I don’t think you should reload, but restart instead for multiple reasons.

First of all, there have been plenty of reload bugs, including bugs that lead to a 100% CPU loop. You don’t want to mistake a different and irrelevant reload bug for a lack of SSL multithreading performance. You are exposing your tests to many unknown variables.

Reloading between binaries 8 major releases apart is just asking for trouble.

You want your tests to be as clean as possible.

Sorry, for the sake of brevity on this platform I probably left out too many details. I’ll give some more background.

The traffic hitting these proxies is for a public IP from the Internet, via anycast; any given service has the same public IP, which is specific to the site. The path is Internet → edge routing devices → spines → ToR (via ECMP) → HAProxy frontend (via ECMP) → backend. Everything is peered via BGP, and we are using an internally written BGP daemon on the HAProxy hosts (built on the gobgp library) to advertise/withdraw the routes to VIP addresses that are configured on the loopback.

I intentionally picked a site with lower production traffic volume to test HAProxy 3.0.8 once I validated it in fairly exhaustive local performance testing, so the current issues were a big surprise. :grin: Based on what you’ve said I’m not certain the multithreaded config would perform as well at the volume of traffic of our largest sites, but that’s TBD once we sort out this current issue.

In this current round of testing, my process is exactly this:

aaron.finney@vam-gp-aba35:~$ ./haproxy-2.2.3 -vv | grep 'HA-Proxy version'
HA-Proxy version 2.2.30-1ppa1~jammyubuntu1 2023/08/01 - https://haproxy.org/
aaron.finney@vam-gp-aba35:~$ date
Wed Apr 23 04:59:24 PM UTC 2025
aaron.finney@vam-gp-aba35:~$ sudo cp -f haproxy-2.2.3 /usr/sbin/haproxy
aaron.finney@vam-gp-aba35:~$ sudo systemctl restart haproxy
aaron.finney@vam-gp-aba35:~$
aaron.finney@vam-gp-aba35:~$
aaron.finney@vam-gp-aba35:~$ ./haproxy-3.0.9-aws-lc -vv | grep 'HAProxy version'
HAProxy version 3.0.9-7f0031e 2025/03/20 - https://haproxy.org/
aaron.finney@vam-gp-aba35:~$ sudo cp -f haproxy-3.0.9-aws-lc /usr/sbin/haproxy
aaron.finney@vam-gp-aba35:~$ sudo systemctl restart haproxy
aaron.finney@vam-gp-aba35:~$ date
Wed Apr 23 05:09:23 PM UTC 2025
aaron.finney@vam-gp-aba35:~$ sudo cp -f haproxy-2.2.3 /usr/sbin/haproxy
aaron.finney@vam-gp-aba35:~$ sudo systemctl restart haproxy
aaron.finney@vam-gp-aba35:~$ date
Wed Apr 23 05:14:06 PM UTC 2025
aaron.finney@vam-gp-aba35:~$

Here are graphs with correlating time marks showing the effect from the perspective of HAProxy metrics:

and here are graphs showing the effect from the perspective of the Linux OS (prom node exporter):

This is output traffic (Gbps) on the top of rack switch port that the HAProxy node is connected to:

and this is output traffic on the spine ports that interconnect that top of rack switch:

So the traffic is “real”; it’s just that if it all looks like this (not yet confirmed), then it’s something malformed/strange:

Those are all consequences of the multithreading issues with OpenSSL v3.0.

When the CPU is pinned at 100%, users will keep hitting refresh, applications will aggressively retry, and all traffic numbers will skyrocket.

I would suggest you validate aws-lc or wolfssl for production.

From above:

Built with multi-threading support (MAX_TGROUPS=16, MAX_THREADS=256, default=192).
Built with OpenSSL version : OpenSSL 1.1.1 (compatible; AWS-LC 1.49.1)
Running on OpenSSL version : AWS-LC 1.49.1

When I was using OpenSSL the CPUs were pegged at 100%; since moving to the AWS-LC library they are comparatively very low, but now we’re seeing this unexpected traffic and bursts/saturation on the NIC. There are no other changes to the config at all (or to upstream routing, I have confirmed this on all devices), just swapping out the binary and restarting as shown above.

Edit: the “bogus protocol” packets above were an artifact of my tcpdump grabbing both the tagged traffic on the trunk interface and the untagged traffic on the vlan interface due to using the -i any option :roll_eyes:
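
For anyone else hitting this: capturing on the specific VLAN/physical interface instead of “any” avoids the duplicated tagged/untagged frames, e.g. (interface name is illustrative):

    sudo tcpdump -i vlan100 -n 'host x.x.x.x and tcp port 443' -w tls-test.pcap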

Where I’m at now is that there is a legitimate/confirmed increase in load between a server running 2.2.3 and a server running 3.0.9 side by side with an equal opportunity to see traffic (ECMP). I’m currently chasing the idea that there’s possibly a difference in the mechanism handling connections/timeouts between the two that would explain the delta; as for 3.0.9 choking on the traffic when 2.2.3 does not, I’m investigating whether it has to do with how connection limits (e.g. maxconn) are applied differently with nbthread vs nbproc.

Can you upgrade to 3.0.10, just to make sure you have all the latest fixes?

A fix that stands out:

- the lock fairness improvements that reduce tail latencies on large AMD CPUs that had been merged into 3.1.6 were also backported since the gains were significant.

There are also some interesting epoll fixes, so the upgrade is worth it.

Thanks for all of the input and help. Compiling with the AWS-LC TLS library already had a huge impact on CPU, and it helped reveal the actual cause of the issue, which turned out to be changes to the default ALPN configuration on the bind directives.

From the v3.0 manual:

At the protocol layer, ALPN is required to enable HTTP/2 on an HTTPS frontend and HTTP/3 on a QUIC frontend. However, when such frontends have none of “npn”, “alpn” and “no-alpn” set, a default value of “h2,http/1.1” will be used for a regular HTTPS frontend, and “h3” for a QUIC frontend.

and from the v2.2 manual:

This enables the TLS ALPN extension and advertises the specified protocol list as supported on top of ALPN. The protocol list consists in a comma-delimited list of protocol names, for instance: “http/1.1,http/1.0” (without quotes).

What was happening in my case is that I’d put a single v3.0.9 proxy with an HTTPS frontend into production, alongside dozens of v2.2.3 proxies with that same HTTPS frontend configuration. Since new connections were being distributed equally to all proxies via ECMP, the single v3.0.9 proxy that was happily negotiating HTTP/2 connections started to rapidly accumulate orders of magnitude more connections than all of its neighbors, and quickly fell over.

The immediate remediation was adding no-alpn to the HTTPS frontend, which resolved the issue. The good news is this was great for gathering information about enabling HTTP/2 on our production HTTPS frontends; we feel like we understand the behavior much better and can start orienting towards enabling it everywhere, maybe using tune.h2.fe.max-total-streams to gradually expand it while watching for hotspots.
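
For reference, the interim change was just the bind-line keyword, and the tune knob is what we’re looking at for the gradual H2 rollout (the stream-limit value below is only a placeholder):

    frontend myservice443
        # keep clients on HTTP/1.1 for now, matching the 2.2.x fleet
        bind x.x.x.x:443 ssl crt /etc/ssl/mycompany/star_global_pub_priv no-alpn

    global
        # later: cap total requests per H2 connection so load re-balances
        # across proxies more often (value is a placeholder)
        tune.h2.fe.max-total-streams 1000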


Very interesting.

So I’m assuming ECMP in this case considers source and destination ports as well, not just source and destination IP?

Otherwise I don’t think it would have made this huge difference.

I’m tagging @willy because this is an interesting outcome.

Thanks a lot Lukas for the ping and Aaron for all the details! At first I thought the problem was related to negotiation issues which were not very clear to me, but actually it’s a matter of requests processed per proxy: based on your ECMP, each proxy should have roughly the same number of connections, except that the ones supporting H2 have many more requests per connection, hence a lot more work to do for an equal number of connections. Indeed, if a browser creates, say, 4 H1 connections, they are spread over 4 different LBs, whereas when it creates one H2 connection on a single LB (this one), it sends all its traffic there.

Also one thing to consider is how the traffic reaches the backend servers. I guess this is over SSL. As Lukas mentioned, OpenSSL 3.0 is a nightmare for performance, particularly on this side (have a look at the latest blog article on haproxy.com where we compare the libs’ performance). And with H2, since requests arrive in parallel, I’m pretty sure that you can have a few more parallel connections to the servers and more bursty traffic, which can increase the effort on OpenSSL.

Another point that comes to mind is the fact that errors and aborted transfers in HTTP/1 can only be signaled by closing the connection, while in H2 only the stream is closed and the connection is kept open. Thus as soon as your clients start to browse a bit quickly (hitting “Back” too fast, clicking “Stop”, etc.), the number of H1 connections that browsers maintain can diminish. This could result in the average number of H1 connections actually being lower than the number of long-lived idle H2 connections, despite the expected ratio of ~4:1.

Your approach of reducing the advertised number of parallel streams in a heterogeneous farm like this sounds very wise. I think that’s among the suggestions we could add to the wiki, maybe as advice for deployments and migrations. I must confess that this is a situation we had never thought about till now!

Oh, another point regarding your thread binding: I’d recommend against using “cpu-map … auto:”. The reason is that each thread will only have a single CPU. While this initially sounds desirable, it causes trouble because when that CPU is busy doing something else (e.g. running a ksoftirqd thread of your 100G NIC), the interrupted thread cannot make any forward progress. And if this happens while it holds a lock, all other threads waiting on the same lock suffer. And this gets worse during reloads, because the threads of both the old and the new process are tied to the same core, fighting against each other for CPU despite many others being available. When they’re loosely bound, the scheduler will simply migrate the thread to another CPU and let it run.

Thus I’d suggest dropping “auto” on the cpu-map directive.
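
i.e. keep the same thread/CPU sets but without the one-thread-per-CPU pinning, something like:

    global
        nbthread 64
        thread-groups 1
        # loose binding: any of the 64 threads may run on any of these CPUs
        cpu-map 1/1-64 0-63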

You can go further and use thread-groups + cpu-map (in this case, allowing all threads of a group to run on all CPUs of the same CCX). This would also significantly help for your listeners. On a 64-thread system, particularly if it’s an EPYC, at high connection rates you should see “native_queued_spin_lock_slowpath” taking most of your CPU in the kernel (sometimes even 80-90%). This happens when the same listening FD jumps from thread to thread with many threads and high latencies between some of them. Having multiple groups allows multiple listeners and can totally get rid of this unwanted locking overhead. 3.2 could help a lot for this (but then use -dev12 or preferably -dev15, which I’ll try to emit today; skip dev13 (broken) and dev14 (still a bit broken)).
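
As a sketch of the thread-groups variant (the CPU ranges are illustrative; map them to your real CCX layout with lscpu or hwloc):

    global
        nbthread 64
        thread-groups 8
        # one group per CCX, each group free to float within its 8 cores
        cpu-map 1/all 0-7
        cpu-map 2/all 8-15
        cpu-map 3/all 16-23
        cpu-map 4/all 24-31
        cpu-map 5/all 32-39
        cpu-map 6/all 40-47
        cpu-map 7/all 48-55
        cpu-map 8/all 56-63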

Great info, thank you both for so much interest and time! I’ll follow up again as we progress and try out your additional suggestions.