Hi all, I’m working on porting existing HAProxy instances from 2.2.3 using nbproc to 3.0.8 with nbthread, and I'm hitting a TLS performance wall.
The hardware and underlying OS are identical; I’m just trying to update to a newer version and use a supported config:
- AMD EPYC 9654P 96-core processor (SMT enabled, no NUMA), 3.7GHz, 768GB RAM
- Mellanox MT28908 ConnectX-6 NIC, 100Gb DAC, FEC enabled
- Ubuntu 22.04 (Linux version 5.15.0-60-generic)
The 2.2.3 config is simply using nbproc set to 64 with no other CPU config:
global
user haproxy
group haproxy
daemon
maxconn 450000
nbproc 64
server-state-file /tmp/haproxybackendstate
set-dumpable
ssl-dh-param-file /etc/ssl/ttd/dh_params
ssl-mode-async
stats maxconn 200
tune.bufsize 32768
tune.comp.maxlevel 2
tune.ssl.cachesize 1000000
tune.ssl.default-dh-param 2048
defaults
load-server-state-from-file global
maxconn 450000
mode tcp
retries 3
timeout check 11s
timeout client 16s
timeout connect 10s
timeout server 16s
and the 3.0.8 config I’ve ended up with for testing effectively tries to mirror that:
global
user haproxy
group haproxy
daemon
nbthread 64
thread-groups 1
cpu-map auto:1/1-64 0-63
maxconn 450000
server-state-file /tmp/haproxybackendstate
set-dumpable
ssl-dh-param-file /etc/ssl/ttd/dh_params
ssl-mode-async
stats maxconn 200
stats socket /var/run/haproxystats.sock mode 600 level admin expose-fd listeners
tune.bufsize 32768
tune.comp.maxlevel 2
tune.ssl.cachesize 1000000
tune.ssl.default-dh-param 2048
tune.listener.default-shards by-thread # https://docs.haproxy.org/3.0/configuration.html#3.2-tune.listener.default-shards
tune.listener.multi-queue fair # https://docs.haproxy.org/3.0/configuration.html#3.2-tune.listener.multi-queue
defaults
load-server-state-from-file global
maxconn 450000
mode tcp
retries 3
timeout check 11s
timeout client 16s
timeout connect 10s
timeout server 16s
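A note on the cpu-map line in global above: since the box has SMT enabled, whether logical CPUs 0-63 map to 64 distinct physical cores or include SMT siblings depends on how the kernel enumerates them. I sanity-check that with standard lscpu/sysfs output, roughly:
lscpu -e=CPU,CORE,SOCKET,NODE                                      # logical CPU -> physical core mapping
cat /sys/devices/system/cpu/cpu0/topology/thread_siblings_list    # SMT sibling(s) of CPU 0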
the front/backend configs are the same for both:
frontend myservice443
bind x.x.x.x:443 ssl crt /etc/ssl/mycompany/star_global_pub_priv
default_backend myservice443
mode http
option forwardfor if-none
frontend myservice80
bind x.x.x.x:80
default_backend myservice80
mode http
option forwardfor if-none
backend myservice443
balance random
http-check expect status 200
http-check send-state
mode http
option httpchk GET /service/health?from=lb
server backendpod1 172.18.95.118:443 ca-file /etc/ssl/mycompany/star_global_cert_ca check fall 4 inter 5s rise 3 slowstart 15s ssl verify none weight 100
server backendpod2 172.18.89.81:443 ca-file /etc/ssl/mycompany/star_global_cert_ca check fall 4 inter 5s rise 3 slowstart 15s ssl verify none weight 100
server backendpod3 172.18.85.74:443 ca-file /etc/ssl/mycompany/star_global_cert_ca check fall 4 inter 5s rise 3 slowstart 15s ssl verify none weight 100
server backendpod4 172.18.93.160:443 ca-file /etc/ssl/mycompany/star_global_cert_ca check fall 4 inter 5s rise 3 slowstart 15s ssl verify none weight 100
backend myservice80
balance random
http-check expect status 200
http-check send-state
mode http
option httpchk GET /service/health?from=lb
server backendpod1 172.18.95.118:80 check fall 4 inter 5s rise 3 slowstart 15s weight 100
server backendpod2 172.18.89.81:80 check fall 4 inter 5s rise 3 slowstart 15s weight 100
server backendpod3 172.18.85.74:80 check fall 4 inter 5s rise 3 slowstart 15s weight 100
server backendpod4 172.18.93.160:80 check fall 4 inter 5s rise 3 slowstart 15s weight 100
(although I’ve also tried 3 thread groups with 64 threads in each, as sketched below)
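That variant's global section looked roughly like this (a sketch from memory; the exact cpu-map ranges may have differed slightly):
nbthread 192
thread-groups 3
cpu-map auto:1/1-64 0-63
cpu-map auto:2/1-64 64-127
cpu-map auto:3/1-64 128-191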
and finally, these are the kernel tuning parameters for both systems:
linux_sysctl:
net.ipv4.conf.all.proxy_arp: 0
net.ipv4.tcp_window_scaling: 1
net.ipv4.tcp_fin_timeout: 10
net.ipv4.ip_forward: 1
net.ipv4.conf.all.rp_filter: 2
net.ipv4.conf.default.rp_filter: 2
fs.file-max: 5000000
fs.nr_open: 5000000
net.ipv4.tcp_max_syn_backlog: 3240000
net.core.somaxconn: 100000
net.core.netdev_max_backlog: 100000
net.ipv4.ip_local_port_range: "{{ 1024 + ansible_processor_vcpus }} 65535"
net.netfilter.nf_conntrack_buckets: 425440
net.netfilter.nf_conntrack_max: 10035200
net.netfilter.nf_conntrack_tcp_timeout_close_wait: 20
net.netfilter.nf_conntrack_tcp_timeout_fin_wait: 20
net.netfilter.nf_conntrack_tcp_timeout_time_wait: 20
net.ipv4.tcp_max_orphans: 5000000
net.ipv4.conf.all.arp_ignore: 1
net.ipv4.conf.default.arp_ignore: 1
net.ipv4.conf.all.arp_announce: 2
net.ipv4.conf.default.arp_announce: 2
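(With 192 logical CPUs on this box, that templated ip_local_port_range line should render to roughly:)
net.ipv4.ip_local_port_range: "1216 65535"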
The behavior is that within seconds of turning on traffic to the instance (via a rack-level routing change):
- 4xx and 5xx errors start climbing
- CPU utilization climbs until it hits 100% on all assigned processors (see the perf note after this list)
- If I disable the TLS frontend, the non-TLS proxy works fine with no performance issues
- If I enable the TLS frontend, neither proxy is healthy (the 4xx and 5xx errors don’t spike as high on the non-TLS service, but it never reaches healthy traffic levels)
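If it would help, I can capture profiling data during the saturation; the command I’d run is roughly this (a sketch; it assumes a single haproxy worker process, which should be the case here since the config is plain daemon mode without master-worker):
perf top -p "$(pidof haproxy)"   # run during the CPU spike to see where the cycles go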
I have what is essentially an ideal production testing environment: these SLBs peer via BGP with the top-of-rack switch and receive effectively equal amounts of traffic. In this rack there are two SLBs total, one running the legacy 2.2.3 version with nbproc and the other running 3.0.8 with the nbthread config. The cert being used is a 2048-bit wildcard.
Any initial thoughts? My initial investigation seems to point to the receive queues being overloaded. I’ve tried things like setting maxconn to high values (e.g. 2M) and doing the same with tune.ssl.cachesize (e.g. 64M), since those are shared across all threads.
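For concreteness, those high-value experiments were along these lines in the global section (numbers approximate):
maxconn 2000000               # up from 450000
tune.ssl.cachesize 64000000   # the "64M" mentioned above
and the receive-queue suspicion is based on the usual listen drop/overflow counters, roughly (a sketch, not the exact commands I ran):
nstat -az TcpExtListenDrops TcpExtListenOverflows   # kernel accept-queue drops/overflows
ss -ltn                                             # Recv-Q vs Send-Q on the listening sockets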