Ksoftirqd - high cpu and haproxy poor performance

Hello, while testing a new haproxy cluster I'm now facing a performance problem.
I'll try to provide as much detail as I can about the hardware, kernel and haproxy configuration.

The main issues I see are:

  1. high cpu usage (100%) of ksoftirqd
  2. haproxy is 100% utilized very fast with just

[Screenshot from 2019-04-15 09-53-16]

Haproxy version:
/ # haproxy -vv
HA-Proxy version 1.9.6 2019/03/29 - https://haproxy.org/
Build options :
TARGET = linux2628
CPU = generic
CC = gcc
CFLAGS = -O2 -g -fno-strict-aliasing -Wdeclaration-after-statement -fwrapv -Wno-format-truncation -Wno-unused-label -Wno-sign-compare -Wno-unused-parameter -Wno-old-style-declaration -Wno-ignored-qualifiers -Wno-clobbered -Wno-missing-field-initializers -Wno-implicit-fallthrough -Wno-stringop-overflow -Wno-cast-function-type -Wtype-limits -Wshift-negative-value -Wshift-overflow=2 -Wduplicated-cond -Wnull-dereference
OPTIONS = USE_ZLIB=1 USE_OPENSSL=1 USE_LUA=1 USE_PCRE=1

Default settings :
maxconn = 2000, bufsize = 16384, maxrewrite = 1024, maxpollevents = 200

Built with OpenSSL version : OpenSSL 1.1.1b 26 Feb 2019
Running on OpenSSL version : OpenSSL 1.1.1b 26 Feb 2019
OpenSSL library supports TLS extensions : yes
OpenSSL library supports SNI : yes
OpenSSL library supports : TLSv1.0 TLSv1.1 TLSv1.2 TLSv1.3
Built with Lua version : Lua 5.3.5
Built with transparent proxy support using: IP_TRANSPARENT IPV6_TRANSPARENT IP_FREEBIND
Built with zlib version : 1.2.11
Running on zlib version : 1.2.11
Compression algorithms supported : identity("identity"), deflate("deflate"), raw-deflate("deflate"), gzip("gzip")
Built with PCRE version : 8.42 2018-03-20
Running on PCRE version : 8.42 2018-03-20
PCRE library supports JIT : no (USE_PCRE_JIT not set)
Encrypted password support via crypt(3): yes
Built with multi-threading support.

Available polling systems :
epoll : pref=300, test result OK
poll : pref=200, test result OK
select : pref=150, test result OK
Total: 3 (3 usable), will use epoll.

Available multiplexer protocols :
(protocols marked as <default> cannot be specified using 'proto' keyword)
h2 : mode=HTX side=FE|BE
h2 : mode=HTTP side=FE
<default> : mode=HTX side=FE|BE
<default> : mode=TCP|HTTP side=FE|BE

Available filters :
[SPOE] spoe
[COMP] compression
[CACHE] cache
[TRACE] trace

OS and Kernel:
Ubuntu 18.04.2
Kernel: 4.15.0-46-generic

HW:
Manufacturer: Dell Inc.
Product Name: PowerEdge R640

CPU:
2X Version: Intel(R) Xeon(R) Gold 5115 CPU @ 2.40GHz

kernel CPU map:

cat /sys/bus/cpu/devices/cpu0/topology/core_siblings_list
0,2,4,6,8,10,12,14,16,18,20,22,24,26,28,30,32,34,36,38

cat /sys/bus/cpu/devices/cpu1/topology/core_siblings_list
1,3,5,7,9,11,13,15,17,19,21,23,25,27,29,31,33,35,37,39

Sysctl and security limits:

ulimits:
      - nofile:250000:250000

    sysctls:
      net.ipv4.conf.all.rp_filter: 1
      net.core.somaxconn: 65534
      net.core.netdev_max_backlog: 100000
      net.ipv4.ip_local_port_range: 1025 65000
      net.ipv4.conf.all.send_redirects: 1
      net.ipv4.ip_nonlocal_bind: 1
      net.ipv4.tcp_abort_on_overflow: 0
      net.ipv4.tcp_fin_timeout: 10
      net.ipv4.tcp_keepalive_time: 300
      net.ipv4.tcp_max_orphans: 262144
      net.ipv4.tcp_max_syn_backlog: 100000
      net.ipv4.tcp_max_tw_buckets: 262144
      net.ipv4.tcp_rmem: 4096 16060 64060
      net.ipv4.tcp_wmem: 4096 16384 262144
      net.ipv4.tcp_reordering: 3
      net.ipv4.tcp_synack_retries: 3
      net.ipv4.tcp_syncookies: 1
      net.ipv4.tcp_syn_retries: 5
      net.ipv4.tcp_timestamps: 0
      net.ipv4.tcp_tw_reuse: 1
      net.netfilter.nf_conntrack_max: 10485760
      net.netfilter.nf_conntrack_tcp_timeout_fin_wait: 30
      net.netfilter.nf_conntrack_tcp_timeout_time_wait: 15

Current haproxy configuration (I tried different nbproc/nbthread values and got the same behavior):

global
  nbproc 1
  nbthread 10
#  cpu-map auto:1/1-4 0-3
  cpu-map odd 1-20
  tune.http.logurilen 65535

  tune.ssl.default-dh-param  2048
  ssl-default-bind-ciphers TLS13-AES-256-GCM-SHA384:TLS13-AES-128-GCM-SHA256:TLS13-CHACHA20-POLY1305-SHA256:EECDH+AESGCM:EECDH+CHACHA20
  ssl-default-bind-options no-sslv3 no-tlsv10 no-tlsv11
  ca-base /etc/ssl/certs
  crt-base /etc/ssl/private
  maxconn 500000

defaults
    mode                    http
    option http-server-close
    option forwardfor       except 127.0.0.0/8
    option                  redispatch
    option                  abortonclose
    retries                 3
    timeout http-request    10s
    timeout queue           2s
    timeout connect         5s
    timeout client          2m
    timeout server          2m
    timeout http-keep-alive 10s
    timeout check           5s
    maxconn                 1000

Network:

eno1np0

eno2np1

The test uses a plain nginx as the backend and ApacheBench (ab) as the client, running from another node. I've tried different tests…
ab -k -r -c 1000 -n 1000000

I'm almost sure this issue has something to do with the network IRQs.
As described in the haproxy docs, I would like to bind the network interfaces to the same CPU haproxy runs on (CPU0), but to the cores which haproxy does not use according to the cpu-map in the global configuration.

Since haproxy only uses the "odd" cores, all "even" cores should be available to the network interfaces,
at least according to my understanding.

Nothing I tried made this work any better.
I need advice on how to optimize the kernel… I'm expecting this hardware to handle much more traffic.

______________________________________________________________-
Update:
I also tried running haproxy with nbthread 12, bound to the first 12 even cores:

cpu-map even 2-24

Then I tried to set the smp_affinity of the network IRQs:

echo "2" > /proc/irq/43/smp_affinity
echo "8" > /proc/irq/44/smp_affinity
echo "20" > /proc/irq/45/smp_affinity
echo "80" > /proc/irq/46/smp_affinity
echo "200" > /proc/irq/126/smp_affinity
echo "800" > /proc/irq/127/smp_affinity
echo "2000" > /proc/irq/128/smp_affinity
echo "8000" > /proc/irq/129/smp_affinity

I also disabled and removed the irqbalance service and rebooted.
I still see high ksoftirqd CPU usage and poor haproxy performance…
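
For reference, each value written to smp_affinity is a hexadecimal bitmask in which bit N selects logical CPU N. Below is a minimal sketch of how such a mask can be built from a comma-separated CPU list. The helper name cpus_to_mask is made up for illustration, and the sketch assumes fewer than 63 logical CPUs and a POSIX shell with 64-bit arithmetic (bash or dash on a 64-bit system).

```shell
# Hypothetical helper: turn a comma-separated CPU list into an
# smp_affinity-style hex mask (32-bit groups separated by commas).
# Assumes CPU numbers below 63 and 64-bit shell arithmetic.
cpus_to_mask() {
    value=0
    # Set one bit per CPU number in the list.
    for cpu in $(printf '%s' "$1" | tr ',' ' '); do
        value=$((value | (1 << cpu)))
    done
    high=$((value >> 32))          # CPUs 32 and above
    low=$((value & 0xffffffff))    # CPUs 0 to 31
    if [ "$high" -ne 0 ]; then
        printf '%x,%08x\n' "$high" "$low"
    else
        printf '%x\n' "$low"
    fi
}

cpus_to_mask 1     # prints 2
cpus_to_mask 15    # prints 8000
# The cpu0 core_siblings_list shown earlier in this post:
cpus_to_mask 0,2,4,6,8,10,12,14,16,18,20,22,24,26,28,30,32,34,36,38   # prints 55,55555555
```

On kernels that provide it, /proc/irq/<irq>/smp_affinity_list accepts a CPU list directly, which avoids the mask arithmetic altogether.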

Mailing list thread:

https://www.mail-archive.com/haproxy@formilux.org/msg33263.html

It seems that after running:

echo "2" > /proc/irq/43/smp_affinity
echo "8" > /proc/irq/44/smp_affinity
echo "20" > /proc/irq/45/smp_affinity
echo "80" > /proc/irq/46/smp_affinity
echo "200" > /proc/irq/126/smp_affinity
echo "800" > /proc/irq/127/smp_affinity
echo "2000" > /proc/irq/128/smp_affinity
echo "8000" > /proc/irq/129/smp_affinity

The values are now:
# cat /proc/irq/43/smp_affinity
00,00000002
# cat /proc/irq/44/smp_affinity
00,00000008
# cat /proc/irq/45/smp_affinity
00,00000020
# cat /proc/irq/46/smp_affinity
00,00000080
# cat /proc/irq/126/smp_affinity
00,00000200
# cat /proc/irq/127/smp_affinity
00,00000800
# cat /proc/irq/128/smp_affinity
00,00002000
# cat /proc/irq/129/smp_affinity
00,00008000

and I'm not sure these point to the requested cores, or whether they are even valid.
Can you please advise on how to set them to all the odd cores of cpu 0?
cat /sys/bus/cpu/devices/cpu0/topology/core_siblings_list
0,2,4,6,8,10,12,14,16,18,20,22,24,26,28,30,32,34,36,38
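
One way to check what the values read back from /proc/irq/<irq>/smp_affinity mean is to decode them into CPU numbers. This is only a sketch: the helper name mask_to_cpus is made up for illustration, and it assumes a mask that fits in 63 bits and a POSIX shell with 64-bit arithmetic.

```shell
# Hypothetical helper: decode an smp_affinity hex mask (commas allowed)
# into a comma-separated list of logical CPU numbers.
# Assumes the mask fits in 63 bits.
mask_to_cpus() {
    hex=$(printf '%s' "$1" | tr -d ',')   # drop the 32-bit group separators
    value=$((0x$hex))
    cpu=0
    out=""
    # Walk the mask one bit at a time, lowest bit (CPU 0) first.
    while [ "$value" -ne 0 ]; do
        if [ $((value & 1)) -eq 1 ]; then
            out="${out:+$out,}$cpu"
        fi
        value=$((value >> 1))
        cpu=$((cpu + 1))
    done
    printf '%s\n' "$out"
}

mask_to_cpus 00,00000002   # prints 1
mask_to_cpus 00,00008000   # prints 15
```

Decoded this way, the eight values above correspond to logical CPUs 1, 3, 5, 7, 9, 11, 13 and 15, which appear in the core_siblings_list of cpu1 rather than that of cpu0.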