Haproxy always consumes 99%-100% CPU

Hello all,

I am using HAProxy as a reverse proxy, deployed via a Docker stack with a single replica. However, I'm encountering an issue where the process consistently shows 99%-100% CPU usage in the top command. Could you kindly advise if there are any configuration adjustments I should make to address this?

Could you also suggest how I should tune nbproc and nbthread?

My use case:

  • I have almost 2000 backends and 2 frontends
  • Concurrent requests: ~800

My System Configuration

Architecture: x86_64
CPU op-mode(s): 32-bit, 64-bit
Byte Order: Little Endian
CPU(s): 14
On-line CPU(s) list: 0-13
Thread(s) per core: 1
Core(s) per socket: 1
Socket(s): 14
NUMA node(s): 1
Vendor ID: GenuineIntel
CPU family: 6
Model: 79
Model name: Intel(R) Xeon(R) CPU E5-2690 v4 @ 2.60GHz
Stepping: 1
CPU MHz: 2596.992
BogoMIPS: 5193.98
Hypervisor vendor: VMware
Virtualization type: full
L1d cache: 32K
L1i cache: 32K
L2 cache: 256K
L3 cache: 35840K
NUMA node0 CPU(s): 0-13

Haproxy version details

HA-Proxy version 2.2.17-dd94a25 2021/09/07 - https://haproxy.org/
Status: long-term supported branch - will stop receiving fixes around Q2 2025.
Known bugs: http://www.haproxy.org/bugs/bugs-2.2.17.html
Running on: Linux 3.10.0-1160.88.1.el7.x86_64 #1 SMP Sat Feb 18 13:27:00 UTC 2023 x86_64
Build options :
TARGET = linux-glibc
CPU = generic
CC = gcc
CFLAGS = -O2 -g -Wall -Wextra -Wdeclaration-after-statement -fwrapv -Wno-address-of-packed-member -Wno-unused-label -Wno-sign-compare -Wno-unused-parameter -Wno-clobbered -Wno-missing-field-initializers -Wno-stringop-overflow -Wno-cast-function-type -Wtype-limits -Wshift-negative-value -Wshift-overflow=2 -Wduplicated-cond -Wnull-dereference


Default settings :
bufsize = 16384, maxrewrite = 1024, maxpollevents = 200

Built with multi-threading support (MAX_THREADS=64, default=14).
Built with OpenSSL version : OpenSSL 1.1.1k 25 Mar 2021
Running on OpenSSL version : OpenSSL 1.1.1k 25 Mar 2021
OpenSSL library supports TLS extensions : yes
OpenSSL library supports SNI : yes
OpenSSL library supports : TLSv1.0 TLSv1.1 TLSv1.2 TLSv1.3
Built with Lua version : Lua 5.3.3
Built with network namespace support.
Built with zlib version : 1.2.11
Running on zlib version : 1.2.11
Compression algorithms supported : identity("identity"), deflate("deflate"), raw-deflate("deflate"), gzip("gzip")
Built with transparent proxy support using: IP_TRANSPARENT IPV6_TRANSPARENT IP_FREEBIND
Built with PCRE2 version : 10.36 2020-12-04
PCRE2 library supports JIT : yes
Encrypted password support via crypt(3): yes
Built with gcc compiler version 10.2.1 20210110
Built with the Prometheus exporter as a service

Available polling systems :
epoll : pref=300, test result OK
poll : pref=200, test result OK
select : pref=150, test result OK
Total: 3 (3 usable), will use epoll.

Available multiplexer protocols :
(protocols marked as <default> cannot be specified using 'proto' keyword)
fcgi : mode=HTTP side=BE mux=FCGI
<default> : mode=HTTP side=FE|BE mux=H1
h2 : mode=HTTP side=FE|BE mux=H2
<default> : mode=TCP side=FE|BE mux=PASS

Available services : prometheus-exporter
Available filters :
[SPOE] spoe
[COMP] compression
[TRACE] trace
[CACHE] cache
[FCGI] fcgi-app

Haproxy Global & default config

    log /dev/log local1 info alert

    pidfile /run/haproxy.pid
    maxconn 8000
    nbproc 1
    nbthread 64
    lua-load /usr/local/share/lua/5.3/accessControl.lua

    # turn on stats unix socket
    stats socket /var/lib/haproxy/stats
    stats socket

    # utilize system-wide crypto-policies
    tune.ssl.default-dh-param 2048
    ssl-default-bind-ciphers TLS13-AES-256-GCM-SHA384:TLS13-AES-128-GCM-SHA256:TLS13-CHACHA20-POLY1305-SHA256:ECDHE-RSA-AES256-SHA384:ECDHE-RSA-AES128-SHA256:EECDH+AESGCM:EECDH+CHACHA20
    ssl-default-bind-options no-sslv3 no-tlsv10 no-tlsv11
    tune.bufsize 32768

    option forwardfor except
    mode http
    log global
    option httplog
    retries 3
    option http-keep-alive
    timeout http-request 10s
    timeout queue 10m
    timeout connect 10s
    timeout client 10m
    timeout server 10m
    timeout http-keep-alive 10s
    timeout check 10s
    log-format %[var(txn.identifier)]\ %ci:%cp\ %fi:%fp\ %b\ %si:%sp\ %ST\ %HM\ %HU
    balance roundrobin
    # never fail on address resolution
    default-server init-addr last,libc,none

100% CPU load is indicative of a major problem (a bug or an important misconfiguration), not a lack of tuning, especially at only 800 connections.

Configuring 64 threads when your CPU really only has 14 cores is very counterproductive.

Please remove the nbproc and nbthread directives from your configuration first of all.

The next thing is to check what your Lua script does exactly. It must not interact with the filesystem, for example; that will block the event loop while handling requests.

Please provide the configuration of your frontends and at least a few of the backends.

Finally, haproxy 2.2.17 is 2 years old and 279 known bugs have been fixed since then; it's possible you hit one of them. Check out the haproxy wiki for community-maintained repositories containing more up-to-date haproxy packages.
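For reference, a minimal global section without the thread overrides might look like the following (a sketch based on the directives the poster already shared; with nbproc/nbthread removed, haproxy 2.2 defaults to one process with one thread per available core, so 14 here):

```
global
    log /dev/log local1 info alert
    pidfile /run/haproxy.pid
    maxconn 8000
    lua-load /usr/local/share/lua/5.3/accessControl.lua
    stats socket /var/lib/haproxy/stats
```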

Thank you very much @lukastribus for the response.

Sorry, nbthread 64 was added by mistake; I was actually testing more with 8 or 16.

I had this problem even when nbproc/nbthread were not set in the configuration at all, and I thought I had to tune something, so I started trying these.

When I looked at my Lua script, I do have a small use case that reads the PEM certificates present in my directory:

-- read PEM files
function readAll(file)
    log("Reading file " .. file)
    local f = assert(io.open(file, "rb"))
    local content = f:read("*all")
    f:close()
    return content
end

and my frontend:

frontend https-frontend

bind :443 ssl crt ${conf_path}/certs/
bind :443 ssl crt ${conf_path}/ssl/haproxy_ssl.pem

sample backend:

backend myapp
server myapp0 myhost.com:443 ssl verify none check

Regarding updating to the latest haproxy, that requires a lot of testing, so I wanted to fix this problem first and then plan the upgrade gradually.

Could you please suggest something based on the above details?

Thanks a lot again :)

Like I said, you must not access the filesystem from within Lua scripts. This is very likely your root cause.

But you still haven't explained what your Lua script does, and you still did not share the full frontend configuration, other than a few standard configuration statements.

Okay, sure, here is my full configuration. In my Lua script I am trying to do some checks on the JWT and nothing more; to perform these checks I am relying on reading the certs from my disk.

So will this add additional latency while processing the requests? In my haproxy I always see the queue time as 0.

Thanks a lot again

function auth(txn, pem, issuer, tokenName, audience)
    -- Get JWT from cookie set on domain (eg.: .tnz.amadeus.net)
    local cookie_token = txn.sf:req_cook(tokenName)

    local validity, paccount = verifyJwt(cookie_token, pem, issuer, audience)

    if validity == true then
        txn:set_var("txn.authorized", true)
        txn:set_var("txn.identifier", paccount)
        return
    end

    txn:set_var("txn.authorized", false)
end

function simpleAuth(txn)
    auth(txn, config.publicKey, config.issuer, config.tokenName, config.audience)
end


# Global settings
log /dev/log local1 info alert

pidfile /run/haproxy.pid
maxconn 8000
lua-load /usr/local/share/lua/5.3/accessControl.lua

# turn on stats unix socket
stats socket /var/lib/haproxy/stats
stats socket

# utilize system-wide crypto-policies
tune.ssl.default-dh-param 2048
ssl-default-bind-ciphers TLS13-AES-256-GCM-SHA384:TLS13-AES-128-GCM-SHA256:TLS13-CHACHA20-POLY1305-SHA256:ECDHE-RSA-AES256-SHA384:ECDHE-RSA-AES128-SHA256:EECDH+AESGCM:EECDH+CHACHA20
ssl-default-bind-options no-sslv3 no-tlsv10 no-tlsv11
tune.bufsize 32768

setenv ISSUER https://auth.myapp.com/auth/realms/rtr
setenv AUDIENCE account
setenv PUBKEY /etc/haproxy/pem/rtr.pem
setenv TOKEN_NAME token-rtr

# common defaults that all the 'listen' and 'backend' sections will
# use if not designated in their block
option forwardfor except
mode http
log global
option httplog
retries 3
option http-server-close
timeout http-request 10s
timeout queue 10m
timeout connect 10s
timeout client 10m
timeout server 10m
timeout http-keep-alive 10s
timeout check 10s
log-format %[var(txn.identifier)]\ %ci:%cp\ %fi:%fp\ %b\ %si:%sp\ %ST\ %HM\ %HU
balance roundrobin

# never fail on address resolution
default-server init-addr last,libc,none


# stats page
listen stats
bind :9090
mode http
stats enable # Enable stats page
stats hide-version # Hide HAProxy version
stats uri /


# main frontend which proxies to the backends
frontend http-frontend
bind :80
redirect scheme https if !{ ssl_fc } !{ hdr(host) test.myapp.com }
use_backend test if { hdr(host) test.myapp.com }

frontend https-frontend
bind :443 ssl crt /etc/haproxy/certs/

bind :443 ssl crt /etc/haproxy/ssl/haproxy_ssl.pem


# Default list of internal micro service backends


use_backend sirius if { hdr(host),lower,word(1,':') sirius.myapp.com }

backend withlogin
http-request lua.simpleAuth
acl cookie var(txn.authorized) -m bool
acl apikey req.hdr(MY-Key) -m str -f /etc/haproxy/keys/cb-exporter
http-request set-var(txn.identifier) str(cb-exporter) if apikey
http-request redirect code 302 location https://login.myapp.com?next_url=%[hdr(host)]%[capture.req.uri,regsub(&,%26,g)] unless cookie || apikey
cookie SERVERID insert indirect nocache
server withlogin0 myhost.com:18091 ssl verify none check cookie withlogin0
backend myapp
server myapp0 myhost.com:443 ssl verify none check

Which you must absolutely not do.

You are reading directories full of files from a Lua script for every single HTTP request. This blocks the event loop and saturates the CPU.

You need to go back to the drawing board.

How such things are usually accomplished is by accessing, for example, a Redis database from within a haproxy Lua script; the Redis database is then filled by external processes. External HTTP requests are also possible (they are non-blocking).
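A simpler variant of the same idea, sketched as my own illustration (the function names and path are assumptions, not the poster's code): if the certificates only change when haproxy reloads, read each PEM once at lua-load time and let the per-request code do only a table lookup, which never touches the filesystem:

```lua
-- Hypothetical sketch: cache PEM contents when the script is loaded
-- (lua-load time), so no io.open() ever runs inside the request path.
pem_cache = {}

function cache_pem(name, path)
    -- Executed once, at startup, outside of any request.
    local f = assert(io.open(path, "rb"))
    pem_cache[name] = f:read("*all")
    f:close()
end

-- Per-request lookup: a plain table access, non-blocking.
function get_pem(name)
    return pem_cache[name]
end

-- At load time, populate the cache, e.g.:
-- cache_pem("rtr", "/etc/haproxy/pem/rtr.pem")
```

The auth action would then call get_pem() instead of reading the file on every request.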

But one thing: not all of the backends involve Lua execution, only about 100 of the 2000. Do you still think that will cause the problem?

My main problem was that the proxy sometimes added extra latency while processing requests, which is why I started looking at all these options to tune haproxy.

Absolutely, yes. A few requests per second will suffice to cause mayhem.

Please see:


Thanks a lot, @lukastribus for your suggestion, I will go through this in detail

@lukastribus, is there any way I can measure through a command where it's getting blocked, etc.?

I suppose you can trace your syscalls with strace -tt. But unless you are an expert C developer, interpreting the results and their impact on a multi-threaded C application is likely very difficult.

I think this is a pointless exercise. You are facing the precise reason those functions are not allowed in haproxy in the first place.

Your time is better spent finding a supported alternative.

Hello @lukastribus ,

Thank you once again for your guidance. Following your recommendations, I’ve eliminated the use of io.open within the Lua scripts.

In order to validate this change, I replaced the dynamic file reading with a direct hardcoding of the variable.

However, I haven't observed any noticeable performance improvement.
By the way, when I mention that the CPU is running at 100%, I've been checking this through commands like top or htop. Below is a sample excerpt from htop on the HAProxy server; in it, you can see that the first row displays a value of 171. Am I misinterpreting this information? Please share your opinion on this.

It is running 14 threads, which matches the stats page as well.


7859 root 20 0 1028M 272M 8424 S 171. 0.5 1h04:14 haproxy -W -db -W -f /etc/haproxy/haproxy.cfg
7866 root 20 0 1028M 272M 8424 S 16.3 0.5 4:43.33 haproxy -W -db -W -f /etc/haproxy/haproxy.cfg
7860 root 20 0 1028M 272M 8424 S 15.0 0.5 4:44.11 haproxy -W -db -W -f /etc/haproxy/haproxy.cfg
7872 root 20 0 1028M 272M 8424 S 13.7 0.5 4:48.43 haproxy -W -db -W -f /etc/haproxy/haproxy.cfg
7862 root 20 0 1028M 272M 8424 S 13.1 0.5 4:37.71 haproxy -W -db -W -f /etc/haproxy/haproxy.cfg
7863 root 20 0 1028M 272M 8424 S 13.1 0.5 4:22.80 haproxy -W -db -W -f /etc/haproxy/haproxy.cfg
7870 root 20 0 1028M 272M 8424 S 13.1 0.5 4:28.13 haproxy -W -db -W -f /etc/haproxy/haproxy.cfg
7861 root 20 0 1028M 272M 8424 S 11.8 0.5 4:12.72 haproxy -W -db -W -f /etc/haproxy/haproxy.cfg
7864 root 20 0 1028M 272M 8424 R 11.8 0.5 4:26.16 haproxy -W -db -W -f /etc/haproxy/haproxy.cfg
7865 root 20 0 1028M 272M 8424 S 11.8 0.5 4:29.60 haproxy -W -db -W -f /etc/haproxy/haproxy.cfg
7868 root 20 0 1028M 272M 8424 S 11.8 0.5 4:24.45 haproxy -W -db -W -f /etc/haproxy/haproxy.cfg
7867 root 20 0 1028M 272M 8424 S 11.1 0.5 5:05.14 haproxy -W -db -W -f /etc/haproxy/haproxy.cfg
7871 root 20 0 1028M 272M 8424 R 9.2 0.5 4:52.56 haproxy -W -db -W -f /etc/haproxy/haproxy.cfg
7869 root 20 0 1028M 272M 8424 S 8.5 0.5 4:19.61 haproxy -W -db -W -f /etc/haproxy/haproxy.cfg

Haproxy Stats

pid = 37 (process #1, nbproc = 1, nbthread = 14)
uptime = 0d 0h44m07s
system limits: memmax = unlimited; ulimit-n = 19559
maxsock = 19559; maxconn = 8000; maxpipes = 0
current conns = 644; current pipes = 0/0; conn rate = 53/sec; bit rate = 16.039 Mbps
Running tasks: 1/7358; idle = 87 %

Thanks a lot in advance !!


It consumed more than one CPU core's worth of CPU time; that's why it goes over 100%.

It's still important to keep the filesystem access out of the Lua script, even though that did not solve the problem.

You likely hit a bug.

I suggest you first upgrade to the latest 2.2 bugfix release. This doesn't require any testing at all; going from 2.2.17 to 2.2.31 you only pick up bugfixes.

You can find a link to RHEL builds in my first response.

Okay, sure, thank you. I will upgrade to 2.2.31 and try that.

Hello @lukastribus

I initially gave it a shot with 2.2.31 and it was still the same.
Today we tried version 2.8.3, but we are still seeing the same issue.

Again, when I say it's high CPU usage, it's from the htop/top command, and usually the first haproxy entry has CPU usage > 100%.

Is there any way to investigate this further?

Thanks a lot.


On the haproxy admin socket, run the following and provide the outputs:

show info
show pools
show activity

When haproxy is in the high-CPU condition, enable profiling on the admin socket:
set profiling tasks on

Wait for a few seconds then disable it:
set profiling tasks off

Now you can get the report for the given observation period using
show profiling tasks

Also please run perf top on your machine.
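For convenience, the whole sequence above can be driven with socat, assuming the stats socket path from the global section earlier in the thread and that socat is installed (both assumptions on my part):

```shell
# Path taken from the "stats socket" line in the global section;
# note the socket may need "level admin" for the "set profiling" commands.
SOCK=/var/lib/haproxy/stats

echo "show info"     | socat stdio "$SOCK"
echo "show pools"    | socat stdio "$SOCK"
echo "show activity" | socat stdio "$SOCK"

# While the CPU is pegged, sample task profiling for a few seconds:
echo "set profiling tasks on"  | socat stdio "$SOCK"
sleep 10
echo "set profiling tasks off" | socat stdio "$SOCK"
echo "show profiling tasks"    | socat stdio "$SOCK"
```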