Hi,
I’m seeing HAProxy spike to 100% CPU usage across its 4 allocated CPUs at seemingly random times. The spikes don’t appear to have any correlation with restarts (of which around 5-10 occur per day). We initially saw this when running 2.3.14, and after updating to 2.4.7 we have the same problem. It’s fairly erratic: we may go weeks with no issues, only to see it occur several times over a day or two. Restarting the HAProxy process does ‘resolve’ the problem, though the fix may last only a short period before it happens again, or the process may run normally for days or weeks.
I’ve included some debugging information that will hopefully prove useful.
Version information
$ haproxy -vvv
HAProxy version 2.4.7-b5e51a5 2021/10/04 - https://haproxy.org/
Status: long-term supported branch - will stop receiving fixes around Q2 2026.
Known bugs: http://www.haproxy.org/bugs/bugs-2.4.7.html
Running on: Linux 4.19.0-17-amd64 #1 SMP Debian 4.19.194-3 (2021-07-18) x86_64
Build options :
TARGET = linux-glibc
CPU = generic
CC = cc
CFLAGS = -O2 -g -Wall -Wextra -Wdeclaration-after-statement -fwrapv -Wno-address-of-packed-member -Wno-unused-label -Wno-sign-compare -Wno-unused-parameter -Wno-clobbered -Wno-missing-field-initializers -Wno-cast-function-type -Wtype-limits -Wshift-negative-value -Wshift-overflow=2 -Wduplicated-cond -Wnull-dereference
OPTIONS = USE_PCRE=1 USE_OPENSSL=1 USE_LUA=yes USE_ZLIB=1 USE_PROMEX=1
DEBUG =
Feature list : +EPOLL -KQUEUE +NETFILTER +PCRE -PCRE_JIT -PCRE2 -PCRE2_JIT +POLL -PRIVATE_CACHE +THREAD -PTHREAD_PSHARED +BACKTRACE -STATIC_PCRE -STATIC_PCRE2 +TPROXY +LINUX_TPROXY +LINUX_SPLICE +LIBCRYPT +CRYPT_H +GETADDRINFO +OPENSSL +LUA +FUTEX +ACCEPT4 -CLOSEFROM +ZLIB -SLZ +CPU_AFFINITY +TFO +NS +DL +RT -DEVICEATLAS -51DEGREES -WURFL -SYSTEMD -OBSOLETE_LINKER +PRCTL -PROCCTL +THREAD_DUMP -EVPORTS -OT -QUIC +PROMEX -MEMORY_PROFILING
Default settings :
bufsize = 16384, maxrewrite = 1024, maxpollevents = 200
Built with multi-threading support (MAX_THREADS=64, default=16).
Built with OpenSSL version : OpenSSL 1.1.1k 25 Mar 2021
Running on OpenSSL version : OpenSSL 1.1.1k 25 Mar 2021
OpenSSL library supports TLS extensions : yes
OpenSSL library supports SNI : yes
OpenSSL library supports : TLSv1.0 TLSv1.1 TLSv1.2 TLSv1.3
Built with Lua version : Lua 5.4.2
Built with the Prometheus exporter as a service
Built with network namespace support.
Built with zlib version : 1.2.11
Running on zlib version : 1.2.11
Compression algorithms supported : identity("identity"), deflate("deflate"), raw-deflate("deflate"), gzip("gzip")
Built with transparent proxy support using: IP_TRANSPARENT IPV6_TRANSPARENT IP_FREEBIND
Built with PCRE version : 8.39 2016-06-14
Running on PCRE version : 8.39 2016-06-14
PCRE library supports JIT : no (USE_PCRE_JIT not set)
Encrypted password support via crypt(3): yes
Built with gcc compiler version 10.2.1 20210110
Available polling systems :
epoll : pref=300, test result OK
poll : pref=200, test result OK
select : pref=150, test result OK
Total: 3 (3 usable), will use epoll.
Available multiplexer protocols :
(protocols marked as <default> cannot be specified using 'proto' keyword)
h2 : mode=HTTP side=FE|BE mux=H2 flags=HTX|CLEAN_ABRT|HOL_RISK|NO_UPG
fcgi : mode=HTTP side=BE mux=FCGI flags=HTX|HOL_RISK|NO_UPG
<default> : mode=HTTP side=FE|BE mux=H1 flags=HTX
h1 : mode=HTTP side=FE|BE mux=H1 flags=HTX|NO_UPG
<default> : mode=TCP side=FE|BE mux=PASS flags=
none : mode=TCP side=FE|BE mux=PASS flags=NO_UPG
Available services : prometheus-exporter
Available filters :
[SPOE] spoe
[CACHE] cache
[FCGI] fcgi-app
[COMP] compression
[TRACE] trace
Some perf samples taken during a period of 100% CPU usage
I can provide more output or data from these if required.
$ perf record -F 99 -p <pid> -a -- sleep 60
...
$ perf report --stdio
...
# Samples: 23K of event 'cycles'
# Event count (approx.): 812920350349
#
# Overhead Command Shared Object Symbol
# ........ ....... .................. ....................................................
#
50.93% haproxy haproxy [.] h2_process
8.41% haproxy [kernel.kallsyms] [k] ___bpf_prog_run
4.88% haproxy haproxy [.] run_tasks_from_lists
4.33% haproxy haproxy [.] h2_send
2.27% haproxy haproxy [.] h2_resume_each_sending_h2s
1.83% haproxy haproxy [.] __tasklet_wakeup_on
1.75% haproxy haproxy [.] h2c_frt_handle_headers
1.71% haproxy haproxy [.] h2c_decode_headers
1.11% haproxy haproxy [.] get_trash_chunk
0.59% haproxy haproxy [.] h2_io_cb
0.37% haproxy haproxy [.] _do_poll
...
$ perf record -F 99 -p <pid> -a --call-graph dwarf sleep 30
...
$ perf report --stdio
# To display the perf.data header info, please use --header/--header-only options.
#
#
# Total Lost Samples: 0
#
# Samples: 11K of event 'cycles'
# Event count (approx.): 406857422882
#
# Children Self Command Shared Object Symbol
# ........ ........ ....... .................. ....................................................
#
54.11% 51.13% haproxy haproxy [.] h2_process
|
|--12.67%--_start
| __libc_start_main
| main
| run_thread_poll_loop
| run_poll_loop
| process_runnable_tasks
| run_tasks_from_lists
| h2_io_cb
| h2_process
| |
| --12.36%--h2_process_demux (inlined)
|
--2.98%--h2_process
|
|--1.63%--h2_process_demux (inlined)
| |
| --0.99%--h2c_frt_handle_headers
| |
| --0.70%--h2c_decode_headers
|
--1.35%--h2_send
CPU usage during this period. CPUs 0-3 are allocated to HAProxy; most other CPUs are idle.
$ mpstat -P ALL 1
...
Average: CPU %usr %nice %sys %iowait %irq %soft %steal %guest %gnice %idle
Average: all 21.87 0.00 3.45 0.00 0.00 1.14 0.00 0.00 0.00 73.54
Average: 0 85.00 0.00 12.62 0.00 0.00 2.38 0.00 0.00 0.00 0.00
Average: 1 85.00 0.00 12.54 0.00 0.00 2.46 0.00 0.00 0.00 0.00
Average: 2 85.00 0.00 12.77 0.00 0.00 2.23 0.00 0.00 0.00 0.00
Average: 3 85.91 0.00 11.93 0.00 0.00 2.16 0.00 0.00 0.00 0.00
...
Average: 15 0.62 0.00 0.54 0.00 0.00 0.00 0.00 0.00 0.00 98.84
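For context, the CPU pinning referred to above would look something like the following in haproxy.cfg. This is a hypothetical sketch of pinning 4 threads to CPUs 0-3, not a copy of our actual configuration:

```
global
    # Hypothetical sketch: run 4 threads and bind them to CPUs 0-3.
    nbthread 4
    cpu-map auto:1/1-4 0-3
```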
Syscall counts from 2 different HAProxy servers we’re running, taken over a short period. Server A has higher traffic during this period; Server B is the server experiencing the CPU usage issue at the time these samples were taken. Note the huge difference in clock_gettime and epoll_wait on Server B.
Server A
149709 syscalls:sys_enter_recvfrom
140157 syscalls:sys_enter_clock_gettime
84679 syscalls:sys_enter_getpid
71813 syscalls:sys_enter_sendto
70124 syscalls:sys_enter_epoll_wait
28652 syscalls:sys_enter_setsockopt
26697 syscalls:sys_enter_epoll_ctl
20043 syscalls:sys_enter_getsockopt
19297 syscalls:sys_enter_accept4
15505 syscalls:sys_enter_close
10111 syscalls:sys_enter_socket
10111 syscalls:sys_enter_fcntl
10111 syscalls:sys_enter_connect
5206 syscalls:sys_enter_getsockname
3064 syscalls:sys_enter_read
1532 syscalls:sys_enter_write
101 syscalls:sys_enter_shutdown
41 syscalls:sys_enter_futex
12 syscalls:sys_enter_timer_settime
12 syscalls:sys_enter_rt_sigreturn
Server B
1297807 syscalls:sys_enter_clock_gettime
649920 syscalls:sys_enter_epoll_wait
132029 syscalls:sys_enter_recvfrom
71805 syscalls:sys_enter_getpid
64394 syscalls:sys_enter_sendto
26772 syscalls:sys_enter_setsockopt
24462 syscalls:sys_enter_epoll_ctl
16672 syscalls:sys_enter_getsockopt
14206 syscalls:sys_enter_close
12293 syscalls:sys_enter_accept4
9546 syscalls:sys_enter_socket
9546 syscalls:sys_enter_connect
9537 syscalls:sys_enter_fcntl
4525 syscalls:sys_enter_getsockname
72 syscalls:sys_enter_shutdown
62 syscalls:sys_enter_futex
40 syscalls:sys_enter_timer_settime
40 syscalls:sys_enter_rt_sigreturn
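For reference, per-syscall counts like those above can be gathered with perf’s syscall tracepoints. This is a sketch rather than the exact command we used; <pid> and the 30-second window are placeholders, and it needs a perf build with tracepoint support plus sufficient privileges:

```shell
# Count syscall entries for one process over 30 seconds, then sort
# by count descending. perf stat writes its counts to stderr.
perf stat -e 'syscalls:sys_enter_*' -p <pid> -- sleep 30 2>&1 \
  | awk '$1 ~ /^[0-9,]+$/ && $2 ~ /^syscalls:/ {gsub(",", "", $1); print $1, $2}' \
  | sort -rn
```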
Please let me know if there is any other information I can provide.
fitzy