HAProxy consuming 100% CPU (2.3.14, 2.4.7)

Hi,

I’m seeing HAProxy spike for periods of 100% CPU usage across 4 CPUs at seemingly random times. The spikes don’t appear to have any correlation to restarts (of which around 5-10 occur a day). We were initially seeing this when running 2.3.14 and have updated to 2.4.7 and have the same problem. It seems fairly erratic, in that we may go periods of weeks with no issues, only to find that we see this occur several times over a day or two. Restarting the HAProxy process does ‘resolve’ the problem, though it may last only a short period before happening again, or it may work normally for days or weeks.

I’ve included some debugging information that will hopefully prove useful.

Version information

$ haproxy -vvv
HAProxy version 2.4.7-b5e51a5 2021/10/04 - https://haproxy.org/
Status: long-term supported branch - will stop receiving fixes around Q2 2026.
Known bugs: http://www.haproxy.org/bugs/bugs-2.4.7.html
Running on: Linux 4.19.0-17-amd64 #1 SMP Debian 4.19.194-3 (2021-07-18) x86_64
Build options :
  TARGET  = linux-glibc
  CPU     = generic
  CC      = cc
  CFLAGS  = -O2 -g -Wall -Wextra -Wdeclaration-after-statement -fwrapv -Wno-address-of-packed-member -Wno-unused-label -Wno-sign-compare -Wno-unused-parameter -Wno-clobbered -Wno-missing-field-initializers -Wno-cast-function-type -Wtype-limits -Wshift-negative-value -Wshift-overflow=2 -Wduplicated-cond -Wnull-dereference
  OPTIONS = USE_PCRE=1 USE_OPENSSL=1 USE_LUA=yes USE_ZLIB=1 USE_PROMEX=1
  DEBUG   =

Feature list : +EPOLL -KQUEUE +NETFILTER +PCRE -PCRE_JIT -PCRE2 -PCRE2_JIT +POLL -PRIVATE_CACHE +THREAD -PTHREAD_PSHARED +BACKTRACE -STATIC_PCRE -STATIC_PCRE2 +TPROXY +LINUX_TPROXY +LINUX_SPLICE +LIBCRYPT +CRYPT_H +GETADDRINFO +OPENSSL +LUA +FUTEX +ACCEPT4 -CLOSEFROM +ZLIB -SLZ +CPU_AFFINITY +TFO +NS +DL +RT -DEVICEATLAS -51DEGREES -WURFL -SYSTEMD -OBSOLETE_LINKER +PRCTL -PROCCTL +THREAD_DUMP -EVPORTS -OT -QUIC +PROMEX -MEMORY_PROFILING

Default settings :
  bufsize = 16384, maxrewrite = 1024, maxpollevents = 200

Built with multi-threading support (MAX_THREADS=64, default=16).
Built with OpenSSL version : OpenSSL 1.1.1k  25 Mar 2021
Running on OpenSSL version : OpenSSL 1.1.1k  25 Mar 2021
OpenSSL library supports TLS extensions : yes
OpenSSL library supports SNI : yes
OpenSSL library supports : TLSv1.0 TLSv1.1 TLSv1.2 TLSv1.3
Built with Lua version : Lua 5.4.2
Built with the Prometheus exporter as a service
Built with network namespace support.
Built with zlib version : 1.2.11
Running on zlib version : 1.2.11
Compression algorithms supported : identity("identity"), deflate("deflate"), raw-deflate("deflate"), gzip("gzip")
Built with transparent proxy support using: IP_TRANSPARENT IPV6_TRANSPARENT IP_FREEBIND
Built with PCRE version : 8.39 2016-06-14
Running on PCRE version : 8.39 2016-06-14
PCRE library supports JIT : no (USE_PCRE_JIT not set)
Encrypted password support via crypt(3): yes
Built with gcc compiler version 10.2.1 20210110

Available polling systems :
      epoll : pref=300,  test result OK
       poll : pref=200,  test result OK
     select : pref=150,  test result OK
Total: 3 (3 usable), will use epoll.

Available multiplexer protocols :
(protocols marked as <default> cannot be specified using 'proto' keyword)
              h2 : mode=HTTP       side=FE|BE     mux=H2       flags=HTX|CLEAN_ABRT|HOL_RISK|NO_UPG
            fcgi : mode=HTTP       side=BE        mux=FCGI     flags=HTX|HOL_RISK|NO_UPG
       <default> : mode=HTTP       side=FE|BE     mux=H1       flags=HTX
              h1 : mode=HTTP       side=FE|BE     mux=H1       flags=HTX|NO_UPG
       <default> : mode=TCP        side=FE|BE     mux=PASS     flags=
            none : mode=TCP        side=FE|BE     mux=PASS     flags=NO_UPG

Available services : prometheus-exporter
Available filters :
        [SPOE] spoe
        [CACHE] cache
        [FCGI] fcgi-app
        [COMP] compression
        [TRACE] trace

Some perf samples during the period of running at 100% CPU

I can provide more output or data from these if required.

$ perf record -F 99 -p <pid> -a -- sleep 60
...
$ perf report --stdio
...
# Samples: 23K of event 'cycles'
# Event count (approx.): 812920350349
#
# Overhead  Command  Shared Object       Symbol
# ........  .......  ..................  ....................................................
#
    50.93%  haproxy  haproxy             [.] h2_process
     8.41%  haproxy  [kernel.kallsyms]   [k] ___bpf_prog_run
     4.88%  haproxy  haproxy             [.] run_tasks_from_lists
     4.33%  haproxy  haproxy             [.] h2_send
     2.27%  haproxy  haproxy             [.] h2_resume_each_sending_h2s
     1.83%  haproxy  haproxy             [.] __tasklet_wakeup_on
     1.75%  haproxy  haproxy             [.] h2c_frt_handle_headers
     1.71%  haproxy  haproxy             [.] h2c_decode_headers
     1.11%  haproxy  haproxy             [.] get_trash_chunk
     0.59%  haproxy  haproxy             [.] h2_io_cb
     0.37%  haproxy  haproxy             [.] _do_poll
...
$ perf record -F 99 -p <pid> -a --call-graph dwarf sleep 30
...
$ perf report --stdio
# To display the perf.data header info, please use --header/--header-only options.
#
#
# Total Lost Samples: 0
#
# Samples: 11K of event 'cycles'
# Event count (approx.): 406857422882
#
# Children      Self  Command  Shared Object       Symbol
# ........  ........  .......  ..................  ....................................................
#
    54.11%    51.13%  haproxy  haproxy             [.] h2_process
            |
            |--12.67%--_start
            |          __libc_start_main
            |          main
            |          run_thread_poll_loop
            |          run_poll_loop
            |          process_runnable_tasks
            |          run_tasks_from_lists
            |          h2_io_cb
            |          h2_process
            |          |
            |           --12.36%--h2_process_demux (inlined)
            |
             --2.98%--h2_process
                       |
                       |--1.63%--h2_process_demux (inlined)
                       |          |
                       |           --0.99%--h2c_frt_handle_headers
                       |                     |
                       |                      --0.70%--h2c_decode_headers
                       |
                        --1.35%--h2_send

CPU usage during this period - 0-3 are allocated to HAProxy (most other CPUs are idle).

$ mpstat -P ALL 1
...
Average:     CPU    %usr   %nice    %sys %iowait    %irq   %soft  %steal  %guest  %gnice   %idle
Average:     all   21.87    0.00    3.45    0.00    0.00    1.14    0.00    0.00    0.00   73.54
Average:       0   85.00    0.00   12.62    0.00    0.00    2.38    0.00    0.00    0.00    0.00
Average:       1   85.00    0.00   12.54    0.00    0.00    2.46    0.00    0.00    0.00    0.00
Average:       2   85.00    0.00   12.77    0.00    0.00    2.23    0.00    0.00    0.00    0.00
Average:       3   85.91    0.00   11.93    0.00    0.00    2.16    0.00    0.00    0.00    0.00
...
Average:      15    0.62    0.00    0.54    0.00    0.00    0.00    0.00    0.00    0.00   98.84

syscall counts of 2 different HAProxy servers we’re running over a short period. Server A has higher traffic during this period, Server B is the server experiencing the CPU usage issue at the time these samples were taken. Note the huge difference with clock_gettime and epoll_wait on Server B.

Server A

149709  syscalls:sys_enter_recvfrom
140157  syscalls:sys_enter_clock_gettime
84679  syscalls:sys_enter_getpid
71813  syscalls:sys_enter_sendto
70124  syscalls:sys_enter_epoll_wait
28652  syscalls:sys_enter_setsockopt
26697  syscalls:sys_enter_epoll_ctl
20043  syscalls:sys_enter_getsockopt
19297  syscalls:sys_enter_accept4
15505  syscalls:sys_enter_close
10111  syscalls:sys_enter_socket
10111  syscalls:sys_enter_fcntl
10111  syscalls:sys_enter_connect
5206  syscalls:sys_enter_getsockname
3064  syscalls:sys_enter_read
1532  syscalls:sys_enter_write
101 syscalls:sys_enter_shutdown
41  syscalls:sys_enter_futex
12  syscalls:sys_enter_timer_settime
12  syscalls:sys_enter_rt_sigreturn

Server B

1297807  syscalls:sys_enter_clock_gettime
649920  syscalls:sys_enter_epoll_wait
132029  syscalls:sys_enter_recvfrom
71805  syscalls:sys_enter_getpid
64394  syscalls:sys_enter_sendto
26772  syscalls:sys_enter_setsockopt
24462  syscalls:sys_enter_epoll_ctl
16672  syscalls:sys_enter_getsockopt
14206  syscalls:sys_enter_close
12293  syscalls:sys_enter_accept4
9546  syscalls:sys_enter_socket
9546  syscalls:sys_enter_connect
9537  syscalls:sys_enter_fcntl
4525  syscalls:sys_enter_getsockname
72  syscalls:sys_enter_shutdown
62  syscalls:sys_enter_futex
40  syscalls:sys_enter_timer_settime
40  syscalls:sys_enter_rt_sigreturn

Please let me know if there is any other information I can provide.

fitzy

2 Likes

This is probably a bug, can you file an issue at github:

Today 2.4.8 was released, it fixes loops in multiple areas, most importantly in the DNS resolver code. I guess there is no harm in trying to 2.4.8 before diving into it deeper:

https://www.mail-archive.com/haproxy@formilux.org/msg41404.html

1 Like

Thanks @lukastribus, I’ll update to 2.4.8 to see if this resolves the issue.

1 Like