HAProxy community

HAProxy stops serving frontend requests while not closing backend connections at 100% CPU utilisation


#1

Hello everyone

Lately we have had a problem that caused both of our HAProxies to consume 100% CPU time and stop responding to new frontend connections. While this problem occurs, HAProxy seems to keep all existing backend connections open and does not close them over a longer period of time (this information is based on the Grafana graphs). HAProxy stops creating log entries, and there are no error or warning messages in other logs.

Reloading the haproxy service in this situation allows new frontend connections to HAProxy for a short period of time, and our site is served until the situation described above occurs again (30-60 min). Restarting the haproxy service solves the problem.
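
For clarity, by reload and restart I mean the systemd actions, roughly like this (assuming the stock haproxy unit file on CentOS 7):

systemctl reload haproxy    # new processes are started, the old ones keep serving their existing connections until they finish
systemctl restart haproxy   # the old process is stopped completely, all existing connections are dropped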

Both of our HAProxies run the same setup:

  • CentOS 7 virtual machine on VMware ESX
  • 4 CPUs and 4GB RAM
  • Kernel 3.10.0-862.3.2.el7.x86_64 #1 SMP
  • HAProxy Version 1.8.14-52e4d43 (also occurred with HAProxy 1.8.9)

Both systems perform SSL offloading and load balance traffic to four backend HTTP servers, which are on the same network.

Our normal load behaviour on a Saturday looks like this. The archived data is only available in 30-minute intervals, so the CPU utilisation of about 20% shown there is not accurate.

When the described problem occurs, our monitoring records the following data. Both HAProxies stop working at the exact same moment; therefore, I think that HAProxy itself is not the root cause, but its effects lead to a problem on both systems. Also, after the peak around 20:30 (08:30 pm), our layer 4 load balancer records a decreasing number of connections, as expected, but the HAProxies keep a high number of connections.
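
Next time I also want to compare the Grafana numbers against HAProxy’s own counters. A minimal sketch, assuming a stats socket is configured in the global section (our config below does not have one yet; it would need something like "stats socket /var/run/haproxy.sock mode 600 level admin"):

# current and cumulative connection counts as seen by haproxy itself
echo "show info" | socat stdio unix-connect:/var/run/haproxy.sock | grep -E 'CurrConns|CumConns|CurrSslConns'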


In total, this problem has occurred three times in the last one and a half months, only on Saturdays during higher-load situations. Before the first occurrence this setup had been running for about two and a half months without any problems.

Unfortunately I’m unable to reproduce this behaviour.

The haproxy.cfg has the following content. I have removed names and ACLs.

global
	maxconn 40000
	nbproc 1
	nbthread 4
	cpu-map auto:1/1-4 0-3

	log 127.0.0.1 local0 notice
	log 127.0.0.1 local0 info #temp

	chroot /var/lib/haproxy
	stats timeout 30s
	user haproxy
	group haproxy
	daemon

	tune.ssl.default-dh-param 2048
	ssl-default-bind-options no-sslv3
        ssl-default-bind-ciphers ECDHE-RSA-AES128-GCM-SHA256:ECDHE-ECDSA-AES128-GCM-SHA256:ECDHE-RSA-AES256-GCM-SHA384:ECDHE-ECDSA-AES256-GCM-SHA384:DHE-RSA-AES128-GCM-SHA256:DHE-DSS-AES128-GCM-SHA256:kEDH+AESGCM:ECDHE-RSA-AES128-SHA256:ECDHE-ECDSA-AES128-SHA256:ECDHE-RSA-AES128-SHA:ECDHE-ECDSA-AES128-SHA:ECDHE-RSA-AES256-SHA384:ECDHE-ECDSA-AES256-SHA384:ECDHE-RSA-AES256-SHA:ECDHE-ECDSA-AES256-SHA:DHE-RSA-AES128-SHA256:DHE-RSA-AES128-SHA:DHE-DSS-AES128-SHA256:DHE-RSA-AES256-SHA256:DHE-DSS-AES256-SHA:DHE-RSA-AES256-SHA:!aNULL:!eNULL:!EXPORT:!DES:!RC4:!3DES:!MD5:!PSK

defaults
	maxconn 40000
	log global
	mode http
	option httplog
	option dontlognull
	timeout http-request 10s
	timeout http-keep-alive 3s
	timeout connect 12s
	timeout queue 60s
	timeout client 60s
	timeout server 60s
	timeout	check 30s

userlist ...

frontend stats_frontend
	bind *:8080
	mode http
	option dontlog-normal
	default_backend stats_backend

frontend http_frontend
	bind *:80

	option http-buffer-request
	declare capture request len 20000
	http-request capture req.body id 0
	capture request header Host len 200
	log-format "%ci:%cp [%tr] %ft %b/%s %Th/%Ti/%TR/%Tq/%Tw/%Tc/%Tr/%Ta/%Tt %ST %B %CC %CS %tsc %ac/%fc/%bc/%sc/%rc %sq/%bq %hr %hs %{+Q}r"

	acl is_crossdomain path -i /crossdomain.xml
	acl is_well_known path_beg -i /.well-known
	acl is_...

	# Redirect acls for old urls
	http-request redirect code 301 location #... (about 30 redirect rules)
	
	redirect scheme https if ...

	use_backend haproxy_backend if is_well_known
	use_backend http_server_backend if ...
	
	# Default backend
	default_backend http_server_backend

frontend https_frontend
	bind *:443 ssl alpn h2,http/1.1 crt ...

	option http-buffer-request
	declare capture request len 20000
	capture request header Host len 200
	log-format "%ci:%cp [%tr] %ft %b/%s %Th/%Ti/%TR/%Tq/%Tw/%Tc/%Tr/%Ta/%Tt %ST %B %CC %CS %tsc %ac/%fc/%bc/%sc/%rc %sq/%bq %hr %hs %{+Q}r"

	acl is_crossdomain path -i /crossdomain.xml
	acl is_well_known path_beg -i /.well-known
	acl ...

	use_backend haproxy_backend if is_well_known
	use_backend http_server_backend if ...

	# Default backend
	default_backend http_server_backend

backend http_server_backend
	balance leastconn
	option http-keep-alive
	option forwardfor
	option httpchk HEAD / HTTP/1.1\r\nHost:\ ...

	acl is_crossdomain capture.req.uri -m str /crossdomain.xml
	acl ...

	http-request deny if ...

	# Request authorization for sites
	http-request auth realm ...

	http-request set-header ...

	# Rewrite request urls 
	reqirep ^([^\ :]*)\ ...

	# default-server changes the default settings for backend servers
	default-server inter 2s downinter 5s rise 3 fall 2
	server httpserver1 10.x.y.z1:30080 check
	server httpserver2 10.x.y.z2:30080 check
	server httpserver3 10.x.y.z3:30080 check
	server httpserver4 10.x.y.z4:30080 check

backend stats_backend
	mode http
	acl is_auth ...
	acl is_admin ...
	stats ...

Does anyone know of a similar situation, or have an idea what could cause or solve this problem?

Thanks for your help.


#2

That’s a tough one to troubleshoot.

A few things:

  • can you provide the output of haproxy -vv
  • can you confirm the CPU usage is spent in userspace and haproxy (not in the kernel)?
  • install strace on both haproxy instances, and when you are in this situation (high CPU, no responses), trace the syscalls that haproxy makes with strace -ttfp <haproxy-PID>. The output may contain confidential data like IP addresses or even parts of the HTTP transaction - but when haproxy is spinning in a busy loop, that is likely not even the case. In any case, do consider confidentiality when providing strace output (see the sketch below this list).
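
A minimal capture sketch, assuming a single multi-threaded haproxy process; the PID placeholder and the output path are only examples:

# find the PID of the spinning process (pick the busy one from top if there are several)
pidof haproxy
# attach with microsecond timestamps (-tt), follow all threads (-f), write the trace to a file
strace -ttf -o /tmp/haproxy-spin.strace -p <haproxy-PID>
# stop after 10-20 seconds with Ctrl+C; in a busy loop you typically see the same few
# syscalls (for example epoll_wait returning immediately) repeated over and over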

#3

Hi lukastribus,

thanks for your reply.

The output of haproxy -vv:

HA-Proxy version 1.8.14-52e4d43 2018/09/20
Copyright 2000-2018 Willy Tarreau <willy@haproxy.org>

Build options :
  TARGET  = linux2628
  CPU     = generic
  CC      = gcc
  CFLAGS  = -O2 -g -fno-strict-aliasing -Wdeclaration-after-statement -fwrapv -fno-strict-overflow -Wno-unused-label
  OPTIONS = USE_ZLIB=1 USE_OPENSSL=1 USE_SYSTEMD=1 USE_PCRE=1

Default settings :
  maxconn = 2000, bufsize = 16384, maxrewrite = 1024, maxpollevents = 200

Built with OpenSSL version : OpenSSL 1.0.2k-fips  26 Jan 2017
Running on OpenSSL version : OpenSSL 1.0.2k-fips  26 Jan 2017
OpenSSL library supports TLS extensions : yes
OpenSSL library supports SNI : yes
OpenSSL library supports : SSLv3 TLSv1.0 TLSv1.1 TLSv1.2
Built with transparent proxy support using: IP_TRANSPARENT IPV6_TRANSPARENT IP_FREEBIND
Encrypted password support via crypt(3): yes
Built with multi-threading support.
Built with PCRE version : 8.32 2012-11-30
Running on PCRE version : 8.32 2012-11-30
PCRE library supports JIT : no (USE_PCRE_JIT not set)
Built with zlib version : 1.2.7
Running on zlib version : 1.2.7
Compression algorithms supported : identity("identity"), deflate("deflate"), raw-deflate("deflate"), gzip("gzip")
Built with network namespace support.

Available polling systems :
      epoll : pref=300,  test result OK
       poll : pref=200,  test result OK
     select : pref=150,  test result OK
Total: 3 (3 usable), will use epoll.

Available filters :
	[SPOE] spoe
	[COMP] compression
	[TRACE] trace

Regarding your second question, I can’t confirm at the moment whether the CPU usage is spent in userspace and in haproxy. The processes were restarted by a colleague before I was able to log in to the system. I have to wait for the next occurrence of this problem.
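
When it happens again, this is roughly what I plan to check first, assuming pidstat from the sysstat package is available; <haproxy-PID> is a placeholder:

# per-thread CPU usage split into %usr and %system, sampled every second
pidstat -u -t -p <haproxy-PID> 1
# alternatively, a per-thread view in top
top -H -p <haproxy-PID>
# high %usr points at haproxy itself, high %system points at time spent in the kernel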

I’ve installed strace on both HAProxy instances.


#4

haproxy -vv output is fine.

Other than stracing the process as described above, I figure that if you use nbproc instead of nbthread, the situation would probably be less critical (a process may spin at 100%, but it would not take down the entire instance). This would also make troubleshooting easier.

It sounds like a haproxy bug, and it could also be related to the nbthread feature itself.

So if you need an immediate workaround, disabling nbthread could help. Just be aware that nbproc has some disadvantages, especially regarding stats (which are per-process).
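
A rough, untested sketch of what that could look like with your configuration; the stats part is what needs extra care, because each process keeps its own counters:

global
	nbproc 4
	cpu-map auto:1-4 0-3
	# replaces: nbthread 4 / cpu-map auto:1/1-4 0-3

frontend stats_frontend
	bind *:8080
	bind-process 1
	# pins the stats frontend to process 1; alternatively add one bind (or stats socket)
	# per process if you want to see the counters of every process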


#5

Hi lukastribus,

I checked our configuration history. The first time this problem occurred, haproxy was running in single-process / single-thread mode. I will have a look at the multi-process configuration and may configure it on one node.