Seeing 100% CPU at random times, stability issues

Hello HAProxy users,

We have been seeing our 20 or so HAProxy nodes spike to 99% CPU at random times. This causes our site to stop taking traffic for about 10 minutes, and causes our Auto Scaling group in AWS to spike up to our maximum of 40 nodes to cover the traffic.

We are using a recent version of HAProxy (2.4.17). We are not restarting the systems or processes at any time, other than when they start up on their own as configured. We use AWS to autoscale hosts at 20% CPU (which seems low, but we know that if we overstress HAProxy too much it seems to topple over). Perhaps we are doing something wrong with our configs and they are not optimal?

We run a minimum of 20 c5.xlarge instances.

Here is a copy of our configs with private information excluded:

```
global
    log 127.0.0.1 local6
    maxconn 30720
    user haproxy
    group haproxy
    daemon
    quiet
    pidfile /var/run/haproxy.pid
    tune.maxrewrite 16384
    tune.bufsize 65536
    nbproc 1
    nbthread 8
    stats socket /var/run/monitor-haproxy/haproxy.sock mode 0600 level admin

defaults
    log global
    mode http
    option httpchk GET /health/
    option httplog
    option dontlognull
    option dontlog-normal
    option log-separate-errors
    retries 3
    option redispatch
    maxconn 30720
    timeout connect 5s
    timeout client 60s
    timeout server 60s
    option http-server-close
    option tcp-smart-accept
    option tcp-smart-connect
    option allbackups
    stats enable
    stats show-legends
    stats uri /lbstats
    stats refresh 10
    stats realm Florin
    stats auth username:passsssword1234
    balance roundrobin
    hash-type consistent djb2 avalanche
    no option log-health-checks

resolvers localdns
    nameserver localdns 127.0.0.1:53

frontend stats
    bind *:7684
    stats enable
    stats refresh 10s
    stats realm Monitor
    stats uri /hc

frontend http-in
    bind *:80
    default_backend xxxxxxx

frontend https-in
    bind *:443 ssl crt /etc/ssl/private/self_signed.pem
```

A bunch of ACLs to identify traffic.

Four backends to distribute traffic based on those ACL matches:

```
backend xxxxx
    option httpchk GET /favicon.ico HTTP/1.1\r\nHost:\ 10.78.0.0
    server-template xxxxxxx 1-12 internal-xxxxxx.us-east-1.elb.amazonaws.com:443 resolvers localdns resolve-prefer ipv4 init-addr last,libc,none check ssl verify none no-sslv3 inter 10s fall 4

backend xxxxx-xxxxxxx
    timeout queue 10s
    option httpchk GET /favicon.ico HTTP/1.1\r\nHost:\ 10.78.0.0
    server-template ABC 1-12 internal-xxx.us-east-1.elb.amazonaws.com:443 resolvers localdns resolve-prefer ipv4 init-addr last,libc,none check ssl verify none no-sslv3 inter 10s fall 4 maxconn 200
    http-response return status 429 default-errorfiles if { status 503 }

backend xxxxxxx
    option httpchk GET /status HTTP/1.1\r\nHost:\ 10.78.0.0
    server-template ABCD 1-12 internal-xxxx.us-east-1.elb.amazonaws.com:443 resolvers localdns resolve-prefer ipv4 init-addr last,libc,none check ssl verify none no-sslv3 inter 10s fall 4

backend xxxxxxx-bots
    timeout queue 10s
    option httpchk GET /status HTTP/1.1\r\nHost:\ 10.78.0.0
    server-template ABCDEFC 1-12 internal-xxxxx.us-east-1.elb.amazonaws.com:443 resolvers localdns resolve-prefer ipv4 init-addr last,libc,none check ssl verify none no-sslv3 inter 10s fall 4 maxconn 200
    http-response return status 429 default-errorfiles if { status 503 }
```

I feel like we are not fully utilizing HAProxy with these settings.

Any help would be greatly appreciated!

Thanks
Tesla

There is nothing that immediately jumps out in your configuration.

There are some bugfixes in 2.4.18, and further fixes in the currently unreleased 2.4 git branch.

If you can, upgrade to a recent HAProxy 2.4 snapshot, or at least to 2.4.18.
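It may also help to enable task profiling, so that when a spike happens you can see which tasks are consuming the CPU from inside haproxy itself. A minimal sketch of the global-section change (the `profiling.tasks` directive and the `show profiling` CLI command exist in 2.4; whether the output points at anything useful depends on catching a spike in progress):

```
global
    # Collect per-task CPU statistics at runtime.
    # "auto" only starts sampling under load; "on" samples unconditionally.
    profiling.tasks on
```

With this set, running `echo "show profiling" | socat stdio /var/run/monitor-haproxy/haproxy.sock` (the socket path from your config) during a spike should list scheduler activity and the most expensive tasks.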

If this doesn’t change anything, I suggest you open an issue on the bug tracker, as more complex troubleshooting can be done there:
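In the meantime, a quick check during a spike is whether haproxy itself is the process burning the CPU: `show info` over the stats socket reports an `Idle_pct` field, the percentage of time the polling loop sat idle over the last second. A near-zero value during a spike means haproxy is doing the work; a high value means the CPU is going somewhere else on the box. A small sketch of extracting the field, assuming `socat` is installed; the sample payload below is invented for illustration:

```shell
# Live command (socket path taken from the config above):
#   echo "show info" | socat stdio /var/run/monitor-haproxy/haproxy.sock
#
# "show info" returns one "Key: value" pair per line. The sample here
# stands in for that output so the parsing step can be shown:
sample='Uptime_sec: 86400
Idle_pct: 7
Tasks: 161'

# Pull out the Idle_pct value.
idle=$(printf '%s\n' "$sample" | awk -F': ' '$1 == "Idle_pct" {print $2}')
echo "Idle_pct=$idle"
```

Polling this in a loop (e.g. every second via cron or a watch script) and logging it alongside your CloudWatch CPU metric would show whether the spikes correlate with haproxy's own workload.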