We changed the HAProxy configuration so that maxconn is never reached (the config is provided below). This issue has already happened to us a few times, on both 1.7.11 and 1.8.12. Today one of our HAProxy 1.7.11 instances was down for about 8 minutes because of this same issue. Here is what haproxy.log looked like at the time:
Aug 29 12:13:04 app-haproxy-02 haproxy[15343]: X.X.X.X:38696 [29/Aug/2018:12:13:04.018] www-https/1: SSL handshake failure
Aug 29 12:13:04 app-haproxy-02 haproxy[15343]: X.X.X.X:38704 [29/Aug/2018:12:13:04.041] www-https/1: SSL handshake failure
Aug 29 12:13:04 app-haproxy-02 haproxy[15343]: X.X.X.X:38729 [29/Aug/2018:12:13:04.119] www-https/1: SSL handshake failure
Aug 29 12:13:04 app-haproxy-02 haproxy[15343]: X.X.X.X:38731 [29/Aug/2018:12:13:04.135] www-https/1: SSL handshake failure
Aug 29 12:13:04 app-haproxy-02 haproxy[15343]: X.X.X.X:38732 [29/Aug/2018:12:13:04.138] www-https/1: SSL handshake failure
Aug 29 12:13:04 app-haproxy-02 haproxy[15343]: X.X.X.X:38734 [29/Aug/2018:12:13:04.142] www-https/1: SSL handshake failure
Aug 29 12:13:04 app-haproxy-02 haproxy[15343]: X.X.X.X:38735 [29/Aug/2018:12:13:04.146] www-https/1: SSL handshake failure
Aug 29 12:13:04 app-haproxy-02 haproxy[15343]: X.X.X.X:38733 [29/Aug/2018:12:13:04.139] www-https/1: SSL handshake failure
Aug 29 12:13:04 app-haproxy-02 haproxy[15343]: X.X.X.X:38738 [29/Aug/2018:12:13:04.151] www-https/1: SSL handshake failure
Aug 29 12:13:04 app-haproxy-02 haproxy[15343]: X.X.X.X:38742 [29/Aug/2018:12:13:04.156] www-https/1: SSL handshake failure
...
Aug 29 12:14:19 app-haproxy-02 haproxy[15343]: X.X.X.X:39015 [29/Aug/2018:12:13:54.324] www-https/1: SSL handshake failure
Aug 29 12:14:24 app-haproxy-02 haproxy[15343]: X.X.X.X:51108 [29/Aug/2018:12:14:19.366] www-https/1: SSL handshake failure
Aug 29 12:14:24 app-haproxy-02 haproxy[15343]: X.X.X.X:42819 [29/Aug/2018:12:14:19.366] www-https/1: SSL handshake failure
Aug 29 12:14:24 app-haproxy-02 haproxy[15343]: X.X.X.X:42817 [29/Aug/2018:12:14:19.366] www-https/1: SSL handshake failure
Aug 29 12:14:24 app-haproxy-02 haproxy[15343]: X.X.X.X:42813 [29/Aug/2018:12:14:19.366] www-https/1: SSL handshake failure
Aug 29 12:14:29 app-haproxy-02 haproxy[15343]: X.X.X.X:48220 [29/Aug/2018:12:14:19.366] www-https/1: SSL handshake failure
Aug 29 12:14:34 app-haproxy-02 haproxy[15343]: X.X.X.X:48207 [29/Aug/2018:12:14:19.366] www-https/1: SSL handshake failure
Aug 29 12:14:39 app-haproxy-02 haproxy[15343]: X.X.X.X:48194 [29/Aug/2018:12:14:19.366] www-https/1: SSL handshake failure
Aug 29 12:14:39 app-haproxy-02 haproxy[15343]: X.X.X.X:43636 [29/Aug/2018:12:14:19.366] www-https/1: Stopped a TLSv1 heartbeat attack (CVE-2014-0160)
Aug 29 12:14:44 app-haproxy-02 haproxy[15343]: X.X.X.X:48151 [29/Aug/2018:12:14:19.366] www-https/1: SSL handshake failure
Aug 29 12:14:44 app-haproxy-02 haproxy[15343]: X.X.X.X:48082 [29/Aug/2018:12:14:19.366] www-https/1: SSL handshake failure
Aug 29 12:14:44 app-haproxy-02 haproxy[15343]: X.X.X.X:48062 [29/Aug/2018:12:14:19.366] www-https/1: SSL handshake failure
Aug 29 12:14:44 app-haproxy-02 haproxy[15343]: X.X.X.X:48036 [29/Aug/2018:12:14:19.366] www-https/1: SSL handshake failure
...
Aug 29 12:20:54 app-haproxy-02 haproxy[15343]: X.X.X.X:49328 [29/Aug/2018:12:17:24.597] www-https/1: SSL handshake failure
Aug 29 12:20:59 app-haproxy-02 haproxy[15343]: X.X.X.X:49319 [29/Aug/2018:12:17:24.597] www-https/1: SSL handshake failure
Aug 29 12:21:04 app-haproxy-02 haproxy[15343]: X.X.X.X:44106 [29/Aug/2018:12:17:24.597] www-https/1: SSL handshake failure
Aug 29 12:21:09 app-haproxy-02 haproxy[15343]: X.X.X.X:47935 [29/Aug/2018:12:17:24.597] www-https/1: SSL handshake failure
Aug 29 12:21:09 app-haproxy-02 haproxy[15343]: X.X.X.X:36133 [29/Aug/2018:12:17:24.597] www-https/1: SSL handshake failure
Aug 29 12:21:14 app-haproxy-02 haproxy[15343]: X.X.X.X:45808 [29/Aug/2018:12:17:24.597] www-https/1: SSL handshake failure
Aug 29 12:21:19 app-haproxy-02 haproxy[15343]: X.X.X.X:57376 [29/Aug/2018:12:17:24.597] www-https/1: SSL handshake failure
Aug 29 12:21:19 app-haproxy-02 haproxy[15343]: X.X.X.X:35844 [29/Aug/2018:12:21:19.888] www-https/1: SSL handshake failure
I removed a lot of lines from the log and replaced them with "..." because they were identical apart from the timestamps. The line with "Stopped a TLSv1 heartbeat attack (CVE-2014-0160)" is very interesting, because it looks like the attacker tried a heartbeat attack, but that probably isn't the only thing the attacker tried.
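One thing that can be measured from these lines is the gap between the accept time (in square brackets) and the syslog time at the start of the line. In the last excerpt, connections accepted at 12:17:24.597 were only logged between 12:20:54 and 12:21:19, roughly three and a half minutes later. Below is a minimal Python sketch for computing that lag per line. The function name `log_lag_seconds` and the `year` parameter are my own choices, and the regex assumes the exact line layout shown above:

```python
import re
from datetime import datetime

# Captures the syslog timestamp at the start of the line and the
# accept timestamp that HAProxy prints in square brackets.
LINE_RE = re.compile(
    r"^(?P<syslog>\w{3}\s+\d+ \d{2}:\d{2}:\d{2}) .*?"
    r"\[(?P<accept>\d{2}/\w{3}/\d{4}:\d{2}:\d{2}:\d{2}\.\d{3})\]"
)


def log_lag_seconds(line, year=2018):
    """Return (syslog time - accept time) in seconds, or None if no match.

    The syslog timestamp has no year, so one is assumed via `year`.
    Month names are parsed with %b, which assumes an English/C locale.
    Syslog time has one-second resolution, so a line logged right away
    can show a slightly negative lag.
    """
    m = LINE_RE.search(line)
    if m is None:
        return None
    logged = datetime.strptime(f"{year} {m.group('syslog')}", "%Y %b %d %H:%M:%S")
    accepted = datetime.strptime(m.group("accept"), "%d/%b/%Y:%H:%M:%S.%f")
    return (logged - accepted).total_seconds()
```

Running this over the full log and sorting by lag would show when the delay between accept and log started to grow. Whether that points to a stalled process or to slow handshakes is only a guess at this point.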
Here is the HAProxy version which is running on CentOS 7:
# haproxy -vv
HA-Proxy version 1.7.11 2018/04/30
Copyright 2000-2018 Willy Tarreau <willy@haproxy.org>
Build options :
TARGET = linux2628
CPU = generic
CC = gcc
CFLAGS = -O2 -g -fno-strict-aliasing -Wdeclaration-after-statement -fwrapv -fno-strict-overflow
OPTIONS = USE_LINUX_TPROXY=1 USE_ZLIB=1 USE_REGPARM=1 USE_OPENSSL=1 USE_LUA=1 USE_PCRE=1
Default settings :
maxconn = 2000, bufsize = 16384, maxrewrite = 1024, maxpollevents = 200
Encrypted password support via crypt(3): yes
Built with zlib version : 1.2.7
Running on zlib version : 1.2.7
Compression algorithms supported : identity("identity"), deflate("deflate"), raw-deflate("deflate"), gzip("gzip")
Built with OpenSSL version : OpenSSL 1.0.2k-fips 26 Jan 2017
Running on OpenSSL version : OpenSSL 1.0.2k-fips 26 Jan 2017
OpenSSL library supports TLS extensions : yes
OpenSSL library supports SNI : yes
OpenSSL library supports prefer-server-ciphers : yes
Built with PCRE version : 8.32 2012-11-30
Running on PCRE version : 8.32 2012-11-30
PCRE library supports JIT : no (USE_PCRE_JIT not set)
Built with Lua version : Lua 5.3.4
Built with transparent proxy support using: IP_TRANSPARENT IPV6_TRANSPARENT IP_FREEBIND
Available polling systems :
epoll : pref=300, test result OK
poll : pref=200, test result OK
select : pref=150, test result OK
Total: 3 (3 usable), will use epoll.
Available filters :
[COMP] compression
[TRACE] trace
[SPOE] spoe
And here is the configuration:
global
    log /dev/log local2
    chroot /var/lib/haproxy
    pidfile /var/run/haproxy.pid
    maxconn 60000
    user haproxy
    group haproxy
    daemon
    stats socket /var/lib/haproxy/stats mode 0777 level admin
    tune.ssl.default-dh-param 2048
    ssl-default-bind-options no-sslv3

defaults
    mode http
    log global
    maxconn 2000
    backlog 32768
    retries 3
    option httplog
    option dontlognull
    option dontlog-normal
    option http-server-close
    option forwardfor except 127.0.0.0/8
    option redispatch
    timeout http-request 5s
    timeout queue 1m
    timeout connect 5s
    timeout client 1m
    timeout server 1m
    timeout http-keep-alive 10s
    timeout check 5s

frontend www-http
    bind :80
    maxconn 20000
    acl letsencrypt-acl path_beg /.well-known/acme-challenge/
    reqadd X-Forwarded-Proto:\ http
    use_backend letsencrypt-backend if letsencrypt-acl
    default_backend app-http

frontend www-https
    bind :443 ssl crt /etc/letsencrypt/live/app.example.com/fullchain-privkey.pem
    acl metric-acl path_beg /metrics
    http-request deny if metric-acl
    http-request set-header X-Real-IP %[src]
    reqadd X-Forwarded-Proto:\ https
    default_backend app-http

backend app-http
    redirect scheme https if !{ ssl_fc }
    balance roundrobin
    option httpchk GET /api/v1/health-check/simple-check
    default-server inter 2s fastinter 1s rise 2 fall 2 on-marked-down shutdown-sessions
    server app-01 app-01:80 check
    server app-02 app-02:80 check

backend letsencrypt-backend
    server letsencrypt 127.0.0.1:54321

listen stats
    bind :9000
    mode http
    stats enable
    stats hide-version
    stats uri /
    stats refresh 10s
Also, because we monitor HAProxy with Prometheus, we have an HAProxy exporter running locally on this HAProxy node, configured to pull stats from the unix socket, so maxconn shouldn't be an issue for fetching stats (even though maxconn is now configured correctly anyway). Interestingly, during this attack the exporter couldn't fetch stats; it was reporting:
time="2018-08-29T12:17:49+02:00" level=error msg="Can't scrape HAProxy: dial unix /var/lib/haproxy/stats: connect: resource temporarily unavailable" source="haproxy_exporter.go:315"
...
time="2018-08-29T12:21:15+02:00" level=error msg="Can't scrape HAProxy: dial unix /var/lib/haproxy/stats: connect: resource temporarily unavailable" source="haproxy_exporter.go:315"
Our HTTP checks against this app also failed for those 8 minutes.
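To check the stats socket independently of the exporter, something like the following could be run on the node during the next incident. It is only a sketch: `probe_stats_socket` is a name I made up, and the only HAProxy-specific part is the `show info` command sent over the socket. As far as I understand, "resource temporarily unavailable" is EAGAIN, which on Linux can be returned by connect() on a unix socket when the listener's backlog is full, so the probe reports the errno name to make that visible:

```python
import errno
import socket


def probe_stats_socket(path, timeout=2.0):
    """Send 'show info' to an HAProxy stats socket.

    Returns (True, response_text) on success, or (False, "ERRNAME: message")
    if connecting, sending or reading fails or times out.
    """
    sock = socket.socket(socket.AF_UNIX, socket.SOCK_STREAM)
    sock.settimeout(timeout)
    try:
        sock.connect(path)
        sock.sendall(b"show info\n")
        chunks = []
        while True:
            data = sock.recv(4096)
            if not data:  # HAProxy closes the connection after the reply
                break
            chunks.append(data)
        return True, b"".join(chunks).decode("ascii", "replace")
    except OSError as exc:
        # Timeouts carry no errno, so fall back to the exception class name.
        if exc.errno:
            name = errno.errorcode.get(exc.errno, "UNKNOWN")
        else:
            name = type(exc).__name__
        return False, f"{name}: {exc}"
    finally:
        sock.close()
```

With our config the call would be `probe_stats_socket("/var/lib/haproxy/stats")`. An EAGAIN or a timeout here, while the process is still alive, would suggest HAProxy is not getting around to accepting connections at all, rather than hitting a connection limit.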
Here is one interesting graph from Prometheus:
As you can see from the graph, there is a gap of about 8 minutes in the stats (which matches the haproxy log), and max sessions on the www-https frontend was 68, which is nothing. I would also like to mention that CPU usage was about 0%, and memory, disk and network showed no notable activity (except for a few more packets on the network, but that is minor). We have HAProxy instances like this one serving 5k sessions constantly without any issues, so the host resources were not exhausted.
Any suggestions on what we could do to find the cause of this issue?