SSL handshake failure hangs HAProxy


#1

We are using HAProxy 1.8.9, but the same thing happens on 1.7.x versions. From time to time we get the following messages in HAProxy log (source IP is hidden):

Jul 12 15:43:36 hap-01 haproxy[26141]: x.y.z.w:47996 [12/Jul/2018:15:43:36.514] www-https/1: SSL handshake failure
Jul 12 15:43:37 hap-01 haproxy[26141]: x.y.z.w:48986 [12/Jul/2018:15:43:37.071] www-https/1: SSL handshake failure
Jul 12 15:43:37 hap-01 haproxy[26141]: x.y.z.w:49222 [12/Jul/2018:15:43:37.255] www-https/1: SSL handshake failure
Jul 12 15:43:37 hap-01 haproxy[26141]: x.y.z.w:49922 [12/Jul/2018:15:43:37.833] www-https/1: SSL handshake failure
Jul 12 15:43:38 hap-01 haproxy[26141]: x.y.z.w:50294 [12/Jul/2018:15:43:38.188] www-https/1: SSL handshake failure
Jul 12 15:43:53 hap-01 haproxy[26141]: x.y.z.w:55843 [12/Jul/2018:15:43:48.203] www-https/1: SSL handshake failure
Jul 12 15:44:03 hap-01 haproxy[26141]: x.y.z.w:60427 [12/Jul/2018:15:43:53.214] www-https/1: SSL handshake failure
Jul 12 15:44:08 hap-01 haproxy[26141]: x.y.z.w:60423 [12/Jul/2018:15:43:53.214] www-https/1: SSL handshake failure
Jul 12 15:44:08 hap-01 haproxy[26141]: x.y.z.w:50304 [12/Jul/2018:15:43:43.195] www-https/1: SSL handshake failure
Jul 12 15:44:13 hap-01 haproxy[26141]: x.y.z.w:60381 [12/Jul/2018:15:43:53.214] www-https/1: SSL handshake failure
Jul 12 15:44:28 hap-01 haproxy[26141]: x.y.z.w:50308 [12/Jul/2018:15:43:43.195] www-https/1: SSL handshake failure
Jul 12 15:44:38 hap-01 haproxy[26141]: x.y.z.w:52772 [12/Jul/2018:15:44:13.234] www-https/1: SSL handshake failure
Jul 12 15:44:43 hap-01 haproxy[26141]: x.y.z.w:50255 [12/Jul/2018:15:44:38.256] www-https/1: SSL handshake failure
Jul 12 15:44:48 hap-01 haproxy[26141]: x.y.z.w:44924 [12/Jul/2018:15:44:38.256] www-https/1: SSL handshake failure
Jul 12 15:44:48 hap-01 haproxy[26141]: x.y.z.w:34625 [12/Jul/2018:15:44:28.246] www-https/1: SSL handshake failure

During this time HAProxy becomes unresponsive. We are monitoring HAProxy with Prometheus, and the haproxy_exporter reports the following error in it’s log:

time="2018-07-12T15:44:04+02:00" level=error msg="Can't scrape HAProxy: Get http://localhost:9000/;csv: net/http: request canceled (Client.Timeout exceeded while awaiting headers)" source="haproxy_exporter.go:304" 
time="2018-07-12T15:44:09+02:00" level=error msg="Can't scrape HAProxy: Get http://localhost:9000/;csv: net/http: request canceled (Client.Timeout exceeded while awaiting headers)" source="haproxy_exporter.go:304" 
time="2018-07-12T15:44:34+02:00" level=error msg="Can't scrape HAProxy: Get http://localhost:9000/;csv: net/http: request canceled (Client.Timeout exceeded while awaiting headers)" source="haproxy_exporter.go:304" 
time="2018-07-12T15:44:39+02:00" level=error msg="Can't scrape HAProxy: Get http://localhost:9000/;csv: net/http: request canceled (Client.Timeout exceeded while awaiting headers)" source="haproxy_exporter.go:304" 

At first we thought that it is a problem with the exporter, but one of our devs managed to login into the host during this period and his curl requests also timed out. This HAProxy isn’t that active, but this happens on more active ones from time to time. You can view a snapshot of the Grafana dashboard for this host with all metrics collected by node_exporter. Looking at the metrics I would say the there was some connection spam, but doesn’t look anything special to be able to make HAProxy unresponsive.

Do you have an idea what we can do to resolve this issue? Maybe we can enable additional logging or something to get more information.


#2

Please provide the configuration and the output of haproxy -vv first of all.

Most likely miss-configured timeouts are leading to sessions staying there forever.

Also please upgrade to latest 1.8 stable release, 1.8.9 has a lot of issues:
http://www.haproxy.org/bugs/bugs-1.8.9.html


#3

Sorry forgot about the additional info:

# haproxy -vv
HA-Proxy version 1.8.9-83616ec 2018/05/18
Copyright 2000-2018 Willy Tarreau <willy@haproxy.org>

Build options :
  TARGET  = linux2628
  CPU     = generic
  CC      = gcc
  CFLAGS  = -O2 -g -fno-strict-aliasing -Wdeclaration-after-statement -fwrapv -fno-strict-overflow -Wno-unused-label
  OPTIONS = USE_LINUX_TPROXY=1 USE_CRYPT_H=1 USE_GETADDRINFO=1 USE_ZLIB=1 USE_REGPARM=1 USE_OPENSSL=1 USE_LUA=1 USE_SYSTEMD=1 USE_PCRE=1

Default settings :
  maxconn = 2000, bufsize = 16384, maxrewrite = 1024, maxpollevents = 200

Built with OpenSSL version : OpenSSL 1.0.2k-fips  26 Jan 2017
Running on OpenSSL version : OpenSSL 1.0.2k-fips  26 Jan 2017
OpenSSL library supports TLS extensions : yes
OpenSSL library supports SNI : yes
OpenSSL library supports : SSLv3 TLSv1.0 TLSv1.1 TLSv1.2
Built with Lua version : Lua 5.3.4
Built with transparent proxy support using: IP_TRANSPARENT IPV6_TRANSPARENT IP_FREEBIND
Encrypted password support via crypt(3): yes
Built with multi-threading support.
Built with PCRE version : 8.32 2012-11-30
Running on PCRE version : 8.32 2012-11-30
PCRE library supports JIT : no (USE_PCRE_JIT not set)
Built with zlib version : 1.2.7
Running on zlib version : 1.2.7
Compression algorithms supported : identity("identity"), deflate("deflate"), raw-deflate("deflate"), gzip("gzip")
Built with network namespace support.

Available polling systems :
      epoll : pref=300,  test result OK
       poll : pref=200,  test result OK
     select : pref=150,  test result OK
Total: 3 (3 usable), will use epoll.

Available filters :
        [SPOE] spoe
        [COMP] compression
        [TRACE] trace

It’s CentOS 7.
Here is the config:

global
    log         /dev/log local2
    chroot      /var/lib/haproxy
    pidfile     /var/run/haproxy.pid
    nbthread 2
    maxconn     200000
    user        haproxy
    group       haproxy
    daemon
    stats socket /var/lib/haproxy/stats expose-fd listeners mode 0777 level admin
    tune.ssl.default-dh-param 2048
    ssl-default-bind-options no-sslv3

defaults
    mode http
    log global
    maxconn 200000
    backlog     32768
    retries 3
    option httplog
    option dontlognull
    option dontlog-normal
    option http-server-close
    option forwardfor except 127.0.0.0/8
    option redispatch
    timeout http-request 5s
    timeout queue 1m
    timeout connect 5s
    timeout client 1m
    timeout server 1m
    timeout http-keep-alive 10s
    timeout check 5s

frontend www-http
    bind :80
    acl letsencrypt-acl path_beg /.well-known/acme-challenge/
    reqadd X-Forwarded-Proto:\ http
    use_backend letsencrypt-backend if letsencrypt-acl
    default_backend eventserver-http

frontend www-https
    bind :443 ssl crt /etc/letsencrypt/live/removed.domain.com/fullchain-privkey.pem
    http-request set-header X-Real-IP %[src]
    http-response set-header Access-Control-Allow-Origin "*"
    reqadd X-Forwarded-Proto:\ https
    default_backend eventserver-http

backend eventserver-http
    redirect scheme https if !{ ssl_fc }
    balance roundrobin
    default-server inter 2s fastinter 1s rise 2 fall 2 on-marked-down shutdown-sessions
    server eventserver-01 eventserver-01:9000 check
    server eventserver-02 eventserver-02:9000 check

backend letsencrypt-backend
    server letsencrypt 127.0.0.1:54321

listen stats
    bind :9000
    mode http
    stats enable
    stats hide-version
    stats uri /
    stats refresh 10s

I just changed the name of the domain in the config, everything else is a copy-paste.

I upgraded now to the latest version, but as I stated before, this is happening for a few months to us, and it’s even happening on 1.7.x (multiple ones, latest it happened is 1.7.10).


#4

Ok, so here is the issue:

You’ve configured both global and default maxconn to 200000. Thats a very wrong thing to do, because if something hogs connections, the entire haproxy instance cannot handle new connections anymore, including mgm interface (as you can see not even haproxy_exporter can access haproxy anymore in this situation).

So properly configure maxconn values, meaning:

  • global maxconn needs to cover all connections from the frontend and backend (and mgm as well)
  • frontend maxconn needs to be way lower than global
  • actually configurer server maxconn values
  • also use maxconn in the listen section

So a quick suggestion based on 200000 global maxconn:

  • global maxconn 200000
  • default maxconn 20000 (a 1/10 of the global value)
  • a per server maxconn value of 10000
  • in the stats listen section, set maxconn to something like 1000

Then, wait for an issue to reoccur (it will, since we did not fix the root cause yet), but this time you will be able to understand where haproxy is hogging connections (stats will work, unless stats section is what is hogging connections in the first place).


#5

Thanks, what you said totally makes sense. I’ll set max values correctly and then hopefully we will see what happens.

Stats shouldn’t be hogging the system, because it’s called once in 30 seconds.


#6

We changed HAProxy configuration so that maxconn is never reached (will provide config below). This issue happened to us a few times already on both 1.7.11 and 1.8.12. Today one of our HAProxy 1.7.11 instances was down for about 8 minutes because of this same issue. Here is how the haproxy.log looked like at this time:

Aug 29 12:13:04 app-haproxy-02 haproxy[15343]: X.X.X.X:38696 [29/Aug/2018:12:13:04.018] www-https/1: SSL handshake failure
Aug 29 12:13:04 app-haproxy-02 haproxy[15343]: X.X.X.X:38704 [29/Aug/2018:12:13:04.041] www-https/1: SSL handshake failure
Aug 29 12:13:04 app-haproxy-02 haproxy[15343]: X.X.X.X:38729 [29/Aug/2018:12:13:04.119] www-https/1: SSL handshake failure
Aug 29 12:13:04 app-haproxy-02 haproxy[15343]: X.X.X.X:38731 [29/Aug/2018:12:13:04.135] www-https/1: SSL handshake failure
Aug 29 12:13:04 app-haproxy-02 haproxy[15343]: X.X.X.X:38732 [29/Aug/2018:12:13:04.138] www-https/1: SSL handshake failure
Aug 29 12:13:04 app-haproxy-02 haproxy[15343]: X.X.X.X:38734 [29/Aug/2018:12:13:04.142] www-https/1: SSL handshake failure
Aug 29 12:13:04 app-haproxy-02 haproxy[15343]: X.X.X.X:38735 [29/Aug/2018:12:13:04.146] www-https/1: SSL handshake failure
Aug 29 12:13:04 app-haproxy-02 haproxy[15343]: X.X.X.X:38733 [29/Aug/2018:12:13:04.139] www-https/1: SSL handshake failure
Aug 29 12:13:04 app-haproxy-02 haproxy[15343]: X.X.X.X:38738 [29/Aug/2018:12:13:04.151] www-https/1: SSL handshake failure
Aug 29 12:13:04 app-haproxy-02 haproxy[15343]: X.X.X.X:38742 [29/Aug/2018:12:13:04.156] www-https/1: SSL handshake failure
...
Aug 29 12:14:19 app-haproxy-02 haproxy[15343]: X.X.X.X:39015 [29/Aug/2018:12:13:54.324] www-https/1: SSL handshake failure
Aug 29 12:14:24 app-haproxy-02 haproxy[15343]: X.X.X.X:51108 [29/Aug/2018:12:14:19.366] www-https/1: SSL handshake failure
Aug 29 12:14:24 app-haproxy-02 haproxy[15343]: X.X.X.X:42819 [29/Aug/2018:12:14:19.366] www-https/1: SSL handshake failure
Aug 29 12:14:24 app-haproxy-02 haproxy[15343]: X.X.X.X:42817 [29/Aug/2018:12:14:19.366] www-https/1: SSL handshake failure
Aug 29 12:14:24 app-haproxy-02 haproxy[15343]: X.X.X.X:42813 [29/Aug/2018:12:14:19.366] www-https/1: SSL handshake failure
Aug 29 12:14:29 app-haproxy-02 haproxy[15343]: X.X.X.X:48220 [29/Aug/2018:12:14:19.366] www-https/1: SSL handshake failure
Aug 29 12:14:34 app-haproxy-02 haproxy[15343]: X.X.X.X:48207 [29/Aug/2018:12:14:19.366] www-https/1: SSL handshake failure
Aug 29 12:14:39 app-haproxy-02 haproxy[15343]: X.X.X.X:48194 [29/Aug/2018:12:14:19.366] www-https/1: SSL handshake failure
Aug 29 12:14:39 app-haproxy-02 haproxy[15343]: X.X.X.X:43636 [29/Aug/2018:12:14:19.366] www-https/1: Stopped a TLSv1 heartbeat attack (CVE-2014-0160)
Aug 29 12:14:44 app-haproxy-02 haproxy[15343]: X.X.X.X:48151 [29/Aug/2018:12:14:19.366] www-https/1: SSL handshake failure
Aug 29 12:14:44 app-haproxy-02 haproxy[15343]: X.X.X.X:48082 [29/Aug/2018:12:14:19.366] www-https/1: SSL handshake failure
Aug 29 12:14:44 app-haproxy-02 haproxy[15343]: X.X.X.X:48062 [29/Aug/2018:12:14:19.366] www-https/1: SSL handshake failure
Aug 29 12:14:44 app-haproxy-02 haproxy[15343]: X.X.X.X:48036 [29/Aug/2018:12:14:19.366] www-https/1: SSL handshake failure
..
Aug 29 12:20:54 app-haproxy-02 haproxy[15343]: X.X.X.X:49328 [29/Aug/2018:12:17:24.597] www-https/1: SSL handshake failure
Aug 29 12:20:59 app-haproxy-02 haproxy[15343]: X.X.X.X:49319 [29/Aug/2018:12:17:24.597] www-https/1: SSL handshake failure
Aug 29 12:21:04 app-haproxy-02 haproxy[15343]: X.X.X.X:44106 [29/Aug/2018:12:17:24.597] www-https/1: SSL handshake failure
Aug 29 12:21:09 app-haproxy-02 haproxy[15343]: X.X.X.X:47935 [29/Aug/2018:12:17:24.597] www-https/1: SSL handshake failure
Aug 29 12:21:09 app-haproxy-02 haproxy[15343]: X.X.X.X:36133 [29/Aug/2018:12:17:24.597] www-https/1: SSL handshake failure
Aug 29 12:21:14 app-haproxy-02 haproxy[15343]: X.X.X.X:45808 [29/Aug/2018:12:17:24.597] www-https/1: SSL handshake failure
Aug 29 12:21:19 app-haproxy-02 haproxy[15343]: X.X.X.X:57376 [29/Aug/2018:12:17:24.597] www-https/1: SSL handshake failure
Aug 29 12:21:19 app-haproxy-02 haproxy[15343]: X.X.X.X:35844 [29/Aug/2018:12:21:19.888] www-https/1: SSL handshake failure

I removed a lot of lines from the log and replaced them with … because they were the same with different times. Line with Stopped a TLSv1 heartbeat attack (CVE-2014-0160) is very interesting, because it looks like the attacker tried heartbeat, but that probably isn’t the only thing which the attacker tried.

Here is the HAProxy version which is running on CentOS 7:

# haproxy -vv
HA-Proxy version 1.7.11 2018/04/30
Copyright 2000-2018 Willy Tarreau <willy@haproxy.org>

Build options :
  TARGET  = linux2628
  CPU     = generic
  CC      = gcc
  CFLAGS  = -O2 -g -fno-strict-aliasing -Wdeclaration-after-statement -fwrapv -fno-strict-overflow
  OPTIONS = USE_LINUX_TPROXY=1 USE_ZLIB=1 USE_REGPARM=1 USE_OPENSSL=1 USE_LUA=1 USE_PCRE=1

Default settings :
  maxconn = 2000, bufsize = 16384, maxrewrite = 1024, maxpollevents = 200

Encrypted password support via crypt(3): yes
Built with zlib version : 1.2.7
Running on zlib version : 1.2.7
Compression algorithms supported : identity("identity"), deflate("deflate"), raw-deflate("deflate"), gzip("gzip")
Built with OpenSSL version : OpenSSL 1.0.2k-fips  26 Jan 2017
Running on OpenSSL version : OpenSSL 1.0.2k-fips  26 Jan 2017
OpenSSL library supports TLS extensions : yes
OpenSSL library supports SNI : yes
OpenSSL library supports prefer-server-ciphers : yes
Built with PCRE version : 8.32 2012-11-30
Running on PCRE version : 8.32 2012-11-30
PCRE library supports JIT : no (USE_PCRE_JIT not set)
Built with Lua version : Lua 5.3.4
Built with transparent proxy support using: IP_TRANSPARENT IPV6_TRANSPARENT IP_FREEBIND

Available polling systems :
      epoll : pref=300,  test result OK
       poll : pref=200,  test result OK
     select : pref=150,  test result OK
Total: 3 (3 usable), will use epoll.

Available filters :
        [COMP] compression
        [TRACE] trace
        [SPOE] spoe

And here is the configuration:

global
    log         /dev/log local2
    chroot      /var/lib/haproxy
    pidfile     /var/run/haproxy.pid
    maxconn     60000
    user        haproxy
    group       haproxy
    daemon
    stats socket /var/lib/haproxy/stats mode 0777 level admin
    tune.ssl.default-dh-param 2048
    ssl-default-bind-options no-sslv3

defaults
    mode http
    log global
    maxconn 2000
    backlog     32768
    retries 3
    option httplog
    option dontlognull
    option dontlog-normal
    option http-server-close
    option forwardfor except 127.0.0.0/8
    option redispatch
    timeout http-request 5s
    timeout queue 1m
    timeout connect 5s
    timeout client 1m
    timeout server 1m
    timeout http-keep-alive 10s
    timeout check 5s

frontend www-http
    bind :80
    maxconn 20000
    acl letsencrypt-acl path_beg /.well-known/acme-challenge/
    reqadd X-Forwarded-Proto:\ http
    use_backend letsencrypt-backend if letsencrypt-acl
    default_backend app-http

frontend www-https
    bind :443 ssl crt /etc/letsencrypt/live/app.example.com/fullchain-privkey.pem
    acl metric-acl path_beg /metrics
    http-request deny if metric-acl
    http-request set-header X-Real-IP %[src]
    reqadd X-Forwarded-Proto:\ https
    default_backend app-http

backend app-http
    redirect scheme https if !{ ssl_fc }
    balance roundrobin
    option httpchk GET /api/v1/health-check/simple-check
    default-server inter 2s fastinter 1s rise 2 fall 2 on-marked-down shutdown-sessions
    server app-01 app-01:80 check
    server app-02 app-02:80 check

backend letsencrypt-backend
    server letsencrypt 127.0.0.1:54321

listen stats
    bind :9000
    mode http
    stats enable
    stats hide-version
    stats uri /
    stats refresh 10s

Also because we are monitoring HAProxy with Prometheus we have a HAProxy exporter running locally on this HAProxy node and it is configured to pull stats from the unix socket, so maxconn shouldn’t be an issue for fetching stats (even thought maxconn is now configured correctly anyway). It’s interesting that during this attack the exporter couldn’t fetch stats, it was reporting:

time="2018-08-29T12:17:49+02:00" level=error msg="Can't scrape HAProxy: dial unix /var/lib/haproxy/stats: connect: resource temporarily unavailable" source="haproxy_exporter.go:315" 
...
time="2018-08-29T12:21:15+02:00" level=error msg="Can't scrape HAProxy: dial unix /var/lib/haproxy/stats: connect: resource temporarily unavailable" source="haproxy_exporter.go:315"

And our HTTP checks to this app also failed during 8 minutes.

Here is one interesting graph from Prometheus:

As you can see from the graph that we have a hole in stats of about 8 minutes (which matches with the haproxy log) and that max sessions on www-https fronted was 68, which is nothing. I would also like to mention that CPU was about 0%, memory, disk and network didn’t report any activity (except for a few packets more on network, but that is minor). We have instances of HAProxy like this which are serving 5k sessions constantly without any issues, so the host resources were not exhausted.

Any suggestions what we could do to find the problem to the issue?


#7

Are you sure the OS is fine while this happens? Do you have any messages in dmesg?

Can you attach strace to the haproxy process (strace -tt -p<PID>), while in this situation?


#8

Yes the hosts is fine. We have a node_exporter running on the host as well and Prometheus is pulling all OS performance metrics from it and all looks calm and server isn’t stressed or anything. dmesg is the same as after boot and there is nothing in /var/log/messages as well.

I don’t have strace because this is currently a low priority alert for us, so we don’t get paged when it happens, but I’ll write some script now which will launch strace when HAProxy hangs like this. Hopefully it will show us something.


#9

From the straces you send me I can see that there are poll() calls and also timejumps of a few seconds. As you compiled with epoll support and haproxy is in fact otherwise using epoll, this can come only from a library and looking more closesly into what those calls do make me suspect that openssl triggers a code path that leads down to libkrb5, which reminds me of the packet of death investigation.

If this is a VM and there are timejumps and libkrb5 has problems with it, then this may be an explanation.

Even if not, kerberos ciphers should not be used on the public Internet anyway, so the ciphers should be configured to something more appropiate.

Can you put this long configuration line into your global section and monitor the situation?

ssl-default-bind-ciphers ECDHE-ECDSA-CHACHA20-POLY1305:ECDHE-RSA-CHACHA20-POLY1305:ECDHE-RSA-AES128-GCM-SHA256:ECDHE-ECDSA-AES128-GCM-SHA256:ECDHE-RSA-AES256-GCM-SHA384:ECDHE-ECDSA-AES256-GCM-SHA384:DHE-RSA-AES128-GCM-SHA256:DHE-DSS-AES128-GCM-SHA256:kEDH+AESGCM:ECDHE-RSA-AES128-SHA256:ECDHE-ECDSA-AES128-SHA256:ECDHE-RSA-AES128-SHA:ECDHE-ECDSA-AES128-SHA:ECDHE-RSA-AES256-SHA384:ECDHE-ECDSA-AES256-SHA384:ECDHE-RSA-AES256-SHA:ECDHE-ECDSA-AES256-SHA:DHE-RSA-AES128-SHA256:DHE-RSA-AES128-SHA:DHE-DSS-AES128-SHA256:DHE-RSA-AES256-SHA256:DHE-DSS-AES256-SHA:DHE-RSA-AES256-SHA:ECDHE-RSA-DES-CBC3-SHA:ECDHE-ECDSA-DES-CBC3-SHA:EDH-RSA-DES-CBC3-SHA:AES128-GCM-SHA256:AES256-GCM-SHA384:AES128-SHA256:AES256-SHA256:AES128-SHA:AES256-SHA:AES:DES-CBC3-SHA:HIGH:SEED:!aNULL:!eNULL:!EXPORT:!DES:!RC4:!MD5:!PSK:!RSAPSK:!aDH:!aECDH:!EDH-DSS-DES-CBC3-SHA:!KRB5-DES-CBC3-SHA:!SRP

Even if this does not fix the problem, it’s good to limit the ciphers to a secure list by configuring them properly. This was taken from the Mozilla Server Side TLS wiki, using the Old backward compatibility, so we don’t brake stuff.

However I believe there is good chance that this may actually be the culprit.


#10

Thanks for your help. I will set it up and return back if we continue having this issue.

Would it make sense to disable Kerberos ciphers by default in HAProxy?


#11

No, I don’t think that makes sense. If we enable single ciphers (like the one from the Mozilla configuration) we need to actively maintain them and make sure updates get pushed out to all users if any of those has to be removed, which is basically impossible and will suddenly break the setup of users.

If on the other hand, we use something like HIGH:!aNULL:!MD5 as the default (which is what nginx does), it doesn’t fix the problem, as kerberos ciphers are still enabled (at least in CentOS 6.9).

We prefer to do it like Apache does: don’t overwrite the defaults of openssl, unless the users actively chooses to do so.


#12

Make sense. Anyway thanks for the help, I will report back if this mitigates the issue.