SSL handshake failure hangs HAProxy


#1

We are using HAProxy 1.8.9, but the same thing happens on 1.7.x versions. From time to time we get the following messages in the HAProxy log (the source IP is hidden):

Jul 12 15:43:36 hap-01 haproxy[26141]: x.y.z.w:47996 [12/Jul/2018:15:43:36.514] www-https/1: SSL handshake failure
Jul 12 15:43:37 hap-01 haproxy[26141]: x.y.z.w:48986 [12/Jul/2018:15:43:37.071] www-https/1: SSL handshake failure
Jul 12 15:43:37 hap-01 haproxy[26141]: x.y.z.w:49222 [12/Jul/2018:15:43:37.255] www-https/1: SSL handshake failure
Jul 12 15:43:37 hap-01 haproxy[26141]: x.y.z.w:49922 [12/Jul/2018:15:43:37.833] www-https/1: SSL handshake failure
Jul 12 15:43:38 hap-01 haproxy[26141]: x.y.z.w:50294 [12/Jul/2018:15:43:38.188] www-https/1: SSL handshake failure
Jul 12 15:43:53 hap-01 haproxy[26141]: x.y.z.w:55843 [12/Jul/2018:15:43:48.203] www-https/1: SSL handshake failure
Jul 12 15:44:03 hap-01 haproxy[26141]: x.y.z.w:60427 [12/Jul/2018:15:43:53.214] www-https/1: SSL handshake failure
Jul 12 15:44:08 hap-01 haproxy[26141]: x.y.z.w:60423 [12/Jul/2018:15:43:53.214] www-https/1: SSL handshake failure
Jul 12 15:44:08 hap-01 haproxy[26141]: x.y.z.w:50304 [12/Jul/2018:15:43:43.195] www-https/1: SSL handshake failure
Jul 12 15:44:13 hap-01 haproxy[26141]: x.y.z.w:60381 [12/Jul/2018:15:43:53.214] www-https/1: SSL handshake failure
Jul 12 15:44:28 hap-01 haproxy[26141]: x.y.z.w:50308 [12/Jul/2018:15:43:43.195] www-https/1: SSL handshake failure
Jul 12 15:44:38 hap-01 haproxy[26141]: x.y.z.w:52772 [12/Jul/2018:15:44:13.234] www-https/1: SSL handshake failure
Jul 12 15:44:43 hap-01 haproxy[26141]: x.y.z.w:50255 [12/Jul/2018:15:44:38.256] www-https/1: SSL handshake failure
Jul 12 15:44:48 hap-01 haproxy[26141]: x.y.z.w:44924 [12/Jul/2018:15:44:38.256] www-https/1: SSL handshake failure
Jul 12 15:44:48 hap-01 haproxy[26141]: x.y.z.w:34625 [12/Jul/2018:15:44:28.246] www-https/1: SSL handshake failure

During this time HAProxy becomes unresponsive. We are monitoring HAProxy with Prometheus, and the haproxy_exporter reports the following error in its log:

time="2018-07-12T15:44:04+02:00" level=error msg="Can't scrape HAProxy: Get http://localhost:9000/;csv: net/http: request canceled (Client.Timeout exceeded while awaiting headers)" source="haproxy_exporter.go:304" 
time="2018-07-12T15:44:09+02:00" level=error msg="Can't scrape HAProxy: Get http://localhost:9000/;csv: net/http: request canceled (Client.Timeout exceeded while awaiting headers)" source="haproxy_exporter.go:304" 
time="2018-07-12T15:44:34+02:00" level=error msg="Can't scrape HAProxy: Get http://localhost:9000/;csv: net/http: request canceled (Client.Timeout exceeded while awaiting headers)" source="haproxy_exporter.go:304" 
time="2018-07-12T15:44:39+02:00" level=error msg="Can't scrape HAProxy: Get http://localhost:9000/;csv: net/http: request canceled (Client.Timeout exceeded while awaiting headers)" source="haproxy_exporter.go:304" 

At first we thought it was a problem with the exporter, but one of our devs managed to log in to the host during this period, and his curl requests also timed out. This HAProxy isn’t that active, but the same thing happens on more active ones from time to time. You can view a snapshot of the Grafana dashboard for this host with all metrics collected by node_exporter. Looking at the metrics, I would say there was some connection spam, but nothing that looks unusual enough to make HAProxy unresponsive.

Do you have any idea what we can do to resolve this issue? Maybe we can enable additional logging or something similar to get more information.
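
One thing we could try the next time it happens is to query the runtime stats socket directly and compare the current connection count against the configured ceiling (just a sketch; it assumes socat is installed and uses the socket path from our config below):

# compare current connections against the configured ceiling
echo "show info" | socat stdio /var/lib/haproxy/stats | grep -E '^(Maxconn|CurrConns)'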


#2

First of all, please provide the configuration and the output of haproxy -vv.

Most likely misconfigured timeouts are leading to sessions staying there forever.
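
If that is the case, you can confirm it from the runtime stats socket once the process is reachable again (a sketch; it assumes socat is installed and a stats socket configured at /var/lib/haproxy/stats): stuck sessions will show an ever-growing age= field.

# dump active sessions; look for very old age= values
echo "show sess" | socat stdio /var/lib/haproxy/stats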

Also, please upgrade to the latest 1.8 stable release; 1.8.9 has a lot of known issues:
http://www.haproxy.org/bugs/bugs-1.8.9.html


#3

Sorry, I forgot about the additional info:

# haproxy -vv
HA-Proxy version 1.8.9-83616ec 2018/05/18
Copyright 2000-2018 Willy Tarreau <willy@haproxy.org>

Build options :
  TARGET  = linux2628
  CPU     = generic
  CC      = gcc
  CFLAGS  = -O2 -g -fno-strict-aliasing -Wdeclaration-after-statement -fwrapv -fno-strict-overflow -Wno-unused-label
  OPTIONS = USE_LINUX_TPROXY=1 USE_CRYPT_H=1 USE_GETADDRINFO=1 USE_ZLIB=1 USE_REGPARM=1 USE_OPENSSL=1 USE_LUA=1 USE_SYSTEMD=1 USE_PCRE=1

Default settings :
  maxconn = 2000, bufsize = 16384, maxrewrite = 1024, maxpollevents = 200

Built with OpenSSL version : OpenSSL 1.0.2k-fips  26 Jan 2017
Running on OpenSSL version : OpenSSL 1.0.2k-fips  26 Jan 2017
OpenSSL library supports TLS extensions : yes
OpenSSL library supports SNI : yes
OpenSSL library supports : SSLv3 TLSv1.0 TLSv1.1 TLSv1.2
Built with Lua version : Lua 5.3.4
Built with transparent proxy support using: IP_TRANSPARENT IPV6_TRANSPARENT IP_FREEBIND
Encrypted password support via crypt(3): yes
Built with multi-threading support.
Built with PCRE version : 8.32 2012-11-30
Running on PCRE version : 8.32 2012-11-30
PCRE library supports JIT : no (USE_PCRE_JIT not set)
Built with zlib version : 1.2.7
Running on zlib version : 1.2.7
Compression algorithms supported : identity("identity"), deflate("deflate"), raw-deflate("deflate"), gzip("gzip")
Built with network namespace support.

Available polling systems :
      epoll : pref=300,  test result OK
       poll : pref=200,  test result OK
     select : pref=150,  test result OK
Total: 3 (3 usable), will use epoll.

Available filters :
        [SPOE] spoe
        [COMP] compression
        [TRACE] trace

It’s CentOS 7.
Here is the config:

global
    log         /dev/log local2
    chroot      /var/lib/haproxy
    pidfile     /var/run/haproxy.pid
    nbthread 2
    maxconn     200000
    user        haproxy
    group       haproxy
    daemon
    stats socket /var/lib/haproxy/stats expose-fd listeners mode 0777 level admin
    tune.ssl.default-dh-param 2048
    ssl-default-bind-options no-sslv3

defaults
    mode http
    log global
    maxconn 200000
    backlog     32768
    retries 3
    option httplog
    option dontlognull
    option dontlog-normal
    option http-server-close
    option forwardfor except 127.0.0.0/8
    option redispatch
    timeout http-request 5s
    timeout queue 1m
    timeout connect 5s
    timeout client 1m
    timeout server 1m
    timeout http-keep-alive 10s
    timeout check 5s

frontend www-http
    bind :80
    acl letsencrypt-acl path_beg /.well-known/acme-challenge/
    reqadd X-Forwarded-Proto:\ http
    use_backend letsencrypt-backend if letsencrypt-acl
    default_backend eventserver-http

frontend www-https
    bind :443 ssl crt /etc/letsencrypt/live/removed.domain.com/fullchain-privkey.pem
    http-request set-header X-Real-IP %[src]
    http-response set-header Access-Control-Allow-Origin "*"
    reqadd X-Forwarded-Proto:\ https
    default_backend eventserver-http

backend eventserver-http
    redirect scheme https if !{ ssl_fc }
    balance roundrobin
    default-server inter 2s fastinter 1s rise 2 fall 2 on-marked-down shutdown-sessions
    server eventserver-01 eventserver-01:9000 check
    server eventserver-02 eventserver-02:9000 check

backend letsencrypt-backend
    server letsencrypt 127.0.0.1:54321

listen stats
    bind :9000
    mode http
    stats enable
    stats hide-version
    stats uri /
    stats refresh 10s

I just changed the name of the domain in the config; everything else is a copy-paste.

I have now upgraded to the latest version, but as I stated before, this has been happening to us for a few months, and it even happens on 1.7.x (multiple versions; the most recent one it happened on is 1.7.10).


#4

Ok, so here is the issue:

You’ve configured both the global and the default maxconn to 200000. That’s a very wrong thing to do, because if something hogs connections, the entire haproxy instance can no longer accept new connections, including on the management interface (as you can see, not even haproxy_exporter can reach haproxy in this situation).

So configure the maxconn values properly, meaning:

  • global maxconn needs to cover all connections from the frontends and backends (and the management interface as well)
  • frontend maxconn needs to be way lower than global
  • actually configure per-server maxconn values
  • also set maxconn in the listen section

So here is a quick suggestion based on a global maxconn of 200000 (a config sketch follows the list):

  • global maxconn 200000
  • default maxconn 20000 (1/10 of the global value)
  • a per-server maxconn of 10000
  • in the stats listen section, a maxconn of something like 1000
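
Applied to your config above, that would look roughly like this (a sketch showing only the relevant lines; everything else stays as you posted it):

global
    maxconn 200000                  # process-wide ceiling

defaults
    maxconn 20000                   # default per-frontend limit, 1/10 of global

backend eventserver-http
    server eventserver-01 eventserver-01:9000 check maxconn 10000
    server eventserver-02 eventserver-02:9000 check maxconn 10000

listen stats
    maxconn 1000                    # keep stats reachable when the frontends saturate

With a per-server maxconn set, excess requests wait in the backend queue (bounded by your timeout queue 1m) instead of piling onto the servers.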

Then wait for the issue to reoccur (it will, since we have not fixed the root cause yet), but this time you will be able to see where haproxy is hogging connections (stats will keep working, unless the stats section is what is hogging connections in the first place).


#5

Thanks, what you said makes total sense. I’ll set the maxconn values correctly, and then hopefully we will see what happens.

Stats shouldn’t be hogging the system, because it’s only scraped once every 30 seconds.