Loads of SSL handshake failure errors


#1

Hello,

We have implemented HAProxy as a replacement loadbalancer for the AWS Application Loadbalancer.
However, after some customers complained about missing visitors following the switch to HAProxy, we investigated the logs and saw a lot of SSL handshake failure errors:

Sep 4 14:18:46 loadbalancer haproxy[21591]: 106.222.222.189:55618 [04/Sep/2018:14:18:36.747] secure-http-in/1: SSL handshake failure
Sep 4 14:18:46 loadbalancer haproxy[21591]: 223.186.100.116:4945 [04/Sep/2018:14:18:35.370] secure-http-in/1: SSL handshake failure
Sep 4 14:18:47 loadbalancer haproxy[21591]: 106.207.103.103:21626 [04/Sep/2018:14:18:23.376] secure-http-in/1: SSL handshake failure
Sep 4 14:18:47 loadbalancer haproxy[21591]: 223.184.31.51:15289 [04/Sep/2018:14:18:27.450] secure-http-in/1: SSL handshake failure
Sep 4 14:18:47 loadbalancer haproxy[21591]: 106.220.80.243:14583 [04/Sep/2018:14:18:29.926] secure-http-in/1: SSL handshake failure
Sep 4 14:18:47 loadbalancer haproxy[21591]: 223.237.203.143:56317 [04/Sep/2018:14:18:27.836] secure-http-in/1: SSL handshake failure
Sep 4 14:18:48 loadbalancer haproxy[21591]: 106.203.140.9:2597 [04/Sep/2018:14:18:30.263] secure-http-in/1: SSL handshake failure

The version we are running:
# haproxy -vv
HA-Proxy version 1.8.13-1ppa1~bionic 2018/08/01

Our setup is as follows: we have 3 HAProxy instances in different regions for high availability. Combined with Route53 health checks, this should ensure that we can afford to lose a loadbalancer.

Using Let’s Encrypt we have created multiple certificates, which are shared between the loadbalancers. We have 2 listeners, 1 for HTTP and 1 for HTTPS. Depending on the requested hostname, we load different sets of backends. Our configuration is below:

global
    log /dev/log	local1 notice
    chroot      /var/lib/haproxy
    user        haproxy
    group       haproxy
    daemon
    nbproc      1
    nbthread    8
    cpu-map     auto:1/1-36 0-35

    maxconn     1000000

    tune.ssl.cachesize 1000000
    tune.ssl.default-dh-param 2048

    ssl-default-bind-ciphers ECDH+AESGCM:DH+AESGCM:ECDH+AES256:DH+AES256:ECDH+AES128:DH+AES:RSA+AESGCM:RSA+AES:!aNULL:!MD5:!DSS
    ssl-default-bind-options no-sslv3

defaults
    log         global
    mode        http
    option httplog
    option dontlog-normal
    option log-separate-errors
    option dontlognull
    option http-keep-alive
    option log-health-checks
    option redispatch
    timeout     http-keep-alive 60s
    timeout     connect 3100 # 3.1 second
    timeout     client  30s
    timeout     server  30s
    maxconn     500000
    retries     2

frontend secure-http-in
    bind *:443 ssl crt-list /etc/haproxy/cert-list.txt alpn h2,http/1.1
    mode http
    maxconn 1000000
    acl is_zone_a.com hdr_end(host) -i a.com
    use_backend backend_app1 if is_zone_a.com
    acl is_zone_b.com hdr_end(host) -i b.com
    use_backend backend_app2 if is_zone_b.com

frontend http-in
    bind *:80
    mode http
    maxconn 1000000

    acl is_zone_a.com hdr_end(host) -i a.com
    use_backend backend_app1 if is_zone_a.com
    acl is_zone_b.com hdr_end(host) -i b.com
    use_backend backend_app2 if is_zone_b.com

backend backend_app1
    mode http
    balance roundrobin
    http-reuse  always
    option httpchk GET /health.php
    http-check expect status 200
    default-server slowstart 30s check inter 10s fall 3 rise 3

    cookie DSALB insert dynamic
    dynamic-cookie-key MYKEY
    server srv1 172.16.10.1:80
    server srv2 172.16.10.2:80
    server srv3 172.16.10.3:80

backend backend_app2
    mode http
    balance roundrobin
    http-reuse  always
    option httpchk GET /health.php
    http-check expect status 200
    default-server slowstart 30s check inter 10s fall 3 rise 3

    cookie DSALB insert dynamic
    dynamic-cookie-key MYKEY
    server srv4 172.16.10.4:80
    server srv5 172.16.10.5:80
    server srv6 172.16.10.6:80

Is anyone having similar issues, or can anyone point us in the right direction? Thanks in advance!


#2

You need to find out which OS and browser the affected customers are using, so that a corrective action can be applied.

Also please share:

  • the complete output of haproxy -vv
  • details about your certificates (ECC or RSA or both?)
  • an example of your crt-list
  • an actual production site (so that I can see for myself) - you can send it to me via PM if you prefer not to publish your employer/customers

Also, a test of such a site on SSLLabs would probably reveal any obvious SSL issues as well.

Generally speaking though, a failed handshake in the logs is nothing to be worried about; you will see a lot of bogus traffic hitting your servers. Instead, this needs to be analyzed on a case by case basis.


#3

Hi lukastribus,

Thank you for your reply. Here are the answers I can share publicly:

haproxy -vv

HA-Proxy version 1.8.13-1ppa1~bionic 2018/08/01
Copyright 2000-2018 Willy Tarreau willy@haproxy.org

Build options :
TARGET = linux2628
CPU = generic
CC = gcc
CFLAGS = -g -O2 -fdebug-prefix-map=/build/haproxy-dJ8nFx/haproxy-1.8.13=. -fstack-protector-strong -Wformat -Werror=format-security -Wdate-time -D_FORTIFY_SOURCE=2
OPTIONS = USE_GETADDRINFO=1 USE_ZLIB=1 USE_REGPARM=1 USE_OPENSSL=1 USE_LUA=1 USE_SYSTEMD=1 USE_PCRE=1 USE_PCRE_JIT=1 USE_NS=1

Default settings :
maxconn = 2000, bufsize = 16384, maxrewrite = 1024, maxpollevents = 200

Built with OpenSSL version : OpenSSL 1.1.0g 2 Nov 2017
Running on OpenSSL version : OpenSSL 1.1.0g 2 Nov 2017
OpenSSL library supports TLS extensions : yes
OpenSSL library supports SNI : yes
OpenSSL library supports : TLSv1.0 TLSv1.1 TLSv1.2
Built with Lua version : Lua 5.3.3
Built with transparent proxy support using: IP_TRANSPARENT IPV6_TRANSPARENT IP_FREEBIND
Encrypted password support via crypt(3): yes
Built with multi-threading support.
Built with PCRE version : 8.39 2016-06-14
Running on PCRE version : 8.39 2016-06-14
PCRE library supports JIT : yes
Built with zlib version : 1.2.11
Running on zlib version : 1.2.11
Compression algorithms supported : identity("identity"), deflate("deflate"), raw-deflate("deflate"), gzip("gzip")
Built with network namespace support.

Available polling systems :
epoll : pref=300, test result OK
poll : pref=200, test result OK
select : pref=150, test result OK
Total: 3 (3 usable), will use epoll.

Available filters :
[SPOE] spoe
[COMP] compression
[TRACE] trace

The certificates are signed by Let’s Encrypt and created with certbot. I believe they are RSA only.
The crt-list contains just domain.com.pem, one PEM file per domain per line. These PEM files are created with the command ‘cat fullchain.pem privkey.pem > domain.pem’.

In PM I’ll send you a report of SSLLabs and an endpoint to test!


#4

In the configuration above you have included alpn h2,http/1.1, but it doesn’t seem to be actually enabled on the site you sent me.

Can you confirm what is actually the case?

I don’t see anything wrong with the configuration. It does require SNI, so Android 2 and Internet Explorer on Windows XP will not work. Java 6 doesn’t work either, because the DH group is 2048 bit (but this doesn’t affect customers using browsers). And if h2 is enabled, Chrome 49 on Windows XP will also fail if you have long URIs or large cookies.

You’ll have to find out which OS/browser actually fails that you expect to work.


#5

This is correct. I’ve just disabled this, to see if it would make any difference.

The issue here is that we do worldwide (mobile) marketing, so it could be literally anything from anywhere. The “anywhere” could be tackled by tracing the IPs, but that would still leave us with the question of what device is being used. And because it’s HTTPS, it’s hard for me to find out.


#6

Then you’ll have to eliminate SNI (and maybe h2 - until the h2 workaround can be used with openssl stable, as per the other thread).

You’ll have to request 1 public IP address per certificate. You can put up to 100 SANs in one Let’s Encrypt certificate though, so that will help.

With multiple IPs and certificates, avoiding SNI would look like this:

bind 1.2.3.4:443 ssl crt /etc/cert1.pem alpn http/1.1
bind 1.2.3.5:443 ssl crt /etc/cert2.pem alpn http/1.1
bind 1.2.3.6:443 ssl crt /etc/cert3.pem alpn http/1.1
bind 1.2.3.7:443 ssl crt /etc/cert4.pem alpn http/1.1

If you can get a certificate with all the domains you need in it, then you don’t need additional IPs at all.

However, since you just came from AWS Loadbalancer, you probably know, or at least can find out, whether SNI was used or not (it’s the “This site works only in browsers with SNI support.” line in the SSLtest report).

Not being able to reproduce the real issue is certainly limiting your ability to troubleshoot.


#7

At the moment we have more than 100 domains loaded, each with its own wildcard certificate. These domains are grouped per ‘company’. For each company we already have a public IP. I could set up your suggestion, but limited to one public IP per company. In that case I can determine for which company the errors occur, but still not per domain.

Grouping these domains in one certificate is not really desirable, especially because domains are frequently added or removed, which is not easy to maintain with Let’s Encrypt.

SNI was optional with AWS, as we loaded a default certificate per loadbalancer. However, we have multiple domains per AWS loadbalancer, so the chance of the wrong certificate being served was quite high.

Now that we have moved the traffic back to AWS in order to investigate HAProxy, we also see some TLS errors with AWS. However, it’s a little easier to see what percentage that is. For example, 2 hours ago we processed about 14 million requests through one loadbalancer, and there were about 120,000 Client TLS errors, so that’s close to 1%.

Do you know if the SSL handshake failure messages are counted somewhere in the HAProxy statistics?


#8

So I’ve split the configuration per IP. Each IP still hosts multiple certificates, but with the most important one first, which makes SNI optional for the first domain.
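
As a rough illustration of the layout described here (not the poster's actual configuration): HAProxy presents the first certificate loaded on a bind line when a client sends no SNI, or an SNI that matches no certificate, so listing the most important certificate first makes it the fallback. The IP address and file paths below are placeholders; only the frontend name is taken from the log lines in this thread.

```
frontend secure-http-in-traffic
    # The first crt on the bind line acts as the default certificate
    # for clients that send no (or a non-matching) SNI.
    bind 1.2.3.4:443 ssl crt /etc/haproxy/certs/primary-domain.pem crt /etc/haproxy/certs/other-domain.pem alpn http/1.1
```

The same ordering rule should apply when the certificates come from a crt-list file: the first entry becomes the default.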

I’ve cleared all traffic from HAProxy, making the logs pretty clean. After this I ran SSL Labs again, and this time I see a lot of SSL handshake failure errors from the SSL Labs IP address:

Sep  5 14:14:02 loadbalancer haproxy[17372]: 64.41.200.103:59980 [05/Sep/2018:14:14:02.668] secure-http-in-traffic/3: SSL handshake failure
Sep  5 14:14:03 loadbalancer haproxy[17372]: 64.41.200.103:60088 [05/Sep/2018:14:14:03.015] secure-http-in-traffic/3: SSL handshake failure
Sep  5 14:14:03 loadbalancer haproxy[17372]: 64.41.200.103:60088 [05/Sep/2018:14:14:03.015] secure-http-in-traffic/3: SSL handshake failure
Sep  5 14:14:03 loadbalancer haproxy[17372]: 64.41.200.103:60184 [05/Sep/2018:14:14:03.366] secure-http-in-traffic/3: SSL handshake failure
Sep  5 14:14:03 loadbalancer haproxy[17372]: 64.41.200.103:60184 [05/Sep/2018:14:14:03.366] secure-http-in-traffic/3: SSL handshake failure
Sep  5 14:14:03 loadbalancer haproxy[17372]: 64.41.200.103:60280 [05/Sep/2018:14:14:03.719] secure-http-in-traffic/3: SSL handshake failure
Sep  5 14:14:03 loadbalancer haproxy[17372]: 64.41.200.103:60280 [05/Sep/2018:14:14:03.719] secure-http-in-traffic/3: SSL handshake failure
Sep  5 14:14:04 loadbalancer haproxy[17372]: 64.41.200.103:60384 [05/Sep/2018:14:14:04.067] secure-http-in-traffic/3: SSL handshake failure
Sep  5 14:14:04 loadbalancer haproxy[17372]: 64.41.200.103:60384 [05/Sep/2018:14:14:04.067] secure-http-in-traffic/3: SSL handshake failure
Sep  5 14:14:04 loadbalancer haproxy[17372]: 64.41.200.103:60479 [05/Sep/2018:14:14:04.417] secure-http-in-traffic/3: SSL handshake failure
Sep  5 14:14:04 loadbalancer haproxy[17372]: 64.41.200.103:60479 [05/Sep/2018:14:14:04.417] secure-http-in-traffic/3: SSL handshake failure
Sep  5 14:14:04 loadbalancer haproxy[17372]: 64.41.200.103:60566 [05/Sep/2018:14:14:04.765] secure-http-in-traffic/3: SSL handshake failure
Sep  5 14:14:04 loadbalancer haproxy[17372]: 64.41.200.103:60566 [05/Sep/2018:14:14:04.765] secure-http-in-traffic/3: SSL handshake failure
Sep  5 14:14:05 loadbalancer haproxy[17372]: 64.41.200.103:60654 [05/Sep/2018:14:14:05.116] secure-http-in-traffic/3: SSL handshake failure
Sep  5 14:14:05 loadbalancer haproxy[17372]: 64.41.200.103:60654 [05/Sep/2018:14:14:05.116] secure-http-in-traffic/3: SSL handshake failure
Sep  5 14:14:05 loadbalancer haproxy[17372]: 64.41.200.103:60748 [05/Sep/2018:14:14:05.465] secure-http-in-traffic/3: SSL handshake failure
Sep  5 14:14:05 loadbalancer haproxy[17372]: 64.41.200.103:60748 [05/Sep/2018:14:14:05.465] secure-http-in-traffic/3: SSL handshake failure
Sep  5 14:14:05 loadbalancer haproxy[17372]: 64.41.200.103:60834 [05/Sep/2018:14:14:05.815] secure-http-in-traffic/3: SSL handshake failure
Sep  5 14:14:05 loadbalancer haproxy[17372]: 64.41.200.103:60834 [05/Sep/2018:14:14:05.815] secure-http-in-traffic/3: SSL handshake failure
Sep  5 14:14:06 loadbalancer haproxy[17372]: 64.41.200.103:60936 [05/Sep/2018:14:14:06.168] secure-http-in-traffic/3: SSL handshake failure
Sep  5 14:14:06 loadbalancer haproxy[17372]: 64.41.200.103:60936 [05/Sep/2018:14:14:06.168] secure-http-in-traffic/3: SSL handshake failure
Sep  5 14:14:06 loadbalancer haproxy[17372]: 64.41.200.103:32804 [05/Sep/2018:14:14:06.518] secure-http-in-traffic/3: SSL handshake failure
Sep  5 14:14:06 loadbalancer haproxy[17372]: 64.41.200.103:32804 [05/Sep/2018:14:14:06.518] secure-http-in-traffic/3: SSL handshake failure
Sep  5 14:14:07 loadbalancer haproxy[17372]: 64.41.200.103:32900 [05/Sep/2018:14:14:06.869] secure-http-in-traffic/3: SSL handshake failure
Sep  5 14:14:07 loadbalancer haproxy[17372]: 64.41.200.103:32900 [05/Sep/2018:14:14:06.869] secure-http-in-traffic/3: SSL handshake failure
Sep  5 14:14:07 loadbalancer haproxy[17372]: 64.41.200.103:32990 [05/Sep/2018:14:14:07.228] secure-http-in-traffic/3: SSL handshake failure
Sep  5 14:14:07 loadbalancer haproxy[17372]: 64.41.200.103:32990 [05/Sep/2018:14:14:07.228] secure-http-in-traffic/3: SSL handshake failure

The SSL Labs results say that most browsers are supported, so I wonder what the handshake failure errors are for. We still have the feeling something is ‘wrong’, but there are no signs of it anywhere.


#9

I just checked the show info statistics for the admin socket, but they don’t contain a counter for handshake failures:

lukas@www:~$ echo "show info" | sudo socat - /run/haproxy/admin.sock | grep Ssl
MaxSslConns: 0
CurrSslConns: 0
CumSslConns: 19
SslRate: 0
SslRateLimit: 0
MaxSslRate: 2
SslFrontendKeyRate: 0
SslFrontendMaxKeyRate: 1
SslFrontendSessionReuse_pct: 0
SslBackendKeyRate: 0
SslBackendMaxKeyRate: 1
SslCacheLookups: 7
SslCacheMisses: 7
lukas@www:~$

The total number of SSL handshakes would be CumSslConns. So maybe you can compare that number with the number of handshake failures from your logs to get a percentage of failed handshakes.

I don’t think there is anything wrong at this point. You just have an Internet-facing server that gets a lot of bad handshakes. It may be that some providers try to intercept SSL and modify or downgrade the handshake, which current OpenSSL releases are protected against.

From your log it looks like you have a specific IP address that keeps causing handshake failures. At this point I’d suggest capturing the handshake of that particular IP address with tcpdump (tcpdump -ns0 -i eth0 -w capture-handshake-64.41.200.103.cap host 64.41.200.103), so that we can take a look at it.


#10

This specific IP comes directly from SSL Labs when running a test. I started tcpdump while running a new check with SSL Labs. The results can be found here:
http://www.level23.nl/capture-64.41.200.103.cap.zip

The ‘best’ I can get out of it:

Secure Sockets Layer
    TLSv1 Record Layer: Alert (Level: Fatal, Description: Handshake Failure)
        Content Type: Alert (21)
        Version: TLS 1.0 (0x0301)
        Length: 2
        Alert Message
            Level: Fatal (2)
            Description: Handshake Failure (40)

That’s about where my knowledge of tcpdump ends :wink:


#11

Well ok, it’s obvious that SSLtest is going to generate handshake failures, because that’s all the SSLtest does: sending all kinds of old, bogus, obsolete and incorrect SSL client hellos to understand how the server reacts and what kind of old junk the server still accepts.

Nothing wrong with that.


#12

Just cleared some counters and enabled some traffic for a short moment on HAProxy:

echo > /var/log/haproxy.log && service haproxy reload && date

Wed Sep 5 15:34:50 CEST 2018

The results:

date && echo -n "Failures: " && cat /var/log/haproxy.log | grep 'SSL handshake failure' | wc -l && echo "show info" | sudo socat - /var/run/haproxy.sock | grep CumSslConns

Wed Sep 5 15:40:47 CEST 2018
Failures: 1149
CumSslConns: 3999

date && echo -n "Failures: " && cat /var/log/haproxy.log | grep 'SSL handshake failure' | wc -l && echo "show info" | sudo socat - /var/run/haproxy.sock | grep CumSslConns

Wed Sep 5 15:43:35 CEST 2018
Failures: 8625
CumSslConns: 20971

If you ask me, that’s quite a portion of failures! At least a lot more than we notice with AWS. Any more ideas?
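
To put numbers on the two samples in this post, here is a minimal sketch of the comparison suggested earlier in the thread. The helper name failure_pct is made up for illustration, and the sketch assumes CumSslConns counts every handshake attempt, including the failed ones.

```shell
#!/bin/sh
# failure_pct FAILURES TOTAL  (hypothetical helper, not part of HAProxy)
# Prints FAILURES as a percentage of TOTAL with two decimals.
# Assumption: TOTAL (CumSslConns) includes failed handshakes as well.
failure_pct() {
    awk -v failed="$1" -v total="$2" 'BEGIN {
        if (total <= 0) { print "n/a"; exit }
        printf "%.2f\n", failed * 100 / total
    }'
}

failure_pct 1149 3999     # sample at 15:40:47
failure_pct 8625 20971    # sample at 15:43:35
```

Under that assumption the two samples come out at roughly 28.7% and 41.1% failed handshakes.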


#13

I agree the percentage is high - but that doesn’t mean there is a problem on your side. A failing handshake may cause the client on the other side to retry forever, causing huge numbers of an ever-repeating single handshake failure.

I suggest you take a look at the logs again, and pick an IP address with a large number of failures, but one that is not an artificial simulation such as SSLtest and capture those handshakes. Then we can take a look at those and hopefully find out more.


#14

I’m going to dive deeper into the failed cases. To start with, there’s not a single case where the error occurs only once:

# cat haproxy.log | grep 'SSL handshake failure' | awk '{print $6}' | awk -F':' '{print $1}' | sort | uniq -u
- No output

Then there are the IPs that occur far more often:

cat haproxy.log | grep 'SSL handshake failure' | awk '{print $6}' | awk -F':' '{print $1}' | sort | uniq -c | sort -n | tail -10

 10 42.109.129.137
 10 42.109.156.104
 12 181.13.77.39
 12 181.9.152.222
 12 186.143.138.185
 12 213.233.132.159
 12 41.215.163.23
 16 190.2.151.105
 24 213.233.132.157
 32 32.215.223.247

However, this doesn’t mean it’s one and the same visitor. As mentioned before, we handle a lot of mobile traffic, where operators might proxy their traffic through one and the same outgoing IP address. I’ll try to collect some dumps of traffic that we expect to work normally but doesn’t.
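
The per-IP count above can also be done in a single awk pass. The sketch below uses a made-up function name (top_failing_ips) and assumes the syslog layout shown in this thread, where the sixth field is "ip:port" with an IPv4 address. The log excerpt in post #8 shows every line twice and all counts in the list above are even, so the sketch further assumes that byte-identical lines are duplicate log entries and drops them first. Remove the sort -u step if that does not hold for your logging setup.

```shell
#!/bin/sh
# top_failing_ips LOGFILE [LIMIT]  (hypothetical helper for illustration)
# Lists the client IPs with the most "SSL handshake failure" lines.
# Assumes field 6 is "ip:port" (IPv4), as in the log lines in this thread.
top_failing_ips() {
    logfile=$1
    limit=${2:-10}
    # sort -u drops byte-identical lines (same client port and timestamp),
    # on the assumption that they are duplicate entries, not new attempts.
    grep -F 'SSL handshake failure' "$logfile" \
        | sort -u \
        | awk '{ split($6, addr, ":"); count[addr[1]]++ }
               END { for (ip in count) print count[ip], ip }' \
        | sort -k1,1nr -k2,2 \
        | head -n "$limit"
}
```

Calling top_failing_ips haproxy.log 10 prints one "count ip" pair per line, highest count first.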


#15

So I’ve done some more research and wanted to match all ciphers on the AWS loadbalancer. I noticed that the AWS loadbalancer had the cipher TLS_RSA_WITH_3DES_EDE_CBC_SHA, which was not available on my loadbalancer.

As I’ve installed Ubuntu 18.04 for our new loadbalancers, we also got a new version of openssl, version 1.1.0g. And as of version 1.1.x of openssl, 3DES is disabled by default.

After installing a new loadbalancer with Ubuntu 16.04, shipped with openssl 1.0.2g (easier than downgrading openssl), we were able to configure HAProxy to match the ciphers of the AWS loadbalancer.

Running the same tests gives us the following results:

echo > /var/log/haproxy.log && service haproxy reload && date

Fri Sep 7 10:20:34 CEST 2018

date && echo -n "Failures: " && cat /var/log/haproxy.log | grep 'SSL handshake failure' | wc -l && echo 'show info' | sudo socat - /var/run/haproxy.sock | grep CumSslConns

Fri Sep  7 10:30:04 CEST 2018
Failures: 14
CumSslConns: 28575

As you can see, these counters are much better! We still have to ask ourselves why there’s so much traffic using these old ciphers and whether it is worth continuing to support them, but that’s a whole other question.

I think we can mark this topic as resolved. Thank you very much for the support @lukastribus!


#16

One more question: is there any way to check, for a successful request, which cipher the client used? This would allow me to tag these visitors in order to track them. I’ve tried searching but couldn’t find anything.


Got it:

http-request add-header X-Custom-SSL-Cipher %sslc
http-request add-header X-Custom-SSL-Version %sslv

#17

Makes sense. This could be Internet Explorer on Windows XP or extremely old phones.

You can also put the SSL variables (version and cipher) into haproxy logs, maybe along with the User-Agent, by using a custom log-format.
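
A minimal sketch of what such a log-format could look like, assuming it goes into the HTTPS frontend from post #1. The field selection and the 128-byte capture length are arbitrary choices for illustration. Since the defaults section there sets option dontlog-normal, successful requests would not be logged at all unless that option is switched off for this frontend.

```
frontend secure-http-in
    # Assumption: successful requests should be logged here as well
    # (the defaults section in post #1 enables dontlog-normal).
    no option dontlog-normal
    # Capture the User-Agent request header so that %hr can print it.
    capture request header User-Agent len 128
    # %sslv = TLS version, %sslc = negotiated cipher, %hr = captured request headers
    log-format "%ci:%cp [%tr] %ft %sslv %sslc %ST %hr"
```

Note that this only covers connections that complete the handshake. The "SSL handshake failure" lines are emitted before any HTTP request exists, so they would not carry these fields.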

It would be interesting to see what User-Agent those 3DES clients have.

Interestingly, Cloudflare mentioned in April 2017 that they still see 0.1% of traffic with 3DES, so yes, there seems to be quite a number of those clients out there.


#18

Hi there. I’ve balanced the traffic between both custom loadbalancers with forwarding of the cipher and user agent to the backend.

Now the strange thing: I don’t see any 3DES ciphers being used on either of the two loadbalancers.
If I count the ciphers used, based on the log files:

New (Ubuntu 18.04):

cat /var/log/haproxy.log | grep 'TRAFFIC' | awk '{print $8}' | sort | uniq -c

1772 ECDHE-RSA-AES128-GCM-SHA256
204 ECDHE-RSA-AES128-SHA
5916 ECDHE-RSA-AES256-GCM-SHA384

Old (Ubuntu 16.04):

cat /var/log/haproxy.log | grep 'TRAFFIC' | awk '{print $8}' | sort | uniq -c

1875 ECDHE-RSA-AES128-GCM-SHA256
5730 ECDHE-RSA-AES256-GCM-SHA384
377 ECDHE-RSA-AES256-SHA

If this is true, it is possible that HAProxy 1.8 has a bug with this newer version of OpenSSL.


#19

I think it’s far more likely that you are affected by changes in openssl, like the removal of SSL_OP_TLS_BLOCK_PADDING_BUG for example.

If you want the confirmation whether this is about a change in openssl, you could port your config to nginx, and try the same there (of course, using the same systems as you used with haproxy).

Btw, your configuration:

ssl-default-bind-ciphers ECDH+AESGCM:DH+AESGCM:ECDH+AES256:DH+AES256:ECDH+AES128:DH+AES:RSA+AESGCM:RSA+AES:!aNULL:!MD5:!DSS

does not permit TLS_RSA_WITH_3DES_EDE_CBC_SHA, whatever the openssl version. You’d have to add it (appending it to the end is fine):

:DES-CBC3-SHA

#20

The configuration I posted was from before trying out 3DES. In the meantime I had already added RSA+3DES to the bind-ciphers, causing TLS_RSA_WITH_3DES_EDE_CBC_SHA to show up when running:

nmap --script ssl-enum-ciphers -p 443

However, because 3DES does not seem to be the cause, I removed it from both loadbalancers. The new loadbalancer running the older version of OpenSSL is not producing the SSL handshake failures even with 3DES disabled, so 3DES is not the cause.

More likely it’s something like you said, caused by changes in OpenSSL. I have to decide for myself whether to investigate what the cause is or just accept it.