We have implemented HAProxy as a replacement load balancer for the AWS Application Load Balancer.
However, after customers complained about missing visitors following the switch to HAProxy, we investigated the logs and saw a lot of SSL handshake failure errors.
The version we are running:
# haproxy -vv
HA-Proxy version 1.8.13-1ppa1~bionic 2018/08/01
Our setup is as follows: we have 3 HAProxy instances in different regions for high availability. Combined with Route53 health checks, we try to make sure that a failing load balancer can be tolerated without an outage.
Using Let's Encrypt we have created multiple certificates which are shared between the load balancers. We have 2 listeners, 1 for HTTP and 1 for HTTPS. Depending on the requested hostname, we load different sets of backends. Below is our configuration:
global
log /dev/log local1 notice
chroot /var/lib/haproxy
user haproxy
group haproxy
daemon
nbproc 1
nbthread 8
cpu-map auto:1/1-36 0-35
maxconn 1000000
tune.ssl.cachesize 1000000
tune.ssl.default-dh-param 2048
ssl-default-bind-ciphers ECDH+AESGCM:DH+AESGCM:ECDH+AES256:DH+AES256:ECDH+AES128:DH+AES:RSA+AESGCM:RSA+AES:!aNULL:!MD5:!DSS
ssl-default-bind-options no-sslv3
defaults
log global
mode http
option httplog
option dontlog-normal
option log-separate-errors
option dontlognull
option http-keep-alive
option log-health-checks
option redispatch
timeout http-keep-alive 60s
timeout connect 3100 # 3.1 seconds
timeout client 30s
timeout server 30s
maxconn 500000
retries 2
frontend secure-http-in
bind *:443 ssl crt-list /etc/haproxy/cert-list.txt alpn h2,http/1.1
mode http
maxconn 1000000
acl is_zone_a.com hdr_end(host) -i a.com
use_backend backend_app1 if is_zone_a.com
acl is_zone_b.com hdr_end(host) -i b.com
use_backend backend_app2 if is_zone_b.com
frontend http-in
bind *:80
mode http
maxconn 1000000
acl is_zone_a.com hdr_end(host) -i a.com
use_backend backend_app1 if is_zone_a.com
acl is_zone_b.com hdr_end(host) -i b.com
use_backend backend_app2 if is_zone_b.com
backend backend_app1
mode http
balance roundrobin
http-reuse always
option httpchk GET /health.php
http-check expect status 200
default-server slowstart 30s check inter 10s fall 3 rise 3
cookie DSALB insert dynamic
dynamic-cookie-key MYKEY
server srv1 172.16.10.1:80
server srv2 172.16.10.2:80
server srv3 172.16.10.3:80
backend backend_app2
mode http
balance roundrobin
http-reuse always
option httpchk GET /health.php
http-check expect status 200
default-server slowstart 30s check inter 10s fall 3 rise 3
cookie DSALB insert dynamic
dynamic-cookie-key MYKEY
server srv4 172.16.10.4:80
server srv5 172.16.10.5:80
server srv6 172.16.10.6:80
Is there anyone having similar issues, or who can point us in the right direction? Thanks in advance!
You need to find out which OS and browser the affected customers use, so that a corrective action can be applied.
Also please share:
the complete output of haproxy -vv
details about your certificates (ECC or RSA or both?)
an example of your crt-list
an actual production site (so that I can see for myself) - you can send it to me via PM if you prefer not to publish your employer/customers
Also, a test of such a site on SSLLabs would probably reveal any obvious SSL issues as well.
Generally speaking though, a failed handshake in the logs is nothing to be worried about; you will see a lot of bogus traffic hitting your servers. Instead, this needs to be analyzed on a case by case basis.
Built with OpenSSL version : OpenSSL 1.1.0g 2 Nov 2017
Running on OpenSSL version : OpenSSL 1.1.0g 2 Nov 2017
OpenSSL library supports TLS extensions : yes
OpenSSL library supports SNI : yes
OpenSSL library supports : TLSv1.0 TLSv1.1 TLSv1.2
Built with Lua version : Lua 5.3.3
Built with transparent proxy support using: IP_TRANSPARENT IPV6_TRANSPARENT IP_FREEBIND
Encrypted password support via crypt(3): yes
Built with multi-threading support.
Built with PCRE version : 8.39 2016-06-14
Running on PCRE version : 8.39 2016-06-14
PCRE library supports JIT : yes
Built with zlib version : 1.2.11
Running on zlib version : 1.2.11
Compression algorithms supported : identity("identity"), deflate("deflate"), raw-deflate("deflate"), gzip("gzip")
Built with network namespace support.
Available polling systems :
epoll : pref=300, test result OK
poll : pref=200, test result OK
select : pref=150, test result OK
Total: 3 (3 usable), will use epoll.
Available filters :
[SPOE] spoe
[COMP] compression
[TRACE] trace
The certificates are signed by letsencrypt, created with certbot. I believe they are RSA only.
The crt-list contains just domain.com.pem, one pem file per domain per line. These pem files are created with the command "cat fullchain.pem privkey.pem > domain.pem".
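For reference, that means one combined pem path per line in the crt-list, and one concatenation per domain, roughly like this (paths are examples based on the usual certbot layout):
/etc/haproxy/certs/a.com.pem
/etc/haproxy/certs/b.com.pem
cat /etc/letsencrypt/live/a.com/fullchain.pem /etc/letsencrypt/live/a.com/privkey.pem > /etc/haproxy/certs/a.com.pem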
In a PM I'll send you an SSLLabs report and an endpoint to test!
In the configuration above you have included alpn h2,http/1.1, but it doesn't seem to be actually enabled on the site you sent me.
Can you confirm what is actually the case?
I don't see anything wrong with the configuration. It does require SNI, so Android 2 and Internet Explorer on Windows XP will not work. Also, Java 6 doesn't work because the DH group is 2048 bits (but this doesn't affect customers accessing with browsers). And if h2 is enabled, Chrome 49 on Windows XP also will not work if you have long URIs or large cookies.
You'll have to understand what the actual OS/browser is that fails, and that you expect to work.
This is correct. I've just disabled this, to see if it would make any difference.
The issue here is that we do worldwide (mobile) marketing, so it could be literally anything from anywhere. The "anywhere" could be tackled by tracing the IPs, but that would still leave us with the question of what device is being used. And because it's HTTPS, it's hard for me to find out.
Then you'll have to eliminate SNI (and maybe h2 - until the h2 workaround can be used with openssl stable, as per the other thread).
You'll have to request 1 public IP address per certificate. But you can put up to 100 SANs in one Let's Encrypt certificate, so that will help.
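For example, certbot can issue a single certificate covering several domains by passing multiple -d flags (a sketch; the webroot path and domains are placeholders):
certbot certonly --webroot -w /var/www/html -d a.com -d www.a.com -d b.com -d www.b.com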
With multiple IPs and certificates, avoiding SNI would look like this:
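(A minimal sketch; the IPs and certificate paths are placeholders. Each bind carries a single certificate, so it is served to every client on that IP and SNI is no longer required.)
frontend secure-http-in-company-a
bind 203.0.113.10:443 ssl crt /etc/haproxy/certs/a.com.pem alpn h2,http/1.1
mode http
default_backend backend_app1
frontend secure-http-in-company-b
bind 203.0.113.11:443 ssl crt /etc/haproxy/certs/b.com.pem alpn h2,http/1.1
mode http
default_backend backend_app2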
If you can get a certificate with all the domains you need in it, then you don't need additional IPs at all.
However, since you just came from the AWS load balancer, you probably know, or at least can find out, whether SNI was used or not (it's the "This site works only in browsers with SNI support" note on the SSLtest report).
Not being able to reproduce the real issue is certainly limiting your ability to troubleshoot.
At the moment we have more than 100 domains loaded, each with their own wildcard certificate. These domains are grouped per "company". For each company we already have a public IP. I could set up your suggestion, but limited to one public IP per company. In that case I can determine for which company the errors occur, but still not per domain.
Grouping these domains into one certificate is not really desirable, especially because domains are frequently added or removed, which is not easy to maintain with Let's Encrypt.
SNI was optional with AWS as we loaded a default certificate per load balancer; however, we had multiple domains per AWS load balancer, so the chance of a wrong certificate being loaded was quite high.
Now that we have moved the traffic back to AWS to investigate HAProxy, we also see some TLS errors with AWS. However, it is a little easier to see what percentage that is. For example, 2 hours ago we processed about 14 million requests through one load balancer, and there were about 120,000 Client TLS errors, so that's close to 1%.
Do you know if the SSL Handshake Failure messages are being counted somewhere in the HAProxy statistics?
So I've split the configuration per IP. Each IP still hosts multiple certificates, but with the most important one first, making SNI optional for the first domain.
I've cleared all traffic from HAProxy, making the logs pretty clean. After this I ran SSLLabs again, and this time, from the IP address of SSLLabs, I see a lot of SSL handshake failure errors:
Sep 5 14:14:02 loadbalancer haproxy[17372]: 64.41.200.103:59980 [05/Sep/2018:14:14:02.668] secure-http-in-traffic/3: SSL handshake failure
The results of SSL Labs say that most browsers are supported, so I wonder what the handshake failure errors are for? We still have the feeling something is "wrong", but there are no signs anywhere.
The total number of SSL handshakes would be CumSslConns. So maybe you can compare that number with the number of handshake failures from your logs to get a percentage of failed handshakes.
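A rough way to get both numbers (assuming a stats socket is configured in the global section, e.g. "stats socket /run/haproxy/admin.sock mode 660 level admin", and that the errors go to /var/log/haproxy.log):
echo "show info" | socat stdio /run/haproxy/admin.sock | grep CumSslConns
grep -c 'SSL handshake failure' /var/log/haproxy.log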
I don't think there is anything wrong at this point. You just have an Internet-facing server that gets a lot of bad handshakes. It can be that some providers try to intercept SSL and change the handshake or try to downgrade it, which current openssl releases are protected against.
From your log it looks like you have a specific IP address that keeps causing handshake failures. At this point I'd suggest a tcpdump (tcpdump -ns0 -i eth0 -w capture-handshake-64.41.200.103.cap host 64.41.200.103) capture of the handshake of that particular IP address, so that we can take a look at that particular handshake.
This specific IP belongs to SSLLabs when running a test. I've started the tcpdump while doing a new check with SSLLabs. The results can be found here: http://www.level23.nl/capture-64.41.200.103.cap.zip
Well ok, it's obvious that SSLtest is going to generate handshake failures, because that's all SSLtest does: sending all kinds of old, bogus, obsolete and incorrect SSL client hellos to understand how the server reacts and what kind of old junk the server still accepts.
I agree the percentage is high - but that doesn't mean there is a problem on your side. A failing handshake may cause the client on the other side to retry forever, causing huge numbers of an ever-repeating single handshake failure.
I suggest you take a look at the logs again, and pick an IP address with a large number of failures, but one that is not an artificial simulation such as SSLtest, and capture those handshakes. Then we can take a look at those and hopefully find out more.
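Something along these lines would show the client IPs with the most failures (a sketch, assuming the failures end up in /var/log/haproxy.log):
grep 'SSL handshake failure' /var/log/haproxy.log | grep -oE '([0-9]{1,3}\.){3}[0-9]{1,3}' | sort | uniq -c | sort -rn | head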
However, this doesn't mean it's one and the same visitor. As said before, we handle a lot of mobile traffic, where operators might proxy their traffic through one and the same outgoing IP address. I'll try to collect some dumps of traffic where we expect it to work normally but it doesn't.
So I've done some more research and wanted to match all ciphers of the AWS load balancer. I noticed that the AWS load balancer had the cipher TLS_RSA_WITH_3DES_EDE_CBC_SHA, which was not available on my load balancer.
As I had installed Ubuntu 18.04 for our new load balancers, we also got a new version of OpenSSL, version 1.1.0g. And as of version 1.1.x of OpenSSL, 3DES is disabled by default.
After installing a new load balancer with Ubuntu 16.04, shipped with OpenSSL 1.0.2g (easier than downgrading OpenSSL), we were able to configure HAProxy to match the ciphers of the AWS load balancer.
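Matching the AWS cipher set essentially meant appending 3DES to the bind ciphers, roughly like this (a sketch based on the configuration above; RSA+3DES added at the end):
ssl-default-bind-ciphers ECDH+AESGCM:DH+AESGCM:ECDH+AES256:DH+AES256:ECDH+AES128:DH+AES:RSA+AESGCM:RSA+AES:RSA+3DES:!aNULL:!MD5:!DSS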
Running the same tests gives us the following results:
echo > /var/log/haproxy.log && service haproxy reload && date
Fri Sep 7 10:30:04 CEST 2018
Failures: 14
CumSslConns: 28575
As you can see, these counters are much better! We still have to ask ourselves why there is so much traffic relying on these old ciphers and whether it is worth keeping them supported, but that's a whole other question.
I think we can mark this topic as resolved. Thank you very much for the support @lukastribus!
One more question: is there any possibility to check, for a succeeded request, which cipher was used by the client? This would allow me to put a tag on these visitors to track them. I've tried to search but couldn't find anything.
Makes sense. This could be Internet Explorer on Windows XP or extremely old phones.
You can also put the SSL variables (version and cipher) into haproxy logs, maybe along with the User-Agent, by using a custom log-format.
It would be interesting to see what User-Agent those 3DES clients have.
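A minimal sketch of what that could look like in the frontend (the exact log-format string is an example, not the one used here): %sslv and %sslc expand to the TLS version and cipher, and the captured User-Agent shows up in the %hr block:
capture request header User-Agent len 128
log-format "%ci:%cp [%tr] %ft %b/%s %ST %B %sslv %sslc %hr %{+Q}r"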
Interestingly, Cloudflare mentioned in April 2017 that they still saw 0.1% of traffic using 3DES, so yes, there seems to be quite a number of those clients out there.
Hi there. I've balanced the traffic between both custom load balancers, with forwarding of the cipher and user agent to the backend.
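(The forwarding itself can be done with something like the following in the frontend; a sketch, the header names are arbitrary and the User-Agent is already passed through as a normal request header:)
http-request set-header X-SSL-Protocol %[ssl_fc_protocol]
http-request set-header X-SSL-Cipher %[ssl_fc_cipher]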
Now the strange thing happens: I don't see any 3DES ciphers being used on either of the two load balancers.
If I count the ciphers being used based on the log files, 3DES does not show up at all.
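One way to do that count, assuming the cipher is logged via %sslc as in the log-format sketch above (the awk field number depends on the actual log-format, and DES-CBC3-SHA is the OpenSSL name for TLS_RSA_WITH_3DES_EDE_CBC_SHA):
awk '{print $12}' /var/log/haproxy.log | sort | uniq -c | sort -rn
grep -c 'DES-CBC3-SHA' /var/log/haproxy.log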
I think it's far more likely that you are affected by changes in openssl, like the removal of SSL_OP_TLS_BLOCK_PADDING_BUG for example.
If you want confirmation of whether this is about a change in openssl, you could port your config to nginx and try the same there (of course, using the same systems as you used with haproxy).
The configuration I posted was before trying out 3DES. In the meantime I had already added RSA+3DES to the bind-ciphers, causing TLS_RSA_WITH_3DES_EDE_CBC_SHA to show up when running:
nmap --script ssl-enum-ciphers -p 443
However, because 3DES does not seem to be the cause, I removed it from both load balancers. The new load balancer running on the older version of OpenSSL is not giving the SSL handshake failures even with 3DES disabled, so this is not the cause.
More likely it's something like you said, caused by changes in OpenSSL. I have to decide for myself whether to investigate the cause or just accept it.