"SSL handshake failures" on big amount of requests


#1

Hi guys,
I’d appreciate it if anyone could give me a couple of suggestions for an issue I’m having with SSL.
I know this sounds like a certificate issue, but it only happens when there is a big spike of new connections.

I am running haproxy 1.5.14 on Azure and using SSL termination.
Haproxy works perfectly well when load rises gradually, but everything goes bad under a sudden burst of load.

In a normal situation qmax goes up to 3000 per process, and no CPU core is loaded higher than 75%.

So if I restart haproxy under daily load, CPU usage can climb to 100% and each process becomes unable to handle more than 700-800 requests.
Once it hits that limit, the rate of new requests drops to 2-5 per second, and the haproxy log becomes mostly filled with tls/1: SSL handshake failure errors.

If I add more haproxy instances into the balancing pool, things go back to normal.

I don’t have issues with entropy:

cat /proc/sys/kernel/random/entropy_avail
885

I tried to add connection rate limits:

maxsessrate 100
maxsslrate 100
maxconnrate 100
but they had no effect. Everything still stops at about 800 connections, and then the whole log fills with SSL handshake failures.

I also played around with timeouts, changing timeout connect to:

  • 500
  • 50000
  • 30s

No effect.

Can anyone suggest anything here? I have no idea how to debug this.

Here is the config file I use:

global
        log /dev/log    local0
        log /dev/log    local1 notice
        stats socket /var/run/haproxy.p1.sock mode 660 group nagios level admin process 1
        stats socket /var/run/haproxy.p2.sock mode 600 level admin process 2
        stats socket /var/run/haproxy.p3.sock mode 600 level admin process 3
        stats socket /var/run/haproxy.p4.sock mode 600 level admin process 4
        stats timeout 2m  #Wait up to 2 minutes for input
        chroot /var/lib/haproxy
        user haproxy
        group haproxy
        daemon
        nbproc 4
        cpu-map 1 0  # first arg is process number (1-based); second arg is cpu number (0-based)
        cpu-map 2 1
        cpu-map 3 2
        cpu-map 4 3
        # SSL/TLS settings
        ca-base /etc/ssl/certs
        crt-base /etc/ssl/private
        tune.ssl.default-dh-param 2048
        tune.ssl.cachesize 10000000
        tune.ssl.lifetime 86400
        #tune.ssl.maxrecord 2859
        tune.ssl.maxrecord 1400  # keep TLS records within a single TCP segment
        ssl-default-bind-options no-sslv3 no-tls-tickets
        ssl-default-bind-ciphers ECDHE-RSA-AES128-GCM-SHA256:ECDHE-RSA-AES128-SHA256:ECDHE-RSA-AES128-SHA:ECDHE-RSA-AES256-GCM-SHA384:ECDHE-RSA-AES256-SHA384:ECDHE-RSA-AES256-SHA!aNULL:!eNULL:!EXPORT:!DES:!RC4:!3DES:!MD5:!PSK
        maxconn 60000
        maxsslconn 60000
        # maxsessrate 100
        # maxsslrate 100
        # maxconnrate 100
defaults
        log     global
        option  dontlognull
        option  dontlog-normal
        timeout connect 5000
        timeout client 50000
        timeout server 50000
        bind-process all  # not needed, but worthwhile being explicit
listen stats
        bind :2100 process 1
        bind :2101 process 2
        bind :2102 process 3
        bind :2103 process 4
        mode http
        log global
        stats enable
        stats realm stats_process
        stats uri /
        stats refresh 15s
        stats show-legends
        stats show-node
        stats auth xxxxxxxxxxxxx
frontend tls
        mode tcp
        maxconn 60000
        option tcplog
        bind *:443 ssl crt-list /etc/ssl/private/certificates.txt npn http/1.1
        default_backend frontend_service
backend frontend_service
        mode tcp
        option tcplog
        option httpchk GET /status
        fullconn 60000
        # 8 second interval between health checks. 2 failures to remove a server. 2 successes to add it back
        default-server inter 8s fall 2 rise 2
        timeout check 8s
        balance leastconn
        server SRV1 SRV1:80 maxconn 2000 check port 3000
        ....
        server SRV60 SRV60:80 maxconn 2000 check port 3000

Thank you!
Pavel


#2

[quote=“SorokinPA, post:1, topic:1277”]
So if I restart haproxy under daily load, CPU usage can climb to 100% and each process becomes unable to handle more than 700-800 requests.
Once it hits that limit, the rate of new requests drops to 2-5[/quote]

When you restart haproxy, you drop all existing TLS sessions, and every client has to perform a full handshake again (because the TLS session cache is lost). Those handshakes block the event loop and slow haproxy down to a crawl.

I would strongly suggest deploying TLS tickets with key rotation [1]; this avoids the full TLS handshake penalty for browsers that support TLS tickets when haproxy is restarted.

The slowdown may simply be the result of the high number of full TLS handshakes happening at that very moment.

[1] http://cbonte.github.io/haproxy-dconv/1.7/configuration.html#5.1-tls-ticket-keys
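For reference, a sketch of what that could look like (the keys-file path is illustrative, `tls-ticket-keys` requires HAProxy 1.6 or newer, and you would also drop `no-tls-tickets` from your `ssl-default-bind-options`):

```
frontend tls
        mode tcp
        # The keys file holds one base64-encoded 48-byte secret per line.
        # Rotate keys externally (e.g. from cron) and push the new key over
        # the stats socket with "set ssl tls-key" so no reload is needed.
        bind *:443 ssl crt-list /etc/ssl/private/certificates.txt npn http/1.1 tls-ticket-keys /etc/haproxy/tls_ticket_keys
```

All keys must be shared across your haproxy instances, otherwise a ticket issued by one instance cannot be decrypted by another and the client falls back to a full handshake.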


#3

That sounds reasonable, though it needs extra care to implement.

Is there any other way to improve performance? I thought the session rate limits would help, but they didn’t.
Maybe I missed some options?

Thank you!


#4

Make sure you don’t use RSA certificates larger than 2048 bits, and serve ECDSA certificates. The ECDSA handshake is less CPU-heavy on the server side and should help a lot.
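If you go the ECDSA route, generating a P-256 key is straightforward. A sketch with placeholder file names; in production you would submit a CSR to your CA instead of self-signing, and note that serving RSA and ECDSA certificates side by side on one bind needs the multi-cert bundles introduced in HAProxy 1.7:

```shell
# Generate a P-256 (prime256v1) ECDSA key -- the curve with the broadest
# client support, and a much cheaper server-side handshake than RSA-2048.
openssl ecparam -name prime256v1 -genkey -noout -out ecdsa.key

# Self-signed certificate for testing only; use a CSR + your CA for real.
openssl req -new -x509 -key ecdsa.key -out ecdsa.crt -days 365 \
    -subj "/CN=example.com"

# haproxy's "crt" expects certificate and key concatenated in one PEM file.
cat ecdsa.crt ecdsa.key > ecdsa.pem

# Sanity check: the certificate should advertise an EC public key.
openssl x509 -in ecdsa.pem -noout -text | grep "Public Key Algorithm"
```

The resulting `ecdsa.pem` drops into the existing `crt-list` file like any other certificate.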

Confirm with Vincent’s test tool that SSL session caching and TLS tickets (once you implement them) work fine, across all instances:
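Lacking that tool, a quick check can also be done with `openssl s_client -reconnect`, which performs one full handshake and then five reconnects that should reuse the cached session; each resumed handshake is reported as a `Reused` line. A self-contained sketch against a throwaway local server (port 4433 and the file names are arbitrary; against your real frontend you would point `s_client` at your own host and port 443 instead):

```shell
set -e
workdir=$(mktemp -d)

# Throwaway key/certificate for a local test server.
openssl req -x509 -newkey rsa:2048 -nodes -days 1 -subj "/CN=localhost" \
    -keyout "$workdir/key.pem" -out "$workdir/cert.pem" 2>/dev/null

# openssl s_server keeps a session cache by default, like haproxy does.
openssl s_server -key "$workdir/key.pem" -cert "$workdir/cert.pem" \
    -accept 4433 -www >/dev/null 2>&1 &
server_pid=$!
sleep 1

# -reconnect: one full handshake, then 5 reconnects that should resume
# the session. Forcing TLS 1.2 keeps the "Reused" accounting simple.
openssl s_client -connect 127.0.0.1:4433 -tls1_2 -reconnect \
    </dev/null >"$workdir/out.txt" 2>/dev/null || true

kill "$server_pid" 2>/dev/null
reused=$(grep -c '^Reused' "$workdir/out.txt" || true)
echo "resumed handshakes: $reused"   # expect a non-zero count when caching works
```

Run the same `s_client -reconnect` check against each instance behind the balancer: if one of them reports only `New` handshakes, its session cache (or, with tickets, its ticket keys) is not in sync with the others.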