Diagnosing slow connections through HAProxy

Hi

I have a service behind HAProxy that holds on to HTTP connections for a long time (most commonly minutes, but sometimes days). There are also a lot of connections coming and going (this powers a streaming API for mobile clients, so, as you can imagine, they lose their network connections frequently).

I am noticing that these new connections take a long time to be established. For instance, the P90 connection time (as measured by a client making the connection and waiting for the initial payload, not waiting for the connection to close) is ~3s via HAProxy, compared to ~300ms when bypassing the proxy and hitting the backend directly.

Interestingly, I have a separate HAProxy cluster, with separate backends (running the same code) serving a very similar purpose, but for non-mobile clients. These connections tend to last much longer (and there are also many more of them). The HAProxy configs are identical. In this cluster, everything is fast.

Looking at the logs, this line is representative:

Sep  7 23:08:54 ip-10-10-8-243 ip-10-10-8-243 haproxy[28920]: 111.65.68.204:51450 [07/Sep/2017:23:08:53.233] front-streamer-msdk-443~ streamer-msdk-5051-out/streamer-msdk-eee8bd-10.10.5.52 0/0/555/445/+1000 200 +281 - - ---- 14238/14237/12444/519/0 0/0 "GET /mping HTTP/1.1"

The backend and proxy are both in the same AWS availability zone, and the network latency between these boxes (as measured by ping) is ~0.1ms. However, it took 555ms to establish the connection between the proxy and the backend. The 445ms to get the headers back also seems higher than I’d expect, but I’m focusing on the connect time for now.
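(For context, the slash-separated timer block in that line reads Tq/Tw/Tc/Tr/Tt, so the 555 is Tc, the TCP connect time to the backend. To look at the distribution rather than a single sample, a rough sketch like the following can pull Tc out of the log; the /var/log/haproxy.log path is an assumption, and the pattern simply grabs the first slash-separated numeric group on each line, which is the timer block:)

# print the 20 worst backend connect times (Tc, in ms) seen in the log
awk '{
  for (i = 1; i <= NF; i++)
    if ($i ~ /^[0-9-]+\/[0-9-]+\/[0-9-]+\/[0-9-]+\/[+]?[0-9-]+$/) {  # Tq/Tw/Tc/Tr/(+)Tt
      split($i, t, "/"); print t[3]; next                            # t[3] is Tc
    }
}' /var/log/haproxy.log | sort -n | tail -20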

Using curl, I tried to measure the connect time to the same backend directly, and it is very fast:

curl 10.10.5.52:5051/status --output /dev/null --silent --show-error --write-out 'lookup:        %{time_namelookup}\nconnect:       %{time_connect}\nappconnect:    %{time_appconnect}\npretransfer:   %{time_pretransfer}\nredirect:      %{time_redirect}\nstarttransfer: %{time_starttransfer}\ntotal:         %{time_total}\n'
lookup:        0.000
connect:       0.000
appconnect:    0.000
pretransfer:   0.000
redirect:      0.000
starttransfer: 0.000
total:         0.000
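To make sure that single sample isn’t a fluke, I can also run the probe in a loop and look at a rough distribution (a sketch; the address and /status endpoint are the same as in the command above):

# 100 probes; sorted by connect time, line 90 of 100 is roughly the P90
for i in $(seq 1 100); do
  curl -s -o /dev/null -w '%{time_connect} %{time_starttransfer}\n' http://10.10.5.52:5051/status
done | sort -n | sed -n '90p'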

So, something in HAProxy seems to be causing the slowness. The stats page reports:

system limits: memmax = unlimited; ulimit-n = 400069
maxsock = 400069; maxconn = 200000; maxpipes = 0
current conns = 64799; current pipes = 0/0; conn rate = 213/sec
Running tasks: 1/64834; idle = 48 %

The ‘max’ session count for the frontend is 65k, and the limit is 200k, so I don’t think we are hitting these limits. Also, the log line quoted above shows that no time was spent waiting in queues.
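For completeness, the same counters can also be pulled from the admin socket when one is configured (a sketch; the socket path is an assumption and depends on a stats socket line in the global section):

# assumes something like "stats socket /var/run/haproxy.sock level admin" in the config
echo "show info" | socat stdio unix-connect:/var/run/haproxy.sock \
  | egrep 'Maxconn|CurrConns|ConnRate|Run_queue|Idle_pct'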

What should I look at next? What might be causing this slowness?

Thanks!

Please:

  • share the complete configuration
  • post the output of haproxy -vv
  • remove “option logasap” and repost the log
  • check system and userspace CPU load
  • “perf” haproxy, to see where it spends its time (see the sketch below)

It could be that haproxy spends an excessive amount of time on a specific task that blocks the event loop (like TLS handshakes).
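For the perf step, something along these lines usually tells more than a quick live view (a sketch; the 30-second window is arbitrary, and if there are several haproxy processes you would point it at the busy worker PID):

# record 30s of samples with call graphs from the running haproxy process
sudo perf record -g -p "$(pidof haproxy)" -- sleep 30
# then look at where the time went
sudo perf report --stdio | head -40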

Thanks @lukastribus!

Here is my complete haproxy config: https://gist.github.com/pkaeding/0a20024f2b1059de7cade1fdf89651cf

haproxy -vv:

$ haproxy -vv
HA-Proxy version 1.7.5-1ppa1~xenial 2017/04/06
Copyright 2000-2017 Willy Tarreau <willy@haproxy.org>

Build options :
  TARGET  = linux2628
  CPU     = generic
  CC      = gcc
  CFLAGS  = -g -O2 -fPIE -fstack-protector-strong -Wformat -Werror=format-security -Wdate-time -D_FORTIFY_SOURCE=2
  OPTIONS = USE_ZLIB=1 USE_REGPARM=1 USE_OPENSSL=1 USE_LUA=1 USE_PCRE=1 USE_NS=1

Default settings :
  maxconn = 2000, bufsize = 16384, maxrewrite = 1024, maxpollevents = 200

Encrypted password support via crypt(3): yes
Built with zlib version : 1.2.8
Running on zlib version : 1.2.8
Compression algorithms supported : identity("identity"), deflate("deflate"), raw-deflate("deflate"), gzip("gzip")
Built with OpenSSL version : OpenSSL 1.0.2g  1 Mar 2016
Running on OpenSSL version : OpenSSL 1.0.2g  1 Mar 2016
OpenSSL library supports TLS extensions : yes
OpenSSL library supports SNI : yes
OpenSSL library supports prefer-server-ciphers : yes
Built with PCRE version : 8.38 2015-11-23
Running on PCRE version : 8.38 2015-11-23
PCRE library supports JIT : no (USE_PCRE_JIT not set)
Built with Lua version : Lua 5.3.1
Built with transparent proxy support using: IP_TRANSPARENT IPV6_TRANSPARENT IP_FREEBIND
Built with network namespace support

Available polling systems :
      epoll : pref=300,  test result OK
       poll : pref=200,  test result OK
     select : pref=150,  test result OK
Total: 3 (3 usable), will use epoll.

Available filters :
	[COMP] compression
	[TRACE] trace
	[SPOE] spoe

After removing option logasap, here are some sample log lines (since these connections are kept open for a long time, these are a few that happened to be short-lived and so have already written their logs):

Sep 13 20:07:26 ip-10-10-0-125 ip-10-10-0-125 haproxy[12903]: 199.207.253.101:16875 [13/Sep/2017:20:07:25.691] front-streamer-msdk-443~ streamer-msdk-5051-out/streamer-msdk-eee8bd-10.10.3.216 0/0/566/441/1007 401 208 - - ---- 25158/25157/23286/970/0 0/0 "GET /mping HTTP/1.1"
Sep 13 20:07:29 ip-10-10-0-125 ip-10-10-0-125 haproxy[12903]: 124.244.93.62:60416 [13/Sep/2017:20:07:27.993] front-streamer-msdk-443~ streamer-msdk-5051-out/streamer-msdk-eee8bd-10.10.5.52 0/0/460/394/1248 200 338 - - ---- 27382/27382/25459/1060/0 0/0 "GET /mping HTTP/1.1"
Sep 13 20:07:51 ip-10-10-0-125 ip-10-10-0-125 haproxy[12903]: 73.168.158.222:63287 [13/Sep/2017:20:07:50.790] front-streamer-msdk-443~ streamer-msdk-5051-out/streamer-msdk-eee8bd-10.10.4.224 0/0/191/95/313 200 338 - - ---- 29901/29901/28693/1195/0 0/0 "GET /mping HTTP/1.1"

CPU load is:

  • user: mostly <10%; some spikes to <15%
  • sys: < 1%
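If it helps, I can also grab per-process and per-core numbers, since I understand a single haproxy process could max out one core while the host average still looks low; something like this (a sketch, using the sysstat tools):

# per-process CPU, one-second samples (pick the worker PID if there are several)
pidstat -u -p "$(pidof haproxy)" 1
# per-core view, to spot a single saturated core
mpstat -P ALL 1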

I’m not terribly familiar with perf, but I gathered this from sudo perf top -p 13098 > haproxy-perf-top.txt: https://gist.github.com/pkaeding/048177c0c971a25c03157134d7abc81b

If there is something more useful I could gather with perf, I’d love to learn more.

Also, when I measured the connection times mentioned above (P90 ~3s via HAProxy, compared to ~300ms bypassing the proxy), those timings were not using the HTTPS frontend; they were only using the plain HTTP frontend (though all other traffic comes in over HTTPS). I don’t know if that means anything for your theory about TLS handshakes?

Thanks for any tips you might have!

Well, the perf output is not very useful, probably because the binary doesn’t carry the symbols. You would have to switch to a build with debug symbols to make this useful.

But let’s do something simpler to find out whether this is caused by SSL: let’s switch to 2 processes and dedicate a process to the SSL frontend (so SSL processing does not block the event loop of the port 80 frontend):

global
 nbproc 2

# keep the plain-HTTP frontend on process 1
frontend front-stream-unsafe-80
 bind-process 1

# give the SSL frontend its own process
frontend front-streamer-msdk-443
 bind-process 2

That doesn’t matter: when SSL handshakes are blocking the event loop, everything that the entire process does is blocked (meaning it is going to be delayed).

If the above test turns out to fix the delays for port 80, we would have to take a look at the actual SSL behavior. For example:

  • is the certificate RSA 2048-bit or something else?
  • can you run Vincent’s rfc5077 test tool [1] against your site?
  • can we enable tls-tickets (maybe with rotating keys)? See the sketch after the link below.

[1] vincentbernat/rfc5077 on GitHub: Various tools for testing RFC 5077 (https://github.com/vincentbernat/rfc5077)
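For the tls-tickets point, the idea is to add a tls-ticket-keys file to the bind line; this is only a sketch (the certificate and key-file paths are made up, and the real bind line is whatever is in your gist):

frontend front-streamer-msdk-443
 # same bind line as today, plus a ticket-key file
 bind :443 ssl crt /etc/haproxy/certs/streamer.pem tls-ticket-keys /etc/haproxy/tls-ticket-keys.txt

The key file holds base64-encoded 48-byte keys, one per line; HAProxy encrypts new tickets with the next-to-last key, so appending a fresh key from time to time (for example openssl rand -base64 48 >> /etc/haproxy/tls-ticket-keys.txt, followed by a reload) rotates them.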

Thanks @lukastribus. I moved the HTTP listener to its own process as you suggested, and the measurements didn’t seem to improve much, if at all. So, I guess that rules out SSL handshakes being the cause?

I notice there is a haproxy-dbg package in the Ubuntu repository. Do I just need to install that to get the debug symbols? Do I need to remove the existing (regular) HAProxy package?

Ok. The haproxy-dbg package just ships the detached debug symbols and installs alongside the regular package, so I don’t think you have to remove or replace anything.
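Something like this should be enough (a sketch):

# install the detached debug symbols next to the regular haproxy package
sudo apt-get install haproxy-dbg

After that, re-running the same perf capture should show haproxy function names instead of raw addresses.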