Hi
I have a service behind HAProxy that holds on to HTTP connections for a long time (most commonly minutes, but sometimes days). There are also a lot of connections comings and going (this powers a streaming API for mobile clients, so you can imagine how they will lose network connections frequently).
I am noticing that these new connections take a long time to be established. For instance, the P90 connection times (as measured by a client making the connection, and waiting for the initial payload, not waiting for the connection to close) is ~3s via HAProxy, compared to ~300ms bypassing the proxy and hitting the backend directly.
Interestingly, I have a separate HAProxy cluster, with separate backends (running the same code) serving a very similar purpose, but for non-mobile clients. These connections tend to last much longer (and there are also many more of them). The HAProxy configs are identical. In this cluster, everything is fast.
Looking at the logs, this line is representative:
Sep 7 23:08:54 ip-10-10-8-243 ip-10-10-8-243 haproxy[28920]: 111.65.68.204:51450 [07/Sep/2017:23:08:53.233] front-streamer-msdk-443~ streamer-msdk-5051-out/streamer-msdk-eee8bd-10.10.5.52 0/0/555/445/+1000 200 +281 - - ---- 14238/14237/12444/519/0 0/0 "GET /mping HTTP/1.1"
The backend and proxy are both in the same AWS availability zone, and the network latency between these boxes (as measured by ping) is ~0.1ms. However, it took 555ms to establish the connection between the proxy and the backend. The 445ms to get the headers back also seems higher than I’d expect, but I’m focusing on the connect time for now.
Using curl
, I tried to measure the connect time to hit the same backend, and I see it is very fast:
curl 10.10.5.52:5051/status --output /dev/null --silent --show-error --write-out 'lookup: %{time_namelookup}\nconnect: %{time_connect}\nappconnect: %{time_appconnect}\npretransfer: %{time_pretransfer}\nredirect: %{time_redirect}\nstarttransfer: %{time_starttransfer}\ntotal: %{time_total}\n'
lookup: 0.000
connect: 0.000
appconnect: 0.000
pretransfer: 0.000
redirect: 0.000
starttransfer: 0.000
total: 0.000
So, something in HAProxy seems to be causing the slowness. The stats page reports:
system limits: memmax = unlimited; ulimit-n = 400069
maxsock = 400069; maxconn = 200000; maxpipes = 0
current conns = 64799; current pipes = 0/0; conn rate = 213/sec
Running tasks: 1/64834; idle = 48 %
The ‘max’ session count for the front end is 65k, and the limit is 200k, so I don’t think we are hitting these limits. Also, the log line quoted above shows that no time was spent waiting in queues.
What should I look at next? What might be causing this slowness?
Thanks!