We already reported this right after the release of 1.9, and my former colleague Ashwin N. had a longer exchange with Willy about it.
We observed significantly increased latency introduced by HAProxy 1.9 once the request rate rises beyond 40-50 requests per second (RPS). After the release of 2.0 we were hoping the issue might have been resolved, but unfortunately it is still there.
Here are our observations:

- We upgraded from 1.8.17 to 2.0.2.
- We did not change the config file.
- We started 2.0.2 in master/worker mode (the -Ws and -S command line parameters).
- We typically have between 8 and 10 backends, but some backends can have 1000+ servers.
- We use HAProxy both as ingress and egress for our services, i.e. requests to the service hit HAProxy on an HTTPS endpoint, which then connects to our service via localhost:8080. Requests to downstream services are made to service.localhost:80, which HAProxy resolves via a host ACL to the corresponding backend. Connections to backend servers use HTTPS. (See the config sketch after this list.)
- We are using the roundrobin load balancing scheme.
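To make the ingress/egress description above concrete, here is a stripped-down sketch of what the relevant parts of such a config look like. Hostnames, ports, certificate paths, and backend/server names are placeholders for illustration, not our real config:

```
# Illustrative sketch only -- names, addresses and paths are placeholders.
defaults
    mode http
    timeout connect 5s
    timeout client  30s
    timeout server  30s

# Ingress: external HTTPS traffic is handed to the local service process
frontend fe_ingress
    bind :443 ssl crt /etc/haproxy/certs/service.pem
    default_backend be_local_service

backend be_local_service
    server app 127.0.0.1:8080

# Egress: the service talks to <name>.localhost:80, routed by Host header
frontend fe_egress
    bind 127.0.0.1:80
    acl host_foo hdr(host) -i foo.localhost
    use_backend be_foo if host_foo

backend be_foo
    balance roundrobin
    # some backends have 1000+ servers; connections to them use HTTPS
    server srv1 10.0.0.1:443 ssl verify none
    server srv2 10.0.0.2:443 ssl verify none
```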
In our environment, we started observing significantly increased p99 latency with 1.9 and 2.0.2. Our service "normally" reports a p99 response time to a downstream service of 600ms. After the upgrade, this p99 latency went up to 1200ms (yes, 1.2 seconds!), so HAProxy 2.0.2 appears to introduce extra latency of up to 600ms! The "average" processing time is measured at 55ms, and with HAProxy 2.0.2 it went up to 68ms, i.e. an average latency increase of 13ms!
Also important to mention: we run about 80 instances of our service in parallel. To test HAProxy 2.0.2, we deployed the new version to two instances and ran 2.0.2 and 1.8.17 side by side, so we could directly compare the metrics coming from both versions.
This latency increase cannot be seen in low-RPS environments (below about 40-50 RPS), where 1.8 and 1.9/2.0 seem to behave exactly the same.
This is a HUGE blocker for us in moving from our current HAProxy 1.8.x to the latest version, since we need to take advantage of the newly introduced connection pooling to backend servers.
Again, Willy should already have enough details from our previous interaction. He mentioned he wanted to look into it and hopefully find a fix, but we never heard back…
Any help would be greatly appreciated!
Thank you
P.S. I will be out of the country for one week starting tomorrow, without an internet connection, so I will be unable to respond to questions until I return.
@lukastribus I recompiled HAProxy 1.9.9 with the suggested DEFINE. Here is the output from haproxy -vv:
haproxy -vv
HA-Proxy version 1.9.9-1.box.el7 2019/07/23 - https://haproxy.org/
Build options :
TARGET = linux2628
CPU = generic
CC = gcc
CFLAGS = -O2 -g -fno-strict-aliasing -Wdeclaration-after-statement -fwrapv -Wno-unused-label -Wno-sign-compare -Wno-unused-parameter -Wno-old-style-declaration -Wno-ignored-qualifiers -Wno-clobbered -Wno-missing-field-initializers -Wtype-limits -DMAX_READ_POLL_LOOPS=1
OPTIONS = USE_LINUX_TPROXY=1 USE_GETADDRINFO=1 USE_ZLIB=1 USE_REGPARM=1 USE_OPENSSL=1 USE_SYSTEMD=1 USE_PCRE=1
Default settings :
maxconn = 2000, bufsize = 16384, maxrewrite = 1024, maxpollevents = 200
Built with OpenSSL version : OpenSSL 1.1.1c 28 May 2019
Running on OpenSSL version : OpenSSL 1.1.1c 28 May 2019
OpenSSL library supports TLS extensions : yes
OpenSSL library supports SNI : yes
OpenSSL library supports : TLSv1.0 TLSv1.1 TLSv1.2 TLSv1.3
Built with transparent proxy support using: IP_TRANSPARENT IPV6_TRANSPARENT IP_FREEBIND
Built with zlib version : 1.2.7
Running on zlib version : 1.2.7
Compression algorithms supported : identity("identity"), deflate("deflate"), raw-deflate("deflate"), gzip("gzip")
Built with PCRE version : 8.32 2012-11-30
Running on PCRE version : 8.32 2012-11-30
PCRE library supports JIT : no (USE_PCRE_JIT not set)
Encrypted password support via crypt(3): yes
Built with multi-threading support.
Available polling systems :
epoll : pref=300, test result OK
poll : pref=200, test result OK
select : pref=150, test result OK
Total: 3 (3 usable), will use epoll.
Available multiplexer protocols :
(protocols marked as <default> cannot be specified using 'proto' keyword)
h2 : mode=HTX side=FE|BE
h2 : mode=HTTP side=FE
<default> : mode=HTX side=FE|BE
<default> : mode=TCP|HTTP side=FE|BE
Available filters :
[SPOE] spoe
[COMP] compression
[CACHE] cache
[TRACE] trace
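For reference, the rebuild was done along these lines. The USE_* flags mirror the OPTIONS line in the -vv output above, but treat the command as a sketch rather than our exact build script:

```
# Rebuild 1.9.9 with the suggested define appended to the compiler flags
make clean
make TARGET=linux2628 \
     USE_LINUX_TPROXY=1 USE_GETADDRINFO=1 USE_ZLIB=1 USE_REGPARM=1 \
     USE_OPENSSL=1 USE_SYSTEMD=1 USE_PCRE=1 \
     DEFINE="-DMAX_READ_POLL_LOOPS=1"
```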
So, how exactly can I confirm that the change is actually in the new compiled binary?
By the way, uname -a is reporting as
Linux <hostname> 3.10.0-862.3.3.el7.x86_64 #1 SMP Fri Jun 15 04:15:27 UTC 2018 x86_64 x86_64 x86_64 GNU/Linux
I have rebuilt HAProxy 1.9.9 with the suggested flag and can confirm that it seems to have resolved the extreme latency at the "tail end".
Two observations and follow-up questions, though: HAProxy 1.9 (we are running 1.9.9 right now) seems to have a slightly higher average CPU utilization (about 5-10% more than 1.8), and memory consumption seems about 10-15% higher. Would this be in line with expectations given the changes in 1.9?
Nothing immediately comes to mind. Maybe the idle connections to servers consume a part of this memory. It is also possible that, for scheduling reasons, some data sometimes lie a bit longer in buffers.
However, I'm very concerned about the fact that the flag has fixed your issue at such a low RPS. Or maybe the responses are huge, which could also explain the extremely high service times? Anyway, I really can't explain how a single syscall can induce 600 extra ms of service time; this doesn't make sense. I rather suspect that it has an indirect impact on how events are polled or reported, and could actually hide another issue. But I can't guess what :-/