HAProxy - Can't get more than 48,000 queries per second

We are trying to set up load-balanced servers so we can spread the load over many servers as we grow our app. Our software is written in Go and listens as a web server.

We created a simple PING/PONG server in Go to see the maximum number of requests each server can handle. Yes, we understand that once database access and all of the other processing code are added, a single server will not reach the same number of transactions per second. This is merely test code to ensure consistent speed on each box and to exclude any outside latency sources.
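Conceptually, the test server is no more complicated than the sketch below (our real code is not shown here; the handler path matches the /ping endpoint used in the tests later in this thread, and the PONG body is only illustrative):

package main

import (
    "fmt"
    "log"
    "net/http"
)

func main() {
    // Minimal PING/PONG handler: no database, no business logic,
    // just a constant response so we can measure raw request throughput.
    http.HandleFunc("/ping", func(w http.ResponseWriter, r *http.Request) {
        fmt.Fprint(w, "PONG")
    })
    log.Fatal(http.ListenAndServe(":80", nil))
}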

We have a 2U physical box with 64 GB of RAM and two Xeon 4110 processors (8 cores / 16 threads each). Our Internet connection is 1 Gb fiber, and all servers are connected internally on a virtual LAN, so latency shouldn't be an issue.

We have three Go servers set up, each in a CentOS 7 VM with 4 GB of RAM and 4 CPUs.

We can achieve 47,000 queries per second on each box. We would assume that with HAProxy in front of the boxes we should be able to hit approximately 140,000 qps across the 3 servers combined.

We are using TCP mode in HAProxy, as it appears to be much faster than HTTP mode. We do not need to route based on URL, so TCP mode seems the better choice at the moment.

The 47,000 qps figure was measured with loader.io, ramping from 2,000 to 4,000 clients over a 1-minute period. Each server handles roughly the same qps on average.

We set up HAProxy to connect to one server only, to see what the speed was with it in the middle. We saw about 46,000 qps when hitting that one server.

We added 2 more servers for a total of 3 behind HAProxy, and there was no increase in qps; however, the load was spread across all 3 servers, as shown by watching htop on all machines. A total of 46,000 qps was all it would reach.

The HAProxy server was set up with 8 CPUs and 16 GB of RAM and was NOT maxing out on CPU according to htop.

There is one external IP address coming into HAProxy, and the backends are reached via internal 10.x.x.x IP addresses, one per box. Each backend server also has an external IP address, which we used to test the speed of each server individually and confirm they all handled ~47,000 qps.

We did increase loader.io to ramp from 2,000 to 8,000 clients per second aimed at the HAProxy server to throw more load at it; however, it didn't increase the qps at all.

It appears we have enough CPU, RAM, and Internet bandwidth to handle the simple ping/pong requests.

Is there a maximum that HAProxy can process per external IP due to port exhaustion? We did increase the ephemeral port range on the server to 1024-65000.

Watching ss -s shows no more than 10,000 ports in use at peak, and very few sockets in any wait state, as we have set the server to reuse TCP connections and reduced the FIN timeout.
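For reference, the tuning referred to above was roughly along these lines; the port range is the one we actually set, while the exact FIN timeout value shown here is only illustrative:

    sysctl -w net.ipv4.ip_local_port_range="1024 65000"   # widen the ephemeral port range
    sysctl -w net.ipv4.tcp_tw_reuse=1                     # reuse TIME-WAIT sockets for new outbound connections
    sysctl -w net.ipv4.tcp_fin_timeout=15                 # reduce the FIN timeout (kernel default is 60)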

We are simply trying to have a front-end web proxy that can handle a lot of traffic and pass it off to the backend servers to process.

Our real Go code currently runs at about 10,000 qps per VM, so to achieve the 140,000 qps in the example above we would need 14 VMs. The goal is to be able to handle a lot more than 140,000 and simply add more backend servers as the load increases.

Provide the output of haproxy -vv and the configuration.

Forget loader.io for a moment. What performance do you reach through haproxy locally, without your Internet connection, firewall or NAT gateway?

Here is the config:

global
    pidfile     /var/run/rh-haproxy18-haproxy.pid
    maxconn     1000000
    user        haproxy
    group       haproxy
    daemon
    stats socket /var/opt/rh/rh-haproxy18/lib/haproxy/stats
    nbproc 8

defaults
    mode                    http
    retries                 3
    timeout http-request    10s
    timeout queue           1m
    timeout connect         10s
    timeout client          1m
    timeout server          1m
    timeout http-keep-alive 10s
    timeout check           10s
    maxconn                 500000

frontend main
    bind *:80
    mode tcp
    maxconn 500000
    default_backend             app

backend app
    balance     roundrobin
    mode tcp
    maxconn 500000
    server  app1 10.0.50.9:80 check
    server  app2 10.0.50.91:80 check
    server  app3 10.0.50.92:80 check

Here is the output of haproxy -vv:

HA-Proxy version 1.8.4-1deb90d 2018/02/08
Copyright 2000-2018 Willy Tarreau <willy@haproxy.org>

Build options :
  TARGET  = linux2628
  CPU     = generic
  CC      = gcc
  CFLAGS  = -O2 -g -fno-strict-aliasing -Wdeclaration-after-statement -fwrapv -Wno-unused-label
  OPTIONS = USE_LINUX_TPROXY=1 USE_ZLIB=1 USE_REGPARM=1 USE_OPENSSL=1 USE_SYSTEMD=1 USE_PCRE=1

Default settings :
  maxconn = 2000, bufsize = 16384, maxrewrite = 1024, maxpollevents = 200

Built with OpenSSL version : OpenSSL 1.0.2k-fips  26 Jan 2017
Running on OpenSSL version : OpenSSL 1.0.2k-fips  26 Jan 2017
OpenSSL library supports TLS extensions : yes
OpenSSL library supports SNI : yes
OpenSSL library supports : SSLv3 TLSv1.0 TLSv1.1 TLSv1.2
Built with transparent proxy support using: IP_TRANSPARENT IPV6_TRANSPARENT IP_FREEBIND
Encrypted password support via crypt(3): yes
Built with multi-threading support.
Built with PCRE version : 8.32 2012-11-30
Running on PCRE version : 8.32 2012-11-30
PCRE library supports JIT : no (USE_PCRE_JIT not set)
Built with zlib version : 1.2.7
Running on zlib version : 1.2.7
Compression algorithms supported : identity("identity"), deflate("deflate"), raw-deflate("deflate"), gzip("gzip")
Built with network namespace support.

Available polling systems :
      epoll : pref=300,  test result OK
       poll : pref=200,  test result OK
     select : pref=150,  test result OK
Total: 3 (3 usable), will use epoll.

Available filters :
	[SPOE] spoe
	[COMP] compression
	[TRACE] trace

I posted the -vv and the config.

What did you want me to use to load test locally?

There are several that can be used.

Which tester would you like me to run, and with which command-line options, so I can get you exactly what you are looking for?

You are saying you reach 47k qps without haproxy on a single backend app server. With haproxy, you reach less, even when you load-balance across 3 app servers.

It seems to me like you have a chokepoint in front of haproxy. Do you have a firewall or NAT gateway in front of it?

What exactly is a query for you? Is loader.io keep-aliving multiple requests in a single TCP session, or does it establish a new TCP session for every query?
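To illustrate the distinction in Go terms, here is a purely hypothetical client sketch (the target URL is a placeholder): with keep-alive, many queries reuse one TCP connection; without it, every query pays for its own TCP handshake and leaves a socket in TIME-WAIT behind.

package main

import (
    "fmt"
    "net/http"
)

func main() {
    // Default client: HTTP keep-alive is on, so requests reuse established TCP connections.
    keepAlive := &http.Client{}

    // Keep-alive disabled: every request opens and closes its own TCP connection.
    noKeepAlive := &http.Client{Transport: &http.Transport{DisableKeepAlives: true}}

    for _, c := range []*http.Client{keepAlive, noKeepAlive} {
        resp, err := c.Get("http://backend.example/ping") // placeholder target
        if err != nil {
            fmt.Println(err)
            continue
        }
        resp.Body.Close()
    }
}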

Try Apache ab or wrk. I don't have a specific recommendation for you; all I can give you is general advice.
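For example, a run along these lines, with the thread count, connection count, duration and target URL adjusted to your setup:

    wrk -t4 -c400 -d15s http://<haproxy-or-backend-ip>/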

Appreciate the assistance.

There is a firewall running at the entrance to the server. No NAT. Each VM has a firewall as well.

I tested with wrk both with and without the firewall, and the speed was about the same; no major difference.

The wrk info below is with no firewall running on any of the VMs that we are using.

This is wrk run from the HAProxy VM to one of the VMs residing on the same machine:

Running 15s test @ http://10.0.50.91/ping
  4 threads and 400 connections
  Thread Stats   Avg      Stdev     Max   +/- Stdev
    Latency     4.07ms    1.77ms  36.10ms   80.80%
    Req/Sec    24.98k     2.94k   32.67k    72.67%
  1490856 requests in 15.02s, 170.61MB read
Requests/sec:  99257.80
Transfer/sec:     11.36MB

Here is wrk against another VM on the same machine, this one with 8 CPUs, just to see what happens with increased CPU on a backend:

Running 15s test @ http://10.0.50.9/ping
  4 threads and 400 connections
  Thread Stats   Avg      Stdev     Max   +/- Stdev
    Latency     3.22ms    1.72ms  39.76ms   85.73%
    Req/Sec    32.07k     3.96k   55.28k    71.83%
  1914132 requests in 15.05s, 219.06MB read
Requests/sec: 127226.82
Transfer/sec:     14.56MB

Here is a third wrk run from the HAProxy VM, but this backend VM is on another physical machine right next to it, connected with a direct SFP cable. This one shows lower RPS, probably because it is on a different machine and latency is slightly higher.

Running 15s test @ http://10.0.50.92/ping
  4 threads and 400 connections
  Thread Stats   Avg      Stdev     Max   +/- Stdev
    Latency     5.24ms    2.28ms  39.29ms   81.37%
    Req/Sec    19.33k     1.87k   24.62k    67.17%
  1154036 requests in 15.03s, 132.07MB read
Requests/sec:  76805.71
Transfer/sec:      8.79MB

We went ahead and installed the backend app on the HAProxy VM itself, just to make sure comparable connection rates can reach that VM, and we saw about 125k/sec, as we did on the backend server we bumped to 8 CPUs for testing. We ran that test from the 10.0.50.9 VM aimed at the HAProxy VM on 10.0.50.90, so a direct connection to it gets good results.

Now here is wrk against HAProxy itself, which then load-balances to the 3 backend servers:

Running 15s test @ http://10.0.50.90/ping
  4 threads and 400 connections
  Thread Stats   Avg      Stdev     Max   +/- Stdev
    Latency     5.63ms    2.04ms  44.89ms   83.39%
    Req/Sec    17.97k     3.10k   26.05k    66.50%
  1072975 requests in 15.03s, 122.79MB read
Requests/sec:  71395.52
Transfer/sec:      8.17MB

HAProxy only pushes out about 74k/sec, and watching htop on the 3 backend VMs, they are nowhere near capacity; they are just waiting for more requests.

A direct connection to a single VM is faster than going through HAProxy balancing across them.

Theoretically, we should be able to get 99k + 99k + 75k from the 2 servers on the same machine plus the 1 server over the SFP cable on the other machine.

We are not getting anywhere near that; in fact, we are getting less than a single backend delivers on its own.

Thoughts?

As far as we know, loader.io does not keep any connections open. It should be putting real-life load on the servers from various IP addresses.