Haproxy session rate slower than single web server

I set up haproxy (1.6.3) on Ubuntu 16.04 to load-balance two web servers. From my earlier tests, each web server can handle over 20k requests/s; I tested them with wrk2 and verified the request counts in the logs. However, with haproxy in front of the web servers, the rate seems to be limited to about 6k requests/s. Is there anything wrong in my haproxy config?

haproxy.cfg

global
    log /dev/log    local0
    log /dev/log    local1 notice
    chroot /var/lib/haproxy
    stats socket /run/haproxy/admin.sock mode 660 level admin
    stats timeout 30s
    maxconn     102400
    user haproxy
    group haproxy
    daemon

    # Default SSL material locations
    ca-base /etc/ssl/certs
    crt-base /etc/ssl/private

    # Default ciphers to use on SSL-enabled listening sockets.
    # For more information, see ciphers(1SSL). This list is from:
    # https://hynek.me/articles/hardening-your-web-servers-ssl-ciphers/
    ssl-default-bind-ciphers ECDH+AESGCM:DH+AESGCM:ECDH+AES256:DH+AES256:ECDH+AES128:DH+AES:ECDH+3DES:DH+3DES:RSA+AESGCM:RSA+AES:RSA+3DES:!aNULL:!MD5:!DSS
    ssl-default-bind-options no-sslv3

defaults
    log    global
    mode    http
    option    httplog
    option    dontlognull
    # https://serverfault.com/questions/504308/by-what-criteria-do-you-tune-timeouts-in-ha-proxy-config
    timeout connect 5000
    timeout check 5000
    timeout client  30000
    timeout server  30000
    timeout tunnel  3600s
    errorfile 400 /etc/haproxy/errors/400.http
    errorfile 403 /etc/haproxy/errors/403.http
    errorfile 408 /etc/haproxy/errors/408.http
    errorfile 500 /etc/haproxy/errors/500.http
    errorfile 502 /etc/haproxy/errors/502.http
    errorfile 503 /etc/haproxy/errors/503.http
    errorfile 504 /etc/haproxy/errors/504.http

listen web-test
    mode http
    bind *:80
    balance roundrobin
    option forwardfor
    option http-keep-alive  # connections will no longer be closed after each request
    server test1 SERVER1:80 check maxconn 20000
    server test2 SERVER2:80 check maxconn 20000

If I run wrk with 3 instances, each gets approximately the same result:

./wrk -t4 -c100 -d30s -R4000 http://HAPROXY/
Running 30s test @ http://HAPROXY/
  4 threads and 100 connections
  Thread calibration: mean lat.: 1577.987ms, rate sampling interval: 7139ms
  Thread calibration: mean lat.: 1583.182ms, rate sampling interval: 7180ms
  Thread calibration: mean lat.: 1587.795ms, rate sampling interval: 7167ms
  Thread calibration: mean lat.: 1583.128ms, rate sampling interval: 7147ms
  Thread Stats   Avg      Stdev     Max   +/- Stdev
    Latency     8.98s     2.67s   13.93s    58.43%
    Req/Sec   516.75     11.28   529.00     87.50%
  64916 requests in 30.00s, 51.69MB read
Requests/sec:   2163.75    # Requests/sec decrease slightly
Transfer/sec:      1.72MB

Stats from haproxy:

If running wrk with 1 instance to one of the web server without haproxy:

./wrk -t4 -c100 -d30s -R4000 http://SERVER1
Running 30s test @ http://SERVER1
  4 threads and 100 connections
  Thread calibration: mean lat.: 1.282ms, rate sampling interval: 10ms
  Thread calibration: mean lat.: 1.363ms, rate sampling interval: 10ms
  Thread calibration: mean lat.: 1.380ms, rate sampling interval: 10ms
  Thread calibration: mean lat.: 1.351ms, rate sampling interval: 10ms
  Thread Stats   Avg      Stdev     Max   +/- Stdev
    Latency     1.41ms    0.97ms  22.42ms   96.48%
    Req/Sec     1.05k   174.27     2.89k    86.01%
  119809 requests in 30.00s, 98.15MB read
Requests/sec:   3993.36     # Requests/sec is about 4k
Transfer/sec:      3.27MB
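As a sanity check on the wrk summaries above, the Requests/sec line is just the request count divided by the test duration (the small differences from wrk's printed values come from the actual run lasting slightly more than 30 s):

```python
# wrk's summary rate is simply total requests / wall-clock duration.
def req_per_sec(total_requests: int, duration_s: float) -> float:
    return total_requests / duration_s

print(f"{req_per_sec(64916, 30.0):.2f}")    # via haproxy: 2163.87
print(f"{req_per_sec(119809, 30.0):.2f}")   # direct to SERVER1: 3993.63
```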

haproxy -vv
HA-Proxy version 1.6.3 2015/12/25
Copyright 2000-2015 Willy Tarreau willy@haproxy.org

Build options :
  TARGET  = linux2628
  CPU     = generic
  CC      = gcc
  CFLAGS  = -g -O2 -fstack-protector-strong -Wformat -Werror=format-security -Wdate-time -D_FORTIFY_SOURCE=2
  OPTIONS = USE_ZLIB=1 USE_REGPARM=1 USE_OPENSSL=1 USE_LUA=1 USE_PCRE=1

Default settings :
  maxconn = 2000, bufsize = 16384, maxrewrite = 1024, maxpollevents = 200

Encrypted password support via crypt(3): yes
Built with zlib version : 1.2.8
Compression algorithms supported : identity("identity"), deflate("deflate"), raw-deflate("deflate"), gzip("gzip")
Built with OpenSSL version : OpenSSL 1.0.2g-fips  1 Mar 2016
Running on OpenSSL version : OpenSSL 1.0.2g  1 Mar 2016
OpenSSL library supports TLS extensions : yes
OpenSSL library supports SNI : yes
OpenSSL library supports prefer-server-ciphers : yes
Built with PCRE version : 8.38 2015-11-23
PCRE library supports JIT : no (USE_PCRE_JIT not set)
Built with Lua version : Lua 5.3.1
Built with transparent proxy support using: IP_TRANSPARENT IPV6_TRANSPARENT IP_FREEBIND

Available polling systems :
      epoll : pref=300,  test result OK
       poll : pref=200,  test result OK
     select : pref=150,  test result OK
Total: 3 (3 usable), will use epoll.


I know that ab is not a very precise way to test this, but I thought haproxy would give a better result than a single node. The results show the opposite.

ab test HAPROXY

ab -n 10000 -c 10 http://HAPROXY/
Requests per second:    4276.18 [#/sec] (mean)

ab test SERVER1

ab -n 10000 -c 10 http://SERVER1/
Requests per second:    9392.66 [#/sec] (mean)

ab test SERVER2

ab -n 10000 -c 10 http://SERVER2/
Requests per second:    8513.28 [#/sec] (mean)

The VM is single-core, so there is no need to use nbproc. Also, I monitored CPU and memory usage: all VMs use less than 30% CPU and 20% memory. There must be something wrong with my haproxy config or my system config.

You didn’t set a maxconn value in the listen section, therefore it’s capped at the default of 2000 (the servers within the listen section are allowed 20000 each, but the entire listen section is limited to 2000).

Please set maxconn to something like 20000 or 40000 in the listen section.
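One way to see the effective cap is the stats socket: `show stat` prints a CSV in which column 7 (`slim`) is the session limit for each proxy. Against a live instance you would pipe the command through the admin socket from the config above (requires socat); the awk part below just demonstrates where the cap shows up, using a shortened sample FRONTEND line:

```shell
# Live query (needs the admin stats socket configured, plus socat):
#   echo "show stat" | socat stdio /run/haproxy/admin.sock
# Parsing a sample line of that CSV; field 7 ("slim") is the session limit:
printf '%s\n' 'web-test,FRONTEND,0,0,12,40,2000,64916,...' |
  awk -F, '{print $1 " slim=" $7}'
# prints: web-test slim=2000   (the default cap when maxconn is unset)
```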

I added maxconn to the listen section, and I now get about the same performance from haproxy as from a single server.

listen web-test
    maxconn 40000 
    mode http
    bind *:80
    balance roundrobin
    option forwardfor
    option http-keep-alive  # connections will no longer be closed after each request
    server test1 SERVER1:80 check maxconn 20000
    server test2 SERVER2:80 check maxconn 20000

However, I am curious about the performance here. Assuming haproxy has enough CPU and memory, shouldn’t I expect twice the qps with two servers behind haproxy, and roughly four times the qps with four servers? Is there any rule of thumb, or are there benchmark results on this topic?

You should. However, you are on a VM; any kind of serious benchmarking is impossible there anyway, and you don’t know what limits your benchmark tools themselves have.

For any serious testing:

  • you need physical boxes, not VMs
  • the client and the server need to run on different boxes, or at least be pinned to specific, different cores
  • you need to use keep-alive mode, otherwise connection setup becomes too expensive to compare against direct backend performance (for ab it’s -k, iirc)
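For ab specifically, the keep-alive flag is indeed `-k`; a quick way to gauge the connection-setup tax is to run the same test both ways (HAPROXY is a placeholder for the proxy’s address, as elsewhere in this thread):

```shell
# New TCP connection for every request (connection setup dominates):
ab -n 10000 -c 10 http://HAPROXY/
# Reuse connections via HTTP keep-alive, comparable to the keep-alive
# mode configured in the haproxy listen section:
ab -n 10000 -c 10 -k http://HAPROXY/
```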

Though I say VM, I actually mean EC2 instances on AWS. These are all separate machines on EC2, and I still expect better performance with haproxy in front of multiple web servers. You mentioned that I should try keep-alive mode when testing with ab, but the result is about the same: I still get the same performance from haproxy as from a single web server. Does it make sense to test performance by serving only a static HTML file?

Of course, and I can confirm that haproxy can handle the load. 10 years ago haproxy 1.4-dev handled 40000 connections per second [1].

The way you describe the issue makes me think that your backend itself would be able to handle more connections as well, and that some chokepoint limits both backend and haproxy performance.

Now, this could be:

  • iptables on any of those instances
  • any middle-boxes in between (is there a NAT layer? any firewalling?)
  • the actual network
  • interrupt/CPU load on the instance or hypervisor
  • the ab and wrk tools themselves

I’d suggest spawning a new EC2 instance and running the test client on both EC2 “client” instances simultaneously. If you see the same aggregated load, then at least you know the problem is not your test method/tool.
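A sketch of that two-client run, assuming SSH access to the second instance and wrk installed on both (CLIENT2 and HAPROXY are placeholder addresses):

```shell
# Drive the same load from two client instances at once, then compare the
# aggregate of the two Requests/sec lines against a single client's result.
./wrk -t4 -c100 -d30s -R3000 http://HAPROXY/ > client1.log &
ssh CLIENT2 "./wrk -t4 -c100 -d30s -R3000 http://HAPROXY/" > client2.log &
wait
grep '^Requests/sec' client1.log client2.log
```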

A single html file should be fine to test, as long as you are aware of the connection setup tax.

I’d also suggest you try Willy’s inject tool (instead of ab); you can clone it with:
git clone http://git.1wt.eu/git/inject.git/

[1] http://www.haproxy.org/10g.html

First of all, I switched to the latest haproxy, 1.8.3.

haproxy.cfg

global
        log /dev/log    local0
        log /dev/log    local1 notice
        chroot /var/lib/haproxy
        stats socket /run/haproxy/admin.sock mode 660 level admin expose-fd listeners
        stats timeout 30s
        user haproxy
        group haproxy
        daemon
        maxconn 100000

        ca-base /etc/ssl/certs
        crt-base /etc/ssl/private

        ssl-default-bind-ciphers ECDH+AESGCM:DH+AESGCM:ECDH+AES256:DH+AES256:ECDH+AES128:DH+AES:RSA+AESGCM:RSA+AES:!aNULL:!MD5:!DSS
        ssl-default-bind-options no-sslv3

defaults
        log     global
        mode    http
        option  httplog
        option  dontlognull
        timeout connect 5000
        timeout client  50000
        timeout server  50000
        errorfile 400 /etc/haproxy/errors/400.http
        errorfile 403 /etc/haproxy/errors/403.http
        errorfile 408 /etc/haproxy/errors/408.http
        errorfile 500 /etc/haproxy/errors/500.http
        errorfile 502 /etc/haproxy/errors/502.http
        errorfile 503 /etc/haproxy/errors/503.http
        errorfile 504 /etc/haproxy/errors/504.http

listen web-test
    maxconn 40000
    mode http
    bind *:80
    balance roundrobin
    option forwardfor
    option http-keep-alive  # connections will no longer be closed after each request
    server w1 172.31.19.219:80 check maxconn 20000
    server w2 172.31.23.64:80 check maxconn 20000

listen stats
    bind *:9000  # Listen on port 9000 (all interfaces)
    mode http
    stats enable  # Enable stats page
    stats hide-version  # Hide HAProxy version
    stats realm Haproxy\ Statistics  # Title text for popup window
    stats uri /haproxy_stats  # Stats URI

haproxy -vv

HA-Proxy version 1.8.3-1ppa1~xenial 2018/01/02
Copyright 2000-2017 Willy Tarreau <willy@haproxy.org>

Build options :
  TARGET  = linux2628
  CPU     = generic
  CC      = gcc
  CFLAGS  = -g -O2 -fPIE -fstack-protector-strong -Wformat -Werror=format-security -Wdate-time -D_FORTIFY_SOURCE=2
  OPTIONS = USE_GETADDRINFO=1 USE_ZLIB=1 USE_REGPARM=1 USE_OPENSSL=1 USE_LUA=1 USE_SYSTEMD=1 USE_PCRE=1 USE_PCRE_JIT=1 USE_NS=1

Default settings :
  maxconn = 2000, bufsize = 16384, maxrewrite = 1024, maxpollevents = 200

Built with OpenSSL version : OpenSSL 1.0.2g  1 Mar 2016
Running on OpenSSL version : OpenSSL 1.0.2g  1 Mar 2016
OpenSSL library supports TLS extensions : yes
OpenSSL library supports SNI : yes
OpenSSL library supports : TLSv1.0 TLSv1.1 TLSv1.2
Built with Lua version : Lua 5.3.1
Built with transparent proxy support using: IP_TRANSPARENT IPV6_TRANSPARENT IP_FREEBIND
Encrypted password support via crypt(3): yes
Built with multi-threading support.
Built with PCRE version : 8.38 2015-11-23
Running on PCRE version : 8.38 2015-11-23
PCRE library supports JIT : yes
Built with zlib version : 1.2.8
Running on zlib version : 1.2.8
Compression algorithms supported : identity("identity"), deflate("deflate"), raw-deflate("deflate"), gzip("gzip")
Built with network namespace support.

Available polling systems :
      epoll : pref=300,  test result OK
       poll : pref=200,  test result OK
     select : pref=150,  test result OK
Total: 3 (3 usable), will use epoll.

Available filters :
	[SPOE] spoe
	[COMP] compression
	[TRACE] trace
Regarding the possible chokepoints:

  • no iptables, not using it
  • no firewall, no NAT layer for now
  • ping times are small

I tried inject, but I cannot understand why it soon gets stuck at 200 maxobj. I changed maxobj because it seems to be related to the session rate I get.

./inject -H "Host: www" -T 1000 -G "HAIP:80/" -o 200 -u 1

   hits ^hits hits/s  ^h/s     bytes  kB/s  last  errs  tout htime  sdht ptime
    200   200    195   195    169000   165   165     0     0 322.4 20.9 360.0
    400   200    200   204    338000   169   172     0     0 310.3 47.5 372.0
    404     4    134     4    341380   113     3     0     0 237.2 37.2 372.0
    600   196    150   196    507000   126   165     0     0 325.0 20.1 363.0
    800   200    160   200    676000   135   169     0     0 311.4 36.2 365.0
   1000   200    166   200    845000   140   169     0     0 309.4 43.9 363.0
   1000     0    142     0    845000   120     0     0     0 0.0 0.0 0.0
   1200   200    150   200   1014000   126   169     0     0 319.8 28.0 363.0
   1400   200    155   200   1183000   131   169     0     0 321.1 29.4 364.0
   1600   200    160   200   1352000   135   169     0     0 325.2 26.9 375.0
   1600     0    145     0   1352000   122     0     0     0 0.0 0.0 0.0
   1800   200    150   200   1521000   126   169     0     0 312.8 40.6 363.0
   2000   200    153   200   1690000   130   169     0     0 320.5 31.5 364.0
   2162   162    154   162   1826890   130   136     0     0 310.2 16.8 364.0
   2200    38    146    37   1859000   123    32     0     0 350.7 3.7 360.0

For anyone who wants to try it: you might have to modify the Makefile for your platform. If you happen to use Ubuntu, you can change it like this:

CPU      = -march=x86-64   
CPU_OPTS = -mpreferred-stack-boundary=8 -falign-functions=1 -falign-loops=1 -falign-jumps=1

I switched back to wrk to test a single web server again. I launched multiple clients, and some clients do not reach the expected 3k req/s.

./wrk -t4 -c100 -d30s -R3000 http://WEBSERVER      # 4 threads, 100 conns, 30 s, 3000 req/s
Requests/sec:   2995.41

I added clients one at a time to keep every client sending 3k req/s, and it now looks like a single web server can handle 17k req/s. If I add more clients, performance starts to degrade, so I take 17k req/s as the baseline for a single web server. (Though I later realized that by keeping more connections open, I might reach a higher rate.)

I started testing haproxy with multiple clients again, and the more tests I ran, the more confused I got about the results.

haproxy cpu usage about 60%
./wrk -t4 -c100 -d30s -R3000 http://HAPROXY # with 10 clients, about 0.8 - 2.0k req/s

haproxy cpu usage about 50%
./wrk -t4 -c200 -d30s -R3000 http://HAPROXY # with 4 clients, about 2.4-2.6k req/s

haproxy cpu usage about 50-55%
./wrk -t4 -c200 -d30s -R5000 http://HAPROXY # with 3 clients, about 3.2-3.4k req/s

haproxy cpu usage about 60%
./wrk -t4 -c300 -d30s -R4000 http://HAPROXY # with 3 clients, about 3.6-3.8k req/s

I tried tweaking the number of open connections, since it seems to behave like keep-alive and correlates with the current conns on the haproxy stats page. I thought I could get a higher session rate by keeping more connections open, but the results were not what I expected.
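Connection count and achievable rate are tied together by Little's law: for a fixed pool of connections, sustained throughput is bounded by open connections divided by mean response time (note wrk2 is open-loop, so its corrected latencies include queueing delay). A quick illustration using the direct-to-backend latency measured above:

```python
import math

# Little's law for a closed pool of connections:
#   throughput <= open_connections / mean_latency
def max_throughput(open_conns: int, mean_latency_s: float) -> float:
    return open_conns / mean_latency_s

# 100 connections at ~1.4 ms per request leave plenty of headroom:
print(round(max_throughput(100, 0.0014)))   # ~71429 req/s ceiling

# Conversely, sustaining 17k req/s at 1.4 ms needs only a couple dozen
# connections -- so if adding connections does not raise the rate, the
# bottleneck is rising latency (queueing), not the connection count:
print(math.ceil(17000 * 0.0014))            # 24 connections
```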

Apart from the userspace CPU load due to haproxy, what system load do you see, and how much CPU is really “free”? Usually system CPU load accounts for quite a share of the work the CPU has to do.

I don’t have a perfect explanation for you here, but not testing on physical boxes is a major contributing factor to performance issues like this one. You don’t know when and how Amazon rate-limits or slows things down (CPU, RAM, network bandwidth, etc.).

I wasn’t paying much attention to system load during the tests, so I guess I will just do more tests later. Thanks for the help!