Strange timeout behavior during load testing

I’ve been doing some load testing in anticipation of our busy season. I’m able to push 800 connections per second through to a single backend without much problem. When I add more servers to the backend pool and increase the load on the haproxy machine to 3000-4000 connections per second, haproxy starts marking my backends as down, even though I’ve increased backend capacity 10x and only increased the load 4x-5x.

Running: watch -n1 'curl --silent "http://localhost:9000/haproxy_stats;csv" | cut -d "," -f 2,5,37'

Shows the backends going from L7OK -> *L7OK (what does the * mean?) -> L4TOUT -> healthy again. The downstream server itself stays healthy the whole time, and TCP connections to it from another machine succeed without any issue.
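
For anyone reproducing this, field 37 in that CSV is the check status. A variant that also pulls the server status column and hides rows whose last check was a plain L7OK makes the flapping easier to spot (the field numbers are an assumption based on the 1.7 CSV layout; adjust if yours differs):

  # field numbers assume HAProxy 1.7's CSV layout: 2=svname, 18=status, 37=check_status
  watch -n1 'curl --silent "http://localhost:9000/haproxy_stats;csv" | cut -d "," -f 2,18,37 | grep -v ",L7OK$"'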

CPU on the haproxy machine (an r4.large EC2 instance) is fine, as is memory usage. The machine maxes out at 50 MBps of bandwidth, about a tenth of what we push in production, so I’m not bandwidth limited.

Any ideas why haproxy isn’t able to successfully complete a healthcheck? Any suggestions on debugging this?

Here is my haproxy.cfg:

global
  maxconn 60000
  user haproxy
  group haproxy
  daemon
  tune.ssl.default-dh-param 2048

defaults
  mode http
  retries 3
  option redispatch
  timeout connect 5000
  timeout client 50000
  timeout server 50000

resolvers dns
    nameserver dns 169.254.169.253:53
backend server_pool
  balance hdr(x-mi-cbe)
  option httpchk GET /?health=true

  server a 10.2.2.183:9292 check inter 5000
  server b 10.2.1.184:9292 check inter 5000
  server c 10.2.2.174:9292 check inter 5000
  server d 10.2.0.41:9292 check inter 5000
  server e 10.2.0.16:9292 check inter 5000
  server f 10.2.2.216:9292 check inter 5000
  server g 10.2.1.135:9292 check inter 5000
  server h 10.2.0.162:9292 check inter 5000
  server i 10.2.1.232:9292 check inter 5000
  server j 10.2.1.253:9292 check inter 5000
  server k 10.2.2.141:9292 check inter 5000

backend stats
  balance roundrobin
  stats enable
frontend cors_proxy_http
  bind 0.0.0.0:80 
  monitor-uri /haproxy?health
  default_backend server_pool
  option forwardfor
  option http-server-close
  maxconn 60000
  log /var/lib/haproxy/dev/log local2 debug
  option httplog
  option dontlognull

frontend stats
  bind 0.0.0.0:9000 
  monitor-uri /haproxy?health
  default_backend stats
  stats uri /haproxy_stats

Check iptables and the kernel logs. Maybe you hit a conntrack limit, either on the haproxy box or on the actual backend servers (or maybe you have per-source-IP limits deployed?).
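
For example, a quick sketch (the sysctl names assume the nf_conntrack module is loaded; the exact iptables rules depend on your setup):

  # tracked connections vs. the table limit (assumes the nf_conntrack module is loaded)
  sysctl net.netfilter.nf_conntrack_count net.netfilter.nf_conntrack_max
  # the kernel logs "nf_conntrack: table full, dropping packet" when the limit is hit
  dmesg | grep -i conntrack
  # any filtering or rate/connection limits in the path
  iptables -L -n -v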

If that’s not it, make sure haproxy logging works and post its output (you will see health check log messages with more detail). Also provide the output of haproxy -vv.
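
If logging isn’t wired up yet, a minimal rsyslog snippet along these lines is usually enough (a sketch, assuming rsyslog is your syslog daemon; the socket path matches the log directive in your config):

  # /etc/rsyslog.d/49-haproxy.conf (sketch; restart rsyslog afterwards)
  # imuxsock is usually loaded from the main rsyslog.conf; if not, add: $ModLoad imuxsock
  $AddUnixListenSocket /var/lib/haproxy/dev/log
  local2.* /var/log/haproxy.log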

Any issues without health checks?

We disabled conntrack’ing on the haproxy machine. There’s no packet loss on the backend machines (we need to handle around 2800 connections per second, and we’ve load tested them for that).

We don’t have iptables restrictions or per-source IP limits on either the haproxy machine or the backends.
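
(If anyone wants to double-check the same thing on their own boxes, something along these lines covers both; the exact rules to look for depend on the setup.)

  lsmod | grep -i conntrack    # no output if the conntrack modules aren't loaded
  iptables -t raw -L -n        # look for NOTRACK / CT --notrack rules here
  iptables -L -n -v | grep -i limit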

Here is haproxy -vv:

haproxy -vv -f /etc/haproxy/haproxy.cfg
HA-Proxy version 1.7.9-1ppa1~trusty 2017/08/19
Copyright 2000-2017 Willy Tarreau <willy@haproxy.org>

Build options :
  TARGET  = linux2628
  CPU     = generic
  CC      = gcc
  CFLAGS  = -g -O2 -fPIE -fstack-protector --param=ssp-buffer-size=4 -Wformat -Werror=format-security -D_FORTIFY_SOURCE=2
  OPTIONS = USE_GETADDRINFO=1 USE_ZLIB=1 USE_REGPARM=1 USE_OPENSSL=1 USE_LUA=1 USE_PCRE=1 USE_NS=1

Default settings :
  maxconn = 2000, bufsize = 16384, maxrewrite = 1024, maxpollevents = 200

Encrypted password support via crypt(3): yes
Built with zlib version : 1.2.8
Running on zlib version : 1.2.8
Compression algorithms supported : identity("identity"), deflate("deflate"), raw-deflate("deflate"), gzip("gzip")
Built with OpenSSL version : OpenSSL 1.0.1f 6 Jan 2014
Running on OpenSSL version : OpenSSL 1.0.1f 6 Jan 2014
OpenSSL library supports TLS extensions : yes
OpenSSL library supports SNI : yes
OpenSSL library supports prefer-server-ciphers : yes
Built with PCRE version : 8.31 2012-07-06
Running on PCRE version : 8.31 2012-07-06
PCRE library supports JIT : no (USE_PCRE_JIT not set)
Built with Lua version : Lua 5.3.1
Built with transparent proxy support using: IP_TRANSPARENT IPV6_TRANSPARENT IP_FREEBIND
Built with network namespace support

Available polling systems :
      epoll : pref=300,  test result OK
       poll : pref=200,  test result OK
     select : pref=150,  test result OK
Total: 3 (3 usable), will use epoll.

Available filters :
	[COMP] compression
	[TRACE] trace
	[SPOE] spoe

No other issues besides the health checks (which cause decreased throughput).

So with the health checks disabled, performance is fine?

Like I said, post the haproxy log; it will contain the reason the health check failed.

The log is very verbose because of the traffic I’m putting through it. What can I grep for? Is there anything I need to do to enable health check logging? I haven’t seen any health check messages.

Health checks are always logged; there’s no need to enable anything.

If it’s too much output, disable everything else in the frontend:

option dontlog-normal
option dontlognull
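
The state changes are then easy to pull out; something like this works (assuming local2 ends up in /var/log/haproxy.log as in the rsyslog sketch above, otherwise adjust the path):

  # state-change messages include the reason the check failed
  grep -E "is DOWN|is UP|Health check" /var/log/haproxy.log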

Ok, thanks Lukas! I’ll spin my load testing stack back up and report back.

@lukastribus sorry for the delay. I turned on option dontlog-normal, but I still get pretty noisy logs, and I don’t see any health check messages. Here’s an example of a log line I do see (which looks normal to me: a 200 status code).

Oct 5 14:48:24 colony haproxy[6307]: 10.2.1.112:59228 [05/Oct/2017:14:48:17.852] frontend backend/A 0/0/6031/282/6411 200 968 - - ---- 1805/1805/1356/99/1 0/0 {bar} "GET /www.google.com?foo=12868837 HTTP/1.1"

(I replaced the names of some of the servers).
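
For reference for anyone reading, the five slash-separated timers in that line are Tq/Tw/Tc/Tr/Tt from the 1.7 httplog format, so 0/0/6031/282/6411 breaks down as:

  Tq = 0     ms to receive the full request from the client
  Tw = 0     ms spent waiting in queues
  Tc = 6031  ms to establish the TCP connection to the backend server
  Tr = 282   ms for the server to send the full response headers
  Tt = 6411  ms total session duration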