I’ve been doing some load testing in anticipation of our busy season. I’m able to push 800 connections per second to a single backend without much problem. When I add additional servers to the backend pool and increase the load on the haproxy machine to 3000-4000 connections per second, haproxy starts marking my backends as down, even though I’ve increased my backend capacity 10x and only increased my load 4x-5x.
Running: watch -n1 'curl --silent "http://localhost:9000/haproxy_stats;csv" | cut -d "," -f 2,5,37'
Shows backends going from L7OK -> *L7OK (what does the * mean?) -> L4TOUT -> healthy. The downstream server is always healthy, and TCP connections from another machine succeed without any issue.
CPU on the haproxy machine I’m using (an r4.large EC2 instance) is fine, as is memory usage. The machine is maxing out at 50 MBps of bandwidth, which is 1/10th of what we push in production, so I’m not bandwidth limited.
Any ideas why haproxy isn’t able to successfully complete a health check? Any suggestions on debugging this?
Here is my haproxy.cfg:
global
    maxconn 60000
    user haproxy
    group haproxy
    daemon
    tune.ssl.default-dh-param 2048

defaults
    mode http
    retries 3
    option redispatch
    timeout connect 5000
    timeout client 50000
    timeout server 50000

resolvers dns
    nameserver dns 169.254.169.253:53

backend server_pool
    balance hdr(x-mi-cbe)
    option httpchk GET /?health=true
    server a 10.2.2.183:9292 check inter 5000
    server b 10.2.1.184:9292 check inter 5000
    server c 10.2.2.174:9292 check inter 5000
    server d 10.2.0.41:9292 check inter 5000
    server e 10.2.0.16:9292 check inter 5000
    server f 10.2.2.216:9292 check inter 5000
    server g 10.2.1.135:9292 check inter 5000
    server h 10.2.0.162:9292 check inter 5000
    server i 10.2.1.232:9292 check inter 5000
    server j 10.2.1.253:9292 check inter 5000
    server k 10.2.2.141:9292 check inter 5000

backend stats
    balance roundrobin
    stats enable

frontend cors_proxy_http
    bind 0.0.0.0:80
    monitor-uri /haproxy?health
    default_backend server_pool
    option forwardfor
    option http-server-close
    maxconn 60000
    log /var/lib/haproxy/dev/log local2 debug
    option httplog
    option dontlognull

frontend stats
    bind 0.0.0.0:9000
    monitor-uri /haproxy?health
    default_backend stats
    stats uri /haproxy_stats
Check iptables and kernel logs. Maybe you hit a conntrack limit, either on the haproxy box or on the actual backend server (maybe you have per-source-IP limits deployed?).
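A quick way to check (a rough sketch, assuming a standard Linux box with the nf_conntrack module loaded; sysctl names and log locations can vary by distro):

# current vs. maximum tracked connections
sysctl net.netfilter.nf_conntrack_count net.netfilter.nf_conntrack_max
# the kernel logs "nf_conntrack: table full, dropping packet" when the table overflows
dmesg | grep -i conntrack
# look through the loaded rules for per-source limits (connlimit, hashlimit, etc.)
iptables -S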
If that’s not it, make sure haproxy logging works and post its output (you will see health check log messages with more details). Also provide the output of “haproxy -vv”.
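If logging isn’t wired up yet, a minimal setup looks roughly like this (just a sketch; the socket path and facility depend on your syslog daemon, and you still need a matching syslog rule writing that facility to a file):

global
    log /dev/log local0

defaults
    log global
    option httplog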
Any issues without health checks?
We disabled conntrack’ing on the haproxy machine. There’s no packet loss on the backend machines (we need to get to around 2800 connections per second, and we’ve load tested them for that).
We don’t have iptables restrictions or per-source IP limits on either the haproxy machine or the backends.
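(For context: disabling connection tracking is typically done with NOTRACK rules in the raw table, roughly like the sketch below; this is a generic example, not our exact rules.)

iptables -t raw -A PREROUTING -j NOTRACK
iptables -t raw -A OUTPUT -j NOTRACK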
Here is haproxy -vv:
haproxy -vv -f /etc/haproxy/haproxy.cfg
HA-Proxy version 1.7.9-1ppa1~trusty 2017/08/19
Copyright 2000-2017 Willy Tarreau <willy@haproxy.org>
Build options :
TARGET = linux2628
CPU = generic
CC = gcc
CFLAGS = -g -O2 -fPIE -fstack-protector --param=ssp-buffer-size=4 -Wformat -Werror=format-security -D_FORTIFY_SOURCE=2
OPTIONS = USE_GETADDRINFO=1 USE_ZLIB=1 USE_REGPARM=1 USE_OPENSSL=1 USE_LUA=1 USE_PCRE=1 USE_NS=1
Default settings :
maxconn = 2000, bufsize = 16384, maxrewrite = 1024, maxpollevents = 200
Encrypted password support via crypt(3): yes
Built with zlib version : 1.2.8
Running on zlib version : 1.2.8
Compression algorithms supported : identity("identity"), deflate("deflate"), raw-deflate("deflate"), gzip("gzip")
Built with OpenSSL version : OpenSSL 1.0.1f 6 Jan 2014
Running on OpenSSL version : OpenSSL 1.0.1f 6 Jan 2014
OpenSSL library supports TLS extensions : yes
OpenSSL library supports SNI : yes
OpenSSL library supports prefer-server-ciphers : yes
Built with PCRE version : 8.31 2012-07-06
Running on PCRE version : 8.31 2012-07-06
PCRE library supports JIT : no (USE_PCRE_JIT not set)
Built with Lua version : Lua 5.3.1
Built with transparent proxy support using: IP_TRANSPARENT IPV6_TRANSPARENT IP_FREEBIND
Built with network namespace support
Available polling systems :
epoll : pref=300, test result OK
poll : pref=200, test result OK
select : pref=150, test result OK
Total: 3 (3 usable), will use epoll.
Available filters :
[COMP] compression
[TRACE] trace
[SPOE] spoe
No other issues besides the health checks (which cause decreased throughput).
So by disabling the health check, performance is fine?
Like I said, post the haproxy log; it will contain the reason the health check failed.
The log is very verbose because of the traffic I’m putting through it. What can I grep for? Is there anything I need to do to enable health check logging? I haven’t seen it.
Health checks are always logged; no need to enable anything.
If it’s too much output, disable everything else in the frontend:
option dontlog-normal
option dontlognull
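You can also grep the log for the server state-change messages haproxy emits when a check fails, something along these lines (the log path is just an example, and the exact wording may differ slightly):

grep -E "is DOWN|is UP|Health check" /var/log/haproxy.log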
Ok, thanks Lukas! I’ll spin my load testing stack back up and report back.
@lukastribus sorry for the delay. I turned on option dontlog-normal, but I still get pretty noisy logs, and I don’t see health checks in them. Here’s an example of a log line I see (which seems normal to me: a 200 status code).
Oct 5 14:48:24 colony haproxy[6307]: 10.2.1.112:59228 [05/Oct/2017:14:48:17.852] frontend backend/A 0/0/6031/282/6411 200 968 - - ---- 1805/1805/1356/99/1 0/0 {bar} "GET /www.google.com?foo=12868837 HTTP/1.1"
(I replaced the names of some of the servers).