Hey all, I’m currently trying to migrate my servers from NGINX to HAProxy, but after restarting the proxies with the new configuration, the conntrack and active connection counts skyrocket to around 600k and 20k respectively. I’ve been looking at this issue for a week and have no idea how to proceed. I’ve looked at tcpdumps and other tools like ss, but I honestly don’t know what to look for. The logs don’t really show anything; I haven’t tried setting them to a verbose mode yet as they generate so much garbage. Normally conntrack hovers around 15k per server. What’s also odd is that if one haproxy reloads, the other proxies also spike to around 600k in conntrack. What on earth could be happening? Thanks for the help.
config: global daemon maxconn 50000 user haproxy group haproxy - Pastebin.com
http-response del-header Connection
You are interfering with haproxy’s connection handling. Don’t do that. I know those crazy hacks (overwriting connection handling headers) are considered normal in the nginx world, but that is definitely not the case with haproxy.
http-response set-header Connection close if exceeded_connection reset
Here too, don’t do this. If you believe you need to get crazy with connection headers later on, I will probably not be able to stop you, but please get your baseline numbers first without it. Your mileage will certainly vary if you choose to do so.
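Just to be explicit about what the baseline would look like: keep the frontend as it is and simply drop both directives, roughly like this (a sketch only; the bind line is an example, keep your existing one):
frontend site
# example bind, keep whatever you currently have
bind *:443 ssl crt /etc/ssl/private/haproxy.pem alpn h2,http/1.1
mode http
# note: no "http-response del-header Connection" and no
# "http-response set-header Connection close if exceeded_connection reset"
default_backend site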
Please provide the output of haproxy -vv
and from the ss output, try to understand if there is a pattern: for example, are most of the sockets in CLOSE_WAIT state? Are most of the sockets between haproxy and the backend servers, or between haproxy and the clients? Things like that could help narrow down the root cause.
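For example, something along these lines gives a quick overview (a rough sketch; the ports are assumptions, adjust them to your actual frontend binds and backend port):
# sockets per TCP state
ss -tan | awk 'NR>1 {print $1}' | sort | uniq -c | sort -rn
# established sockets towards the backend servers (assuming they listen on :3030)
ss -tan state established '( dport = :3030 )' | wc -l
# established sockets on the client-facing side (assuming a frontend bound to :9022)
ss -tan state established '( sport = :9022 )' | wc -l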
Hi @lukastribus, thank you for getting back to me. Sorry, I only saw this now.
Thank you for your advice on the connection header handling. This is the first haproxy instance we’re running, so we were just trying to emulate our NGINX setup. I’ll remove it and see how it performs.
I’m currently running HAProxy in Docker. Here is a version list:
Haproxy: haproxy:2.3.4-alpine
Docker: Docker version 17.04.0-ce, build 4845c56
host: Ubuntu 12.04.5 LTS
Output of haproxy -vv (run inside the image):
Status: stable branch - will stop receiving fixes around Q1 2022.
Known bugs: http://www.haproxy.org/bugs/bugs-2.3.4.html
Running on: Linux 3.13.0-117-generic #164~precise1-Ubuntu SMP Mon Apr 10 16:16:25 UTC 2017 x86_64
Build options :
TARGET = linux-musl
CPU = generic
CC = cc
CFLAGS = -O2 -g -Wall -Wextra -Wdeclaration-after-statement -fwrapv -Wno-address-of-packed-member -Wno-unused-label -Wno-sign-compare -Wno-unused-parameter -Wno-clobbered -Wno-missing-field-initializers -Wno-cast-function-type -Wtype-limits -Wshift-negative-value -Wshift-overflow=2 -Wduplicated-cond -Wnull-dereference
OPTIONS = USE_PCRE2=1 USE_PCRE2_JIT=1 USE_GETADDRINFO=1 USE_OPENSSL=1 USE_LUA=1 USE_ZLIB=1
DEBUG =
Feature list : +EPOLL -KQUEUE +NETFILTER -PCRE -PCRE_JIT +PCRE2 +PCRE2_JIT +POLL -PRIVATE_CACHE +THREAD -PTHREAD_PSHARED -BACKTRACE -STATIC_PCRE -STATIC_PCRE2 +TPROXY +LINUX_TPROXY +LINUX_SPLICE +LIBCRYPT +CRYPT_H +GETADDRINFO +OPENSSL +LUA +FUTEX +ACCEPT4 -CLOSEFROM +ZLIB -SLZ +CPU_AFFINITY +TFO +NS +DL +RT -DEVICEATLAS -51DEGREES -WURFL -SYSTEMD -OBSOLETE_LINKER +PRCTL +THREAD_DUMP -EVPORTS
Default settings :
bufsize = 16384, maxrewrite = 1024, maxpollevents = 200
Built with multi-threading support (MAX_THREADS=64, default=24).
Built with OpenSSL version : OpenSSL 1.1.1i 8 Dec 2020
Running on OpenSSL version : OpenSSL 1.1.1i 8 Dec 2020
OpenSSL library supports TLS extensions : yes
OpenSSL library supports SNI : yes
OpenSSL library supports : TLSv1.0 TLSv1.1 TLSv1.2 TLSv1.3
Built with Lua version : Lua 5.3.6
Built with network namespace support.
Built with the Prometheus exporter as a service
Built with zlib version : 1.2.11
Running on zlib version : 1.2.11
Compression algorithms supported : identity("identity"), deflate("deflate"), raw-deflate("deflate"), gzip("gzip")
Built with transparent proxy support using: IP_TRANSPARENT IPV6_TRANSPARENT IP_FREEBIND
Built with PCRE2 version : 10.36 2020-12-04
PCRE2 library supports JIT : yes
Encrypted password support via crypt(3): yes
Built with gcc compiler version 10.2.1 20201203
Available polling systems :
epoll : pref=300, test result OK
poll : pref=200, test result OK
select : pref=150, test result OK
Total: 3 (3 usable), will use epoll.
Available multiplexer protocols :
(protocols marked as <default> cannot be specified using 'proto' keyword)
h2 : mode=HTTP side=FE|BE mux=H2
fcgi : mode=HTTP side=BE mux=FCGI
<default> : mode=HTTP side=FE|BE mux=H1
<default> : mode=TCP side=FE|BE mux=PASS
Available services : prometheus-exporter
Available filters :
[SPOE] spoe
[CACHE] cache
[FCGI] fcgi-app
[COMP] compression
[TRACE] trace
Thanks for the ss tip. I’ll look at the output when the issue occurs again. In the last couple of weeks I have noticed a pattern: if I restart the proxies, metrics like conntrack_count and active connections are low and at a reasonable level [1]. The issue pops up when a backend server crashes. After that happens, these two metrics are several orders of magnitude higher than normal (see the time prior to 6am in [1]). These large counts are always preceded by a crashing backend server, a large connection spike [2] and a bunch of TCP errors [3].
I’ll have a look again at the ss output and see what I can find there. It appears to me (though I’m not sure) that once a backend server crashes, the stale connections in haproxy are not cleaned up, although that wouldn’t explain why the conntrack count is so high on the machines.
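Next time it spikes I’ll try to capture a snapshot along these lines (a rough sketch on my part; having conntrack-tools and socat available, and being able to reach the stats socket from where I run this, are assumptions):
# current conntrack entry count on the host
cat /proc/sys/net/netfilter/nf_conntrack_count
# top destination ports among the conntrack entries
conntrack -L -p tcp 2>/dev/null | awk '{ for (i = 1; i <= NF; i++) if ($i ~ /^dport=/) { print $i; break } }' | sort | uniq -c | sort -rn | head
# haproxy's own view: pxname, svname, scur, smax per frontend/backend/server
echo "show stat" | socat stdio /var/run/haproxy.sock | cut -d, -f1,2,5,6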
Figures:
Figure 1: conntrack count and active connections over time
Figure 2: connection spike
Figure 3: TCP errors
Please share the configuration you are currently using, including the server parameters and health checks.
Also share haproxy logs at the time of the server crash.
Here is the configuration file.
global
daemon
maxconn 150000
user haproxy
group haproxy
log 127.0.0.1:514 local0 notice
stats socket /var/run/haproxy.sock expose-fd listeners
# each conn is around 200 bytes, thus we reserve 200 MB for ssl caching
# below we allow 1,000,000 connections to be cached
tune.ssl.cachesize 1000000
nbproc 1
nbthread 22
cpu-map auto:1/1-22 0-21
master-worker
defaults
log global
mode http
option httplog
option dontlognull
timeout connect 5s
timeout check 5s
timeout client 30s
timeout server 30s
timeout http-keep-alive 10s
option http-keep-alive
frontend stats
bind <%= scope.function_interface_by_tag(['public', 'address']) %>:8999
bind *:8999
mode http
stats enable
stats uri /
frontend site
maxconn 25000
bind *:9022 ssl crt /etc/ssl/private/haproxy.pem alpn h2,http/1.1
mode http
stick-table type string size 10k store gpc0
http-request set-var(sess.src_port) src_port
http-request set-var(sess.source) src,concat(:,sess.src_port)
http-request track-sc0 var(sess.source)
http-request sc-inc-gpc0
acl exceeded_connection sc0_get_gpc0 ge 10000
acl reset sc0_clr_gpc0 ge 0
http-response set-header Connection close if exceeded_connection reset
acl is_authorized hdr(Authorization) token
http-request deny if !is_authorized
default_backend site
backend site
balance roundrobin
http-reuse always
mode http
option tcp-check
option srvtcpka
srvtcpka-intvl 10s
srvtcpka-cnt 3
<%- for i in 1..36 -%>
server node-<%= i.to_s.rjust(2, '0') %> node-<%= i.to_s.rjust(2, '0') %> check port 3030 weight 100 alpn http/1.1
<%- end -%>
frontend site
maxconn 25000
bind *:9031
mode http
stick-table type string size 10k store gpc0
http-request set-var(sess.src_port) src_port
http-request set-var(sess.source) src,concat(:,sess.src_port)
http-request track-sc0 var(sess.source)
http-request sc-inc-gpc0
acl exceeded_connection sc0_get_gpc0 ge 10000
acl reset sc0_clr_gpc0 ge 0
http-response set-header Connection close if exceeded_connection reset
default_backend site
backend site
balance roundrobin
http-reuse always
mode http
option tcp-check
option srvtcpka
srvtcpka-intvl 10s
srvtcpka-cnt 3
<%- for i in 1..36 -%>
server node-<%= i.to_s.rjust(2, '0') %> node-<%= i.to_s.rjust(2, '0') %> check port 3030 weight 100 alpn http/1.1
<%- end -%>
frontend site
maxconn 40000
bind *:9042 ssl crt /etc/ssl/private/haproxy.pem
mode http
stick-table type string size 10k store gpc0
http-request set-var(sess.src_port) src_port
http-request set-var(sess.source) src,concat(:,sess.src_port)
http-request track-sc0 var(sess.source)
http-request sc-inc-gpc0
acl exceeded_connection sc0_get_gpc0 ge 10000
acl reset sc0_clr_gpc0 ge 0
http-response set-header Connection close if exceeded_connection reset
default_backend site
backend site
balance roundrobin
http-reuse always
mode http
option httpchk GET /health
http-check expect status 200
option srvtcpka
srvtcpka-intvl 10s
srvtcpka-cnt 3
<%- for i in 1..36 -%>
server node-<%= i.to_s.rjust(2, '0') %> node-<%= i.to_s.rjust(2, '0') %> check port 3030 weight 100 alpn http/1.1
<%- end -%>
frontend site
maxconn 25000
bind *:9091 ssl crt /etc/ssl/private/haproxy.pem
mode http
stick-table type string size 10k store gpc0
http-request set-var(sess.src_port) src_port
http-request set-var(sess.source) src,concat(:,sess.src_port)
http-request track-sc0 var(sess.source)
http-request sc-inc-gpc0
acl exceeded_connection sc0_get_gpc0 ge 10000
acl reset sc0_clr_gpc0 ge 0
http-response set-header Connection close if exceeded_connection reset
default_backend site
backend site
balance roundrobin
http-reuse always
mode http
option tcp-check
option srvtcpka
srvtcpka-intvl 10s
srvtcpka-cnt 3
<%- for i in 1..36 -%>
server node-<%= i.to_s.rjust(2, '0') %> node-<%= i.to_s.rjust(2, '0') %> check port 3030 weight 100 alpn http/1.1
<%- end -%>
The errors in the logs are just copies of this line:
Mar 9 07:48:02 127.0.0.1 haproxy: Proxy [frontend] reached process FD limit (maxsock=300398). Please check 'ulimit-n' and restart. {} " http:// "
(the same line is repeated over and over)
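If I read that maxsock number right (not 100% sure), it roughly lines up with maxconn 150000 from the global section, since haproxy reserves about two file descriptors per connection (one on the client side, one on the server side):
2 x 150000 (maxconn)             = 300000
+ listeners, checks, pipes, etc. ≈    398
maxsock                          ≈ 300398
So the process really does seem to run out of sockets rather than hitting some unrelated limit.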
Lastly, I should mention that there are open GitHub issues that seem to demonstrate this behaviour: https://github.com/haproxy/haproxy/issues/136 and https://github.com/haproxy/haproxy/issues/1003 (Backend connection leak after connection failures).
Here is the output of ss -a | awk '{print $1}' | sort | uniq -c
266412 ESTAB
1 FIN-WAIT-1
54 FIN-WAIT-2
21 LISTEN
1 State
2 SYN-RECV
34 SYN-SENT
1860 TIME-WAIT
The operating system is completely obsolete and the kernel is at least 4 years old. Considering that we are talking about a socket issue, the kernel may very well play a role here. I strongly suggest you upgrade the operating system to a supported one.
You are also missing 4 years of security fixes, so this is something you will have to do anyway.
I don’t think the GitHub issues are related: #136 was fixed in 2.0.5, and for the other one I’m not sure I’m seeing the same issue.
@lukastribus Unfortunately, I’m not allowed to perform an OS upgrade. However, I did solve my issue by downgrading HAProxy to 2.1. I’m not sure why this helps, but it solved my problem in the meantime. Thank you for your help.