Tcp-check leaves lots of TIME_WAITs

Typical tcp-check in backend section:

backend backend_redis_write
  mode tcp
  option tcp-check
  tcp-check connect
  tcp-check send AUTH\ password\r\n
  tcp-check expect string +OK
  tcp-check send info\ replication\r\n
  tcp-check expect string role:master
  tcp-check send QUIT\r\n
  tcp-check expect string +OK
  server server1 server1:6379 check inter 1s on-marked-down shutdown-sessions on-marked-up shutdown-backup-sessions
  server server2 server2:6379 check inter 1s on-marked-down shutdown-sessions on-marked-up shutdown-backup-sessions
  server server3 server3:6379 check inter 1s on-marked-down shutdown-sessions on-marked-up shutdown-backup-sessions

With the config above the Haproxy host ends up with lots of sockets in TIME_WAIT state. Is it possible to use persistent connections for this and not open a new connection on every tcp-check sequence run?


It seems that Haproxy sends the FIN first, without waiting for the server to do so. This explains why we have the TIME_WAITs.

TIME_WAITs are not a problem per se, unless you run out of source ports.

Make sure you enable net.ipv4.tcp_tw_reuse, but don’t enable net.ipv4.tcp_tw_recycle.
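
For reference, a minimal sketch of the corresponding sysctl settings (the file path is just an example):

  # /etc/sysctl.d/90-timewait.conf (illustrative path)
  net.ipv4.tcp_tw_reuse = 1     # allow reusing TIME_WAIT sockets for new outgoing connections
  net.ipv4.tcp_tw_recycle = 0   # leave at 0 on kernels that still have it; breaks NAT clients, removed in Linux 4.12

Apply with sysctl --system (or sysctl -p on that file).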

That’s not supported for HTTP and pretty much impossible for TCP health checks, as we would have to implement a syntax that is able to describe proprietary TCP level transactions.

But like I said a high number of TIME_WAITs are a non-problem. Running out of source-ports is a problem, but I assume that’s not happening here.

I’m aware of tw_reuse (it’s turned ON). tw_recycle is evil (and removed in modern Linux kernels).

Just to clarify the problem: if I set up an external script that does almost the same check with the help of netcat, I get the perfect scenario where the Redis server sends the FIN first, not Haproxy, so TIME_WAIT states on the Haproxy host are almost eliminated.
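
Roughly, the external check does something like this (a sketch only; nc flag support differs between netcat variants, and "password" is a placeholder):

  # send the same AUTH / info replication / QUIT sequence and read the replies;
  # Redis closes the connection itself after QUIT, so the FIN comes from the server side
  printf 'AUTH password\r\ninfo replication\r\nQUIT\r\n' \
    | nc -w 1 server1 6379 | grep -q 'role:master'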

The problem is that Haproxy (as the client) sends the FIN packet first when the check phase is finished, instead of waiting for the FIN from the backend (Redis etc.).

Is there any way to ‘sleep’ or ‘delay’ sending the FIN packet from Haproxy so that the remote side is able to send its FIN first?

P.S. Having lots of TIME_WAIT sockets on the server’s side is not a problem. But with lots of them on the client’s side, get ready for errors like: “Cannot bind to source address before connect() for backend xxx. Aborting.”

That’s what I said: unless you run out of source ports, it’s not a problem.

You health check every second and TIME_WAIT lasts about 60 seconds, so I’d assume you have about 60 sockets in TIME_WAIT per server. I don’t see how you’d realistically run out of source ports.
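
If you want to verify that, something along these lines (assuming net-tools netstat; ss can do the same) counts TIME_WAIT sockets per destination:

  netstat -ant | awk '$6 == "TIME_WAIT" { print $5 }' | sort | uniq -c | sort -rn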

I’m not aware of an option to delay the health checks.

Unfortunately, there are lots of checks, many more than 60. For each backend it adds up to 7-8K TIME_WAITs, as I have several ports to check on the backends. Anyway, thank you for clarifying the current state of health checks.

What I meant was 60 sockets in TIME_WAIT state per haproxy declared server (which is per destination IP + per port).

Source port exhaustion happens based on the 5-tuple (minus the source port, obviously), therefore you will never run out of source ports in this case. You are using 60 source ports out of ~65000 for each destination IP/port combination.

Whether you have a total of 8k or 2 million of sockets in TIME_WAIT does not really matter for source port exhaustion.

Each haproxy process makes its own checks. So with only ~64K source ports I could quickly run out of available source ports for a single IP address.

P.S. Could you please explain why you said 5-tuple (maybe 4-tuple?) and excluded the source port? To my understanding it’s the src_port+src_ip+dst_ip+dst_port 4-tuple, and the last 3 items are always the same.

Sure, that’s one of the reasons why multi-threading was implemented: even if you really have 128 CPU cores, only a single health check per backend server occurs, because the number of processes remains 1 (instead of one check per process as with nbproc).
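
For example, a sketch of a threaded setup in the global section (tune the thread count to your hardware):

  global
    # a single process running several threads: each backend server is
    # health-checked once, instead of once per process as with nbproc > 1
    nbthread 4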

You are making arbitrary statements, without providing real or exemplary numbers. How do you come to this conclusion?

The fifth factor is the protocol, which in this case is always TCP.

Why would the dst_ip and dst_port always be the same?

  1. Number of TIME_WAIT sockets for just the one backend config listed in the initial post:
    14513 server1:6379
    14505 server2:6379
    14481 server3:6379
    There’s only one redis instance on each server.

  2. dst_ip and dst_port will always be the same for each server, true.

The main problem (I forgot to mention) is that I have Nginx running on the same node. It looks like it makes a bind() call using 0.0.0.0 as the address when the proxy_bind directive is used, so it needs an unused port to make an outgoing connection. When the number of outgoing connections is high I see errors like ‘Unable to bind address’ in error.log.

The only possible solution I can see so far is to use different outgoing IPs for Haproxy and Nginx. But the question is still open: why doesn’t Haproxy wait a while until the remote server closes the connection?

What needs to be done here is troubleshoot the high amount of sockets in TIME_WAIT. Like I said earlier, you should see about 60 sockets in TIME_WAIT per backend server (per haproxy process, if you have multiple).

60 sockets per backend server can be explained by the fact that haproxy closes first. 14000 per backend server cannot, unless you have nbproc 256.

Please share the entire configuration, including default and global section, as well as the output of:

  • haproxy -vv
  • uname -a
  • cat /proc/sys/net/ipv4/tcp_fin_timeout

So that is the real problem you are troubleshooting…

This nginx configuration is likely causing a mess. You need IP_BIND_ADDRESS_NO_PORT support to make this reliable. So you need at least:

  • linux kernel 4.2
  • libc 2.23
  • nginx 1.11.4

If any one of those releases is older, then you will have source port exhaustion on nginx when using proxy_bind.
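
You can confirm what you are actually running with, for example:

  uname -r                    # kernel version
  ldd --version | head -1     # glibc version
  nginx -V                    # nginx version and the flags it was built with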

edit: The reason is that when you bind to a source IP, an entire source port has to be reserved, because the destination IP and port are not yet known, so the full 5-tuple can’t be used. That’s why IP_BIND_ADDRESS_NO_PORT was introduced in the linux kernel (and in nginx as well as haproxy, for that matter).

Ok, got it. Thank you so much for the idea with IP_BIND_ADDRESS_NO_PORT. Will report back here.

Still, I don’t understand what a bind call to 0.0.0.0 is supposed to achieve. What is your proxy_bind configuration and why do you need it? How many IP addresses does your server have?

I dug deeper into the problem and would like to share the results.

Indeed, tcp-checks leave only ~2-3K TCP ports in TIME_WAIT state (per upstream). Most of the observed TIME_WAITs come from traffic (Redis clients). After a client gets its data from the server, it closes the connection to Haproxy by sending a FIN. Haproxy then starts closing the connection to the upstream Redis server by sending it a FIN too, first. The server replies with FIN, ACK, so the TCP socket on the proxy’s side transitions to TIME_WAIT for a duration of 2*MSL. Usually the server should send the FIN first, but not in this case, I think.

Regarding Nginx: it issues a bind() call with sin_addr set to the proxy_bind address (I was wrong about “0.0.0.0”), but Haproxy uses this IP address too. This can leave Nginx unable to create an outgoing connection.

So, a quick solution is to configure Haproxy to use a different outgoing IP address.

And, what should also make sense: let Haproxy manage its use of ports with the help of the "source <ip>:<port-range>" option.
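
Something like the following is what I have in mind (the address and port range are illustrative, and I am not sure yet whether it is safe to share across backends):

  backend backend_redis_write
    mode tcp
    # dedicated outgoing IP plus an explicit source port range for this backend
    source 10.10.2.21:10240-65000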

The question remains: is it possible to specify the same outgoing IP address for several independent backend groups? Will there be any conflict or not?

Why do you need proxy_bind in the first place? Isn’t that your primary IP address that’s used anyway? Why don’t you just remove proxy_bind from nginx?

Yes, but with IP_BIND_ADDRESS_NO_PORT support please.

I strongly advise against this. Not only will it not work across different backends (they will conflict with each other), it will also conflict with other haproxy processes, if any (nbproc > 1).

Please share nginx, haproxy, libc, kernel and OS release.

I use proxy_bind to specify an outgoing address other than the primary interface address, i.e. the primary address is from the 10.10.1.0/24 subnet, but connections to the upstream are only allowed from 10.10.2.0/24.
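
For example, roughly (the upstream name and exact address are placeholders):

  location / {
    proxy_bind 10.10.2.15;
    proxy_pass http://upstream_pool;
  }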

Do I need both Haproxy and Nginx to be compiled with this option enabled?

Unfortunately, I have nbproc > 1 (tried to use threaded mode but got segfaults with current Haproxy version).

Linux: CentOS 7.7 build 1908 x86_64
Kernel: 4.19.65-1 (manually built)
Glibc: glibc-2.17-292
Haproxy: 1.9.6 (going to upgrade to latest)
Nginx: 1.15.8 (going to upgrade to latest)

You need IP_BIND_ADDRESS_NO_PORT support wherever you are binding to a source address (in this case, nginx).

However, the kernel, haproxy and nginx already support this; only glibc does not (so it cannot work).

Personally I would suggest dropping the custom builds and upgrading to CentOS 8. But if that’s not a possibility, you’d need to make sure nginx supports IP_BIND_ADDRESS_NO_PORT despite libc lacking support for it. Ask the nginx folks how to manually define IP_BIND_ADDRESS_NO_PORT=24 during the build, to work around the lack of support in your libc.

According to https://bugzilla.redhat.com/show_bug.cgi?id=1579451 RHEL 7 already has support for IP_BIND_ADDRESS_NO_PORT both in kernel and glibc.

Defining IP_BIND_ADDRESS_NO_PORT for gcc via the -D option looks simple.
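
E.g. something like this (assuming the stock nginx build system; 24 is the value suggested above):

  ./configure --with-cc-opt='-DIP_BIND_ADDRESS_NO_PORT=24'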

Do you see “checking for IP_BIND_ADDRESS_NO_PORT ... found” in the output of the nginx configure script?

Output says:
[skipped]

checking for IP_TRANSPARENT … found
checking for IP_BINDANY … not found
checking for IP_BIND_ADDRESS_NO_PORT … found
checking for IP_RECVDSTADDR … not found
checking for IP_SENDSRCADDR … not found
checking for IP_PKTINFO … found

[skipped]