Help! Backup sessions piling up at certain intervals

Well, it looks like connections from “worker2” are piling up in FIN_WAIT2 state.

Do you have another load balancer in front of it, with all requests coming from the same IP? worker2 doesn’t seem to be a backend server, because as far as I can tell from the netstat output it connects to your port 443 from high, random source ports. Can you clarify what IP worker2 is?

Sure. I also forgot one important point: maxconn needs to be double that number, because the connections on the backend side also have to be considered.

So, global maxconn is the maximum number of connections that one haproxy process handles. It should never be reached, because reaching it means the entire process no longer handles any requests at all, not even, for example, the stats page.

To avoid this, maxconn should also be configured per frontend, and the backend side has to be taken into account as well (because 1 frontend connection usually means 1 backend connection, we have to double the number I mentioned earlier).

Now, if we put a maxconn configuration into the defaults section, like in your configuration, every frontend will inherit the value from the defaults section.

So here is an example:

global
 maxconn 100

defaults
 maxconn 10

frontend a
frontend b
frontend c
frontend d
frontend e

Which is short for:

global
 maxconn 100

frontend a
 maxconn 10
frontend b
 maxconn 10
frontend c
 maxconn 10
frontend d
 maxconn 10
frontend e
 maxconn 10

In this case we have a global (process) value of 100, and each of the five frontends has maxconn 10.

So we have 5 x 10 = 50 in total for all the frontends, and we need to double this value to account for the connections on the backend side, which puts us at 100, exactly matching the global configuration. This would probably work, because it’s unlikely that all frontends are at 100% of their maxconn at the same time, but we should give global maxconn some headroom. So in this case we would bump it to 110 or so.
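Applied to the example above, the adjusted configuration would look roughly like this (the 110 is just the headroom figure mentioned above, not a magic number):

global
 # 5 frontends x maxconn 10 = 50 frontend connections
 # doubled for the backend side = 100
 # plus some headroom = 110
 maxconn 110

defaults
 maxconn 10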

Yes, you can syslog every request if you want, see more about logging configuration here. However, I believe that in this case you probably won’t see anything particularly special in the haproxy logs, because I think we are not properly timing out our TCP sessions, and that is what ultimately causes those issues.
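If request logging is not set up yet, a minimal sketch could look like the following (the syslog address and facility here are assumptions on my side, adjust them to your environment):

global
 log 127.0.0.1:514 local0

defaults
 log global
 option httplog   # or "option tcplog" for TCP-mode frontends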

I see 2 problems here:

  • connections from worker2 are piling up in FIN_WAIT2 state, probably in the proxy-https frontend
  • global maxconn is reached before frontend maxconn is reached, which causes haproxy to stop accepting new connections entirely. If frontend maxconn were reached but global maxconn were not, only that specific frontend would stop accepting new connections; other frontends would still work, as would the stats interface of haproxy, where you could extract useful information about the number of active sessions in that specific frontend. When global maxconn is reached though, not even the stats interface works.
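In case the stats interface is not enabled yet, here is a minimal sketch (the bind port, URI and refresh interval are just assumptions, pick whatever suits your setup):

listen stats
 mode http
 bind :8404
 stats enable
 stats uri /stats
 stats refresh 10s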

So I’m suggesting multiple things:

  • make sure global maxconn is configured considering the points explained above
  • you can keep client/server timeouts high, but I strongly suggest you specifically configure client-fin and server-fin timeouts to low values such as 60 seconds (put timeout client-fin 60s and timeout server-fin 60s into the defaults section; see the example after this list). What this means is that connections should not sit in FIN_WAIT states for hours, but only for a minute or so, and I believe this is the root cause of your issue.
  • also double check that the sysctl net.ipv4.tcp_fin_timeout is at its default of 60 seconds (not some very large value)
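For illustration, a defaults section along these lines (the 1h client/server values are only placeholders for whatever high timeouts you currently have):

defaults
 timeout client 1h        # your existing high timeout, kept as-is
 timeout server 1h        # your existing high timeout, kept as-is
 timeout client-fin 60s   # don't keep half-closed client connections around for hours
 timeout server-fin 60s   # same for the server side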