HAProxy health check problems

Hi,

We recently moved from HAProxy 1.5.16 to HAProxy 1.8.14 and noticed changed behaviour when a health check is configured for a server. The health check is performed even if a server is put into maintenance mode, which leads to servers producing 503 errors while they are still not ready to serve traffic. Examples:


Nov 13 11:42:19 <ip-address> haproxy[30218]: Server server_name/ID_12345 is going DOWN for maintenance. 0 active and 1 backup servers left. Running on backup. 0 sessions active, 0 requeued, 0 remaining in queue.
Nov 13 11:42:32 <ip-address> haproxy[29329]: Server server_name/ID_12345 is DOWN, reason: Layer4 connection problem, info: "Connection refused", check duration: 0ms. 0 active and 1 backup servers left. Running on backup. 0 sessions active, 0 requeued, 0 remaining in queue.
Nov 13 11:42:32 <ip-address> haproxy[29329]: <ip-address>:51667 [13/Nov/2018:11:42:32.571] server_name server_name/ID_23456 0/0/1/47/48 200 9502 - - ---- 11/1/0/1/0 0/0 "POST /path-to/service HTTP/1.1"
Nov 13 11:42:36 <ip-address> haproxy[29329]: <ip-address>:58673 [13/Nov/2018:11:42:36.933] server_name server_name/ID_23456 0/0/1/50/51 200 15547 - - ---- 7/1/0/1/0 0/0 "POST /path-to/service HTTP/1.1"
Nov 13 11:42:37 <ip-address> haproxy[30218]: <ip-address>:59146 [13/Nov/2018:11:42:37.238] server_name server_name/ID_23456 0/0/1/47/49 200 15980 - - ---- 11/1/0/1/0 0/0 "POST /path-to/service HTTP/1.1"
Nov 13 11:42:41 <ip-address> haproxy[29329]: <ip-address>:37371 [13/Nov/2018:11:42:41.334] server_name server_name/ID_23456 0/0/1/38/39 200 10163 - - ---- 14/1/0/1/0 0/0 "POST /path-to/service HTTP/1.1"
Nov 13 11:42:42 <ip-address> haproxy[30218]: <ip-address>:38616 [13/Nov/2018:11:42:42.111] server_name server_name/ID_23456 0/0/1/45/46 200 9058 - - ---- 6/1/0/1/0 0/0 "POST /path-to/service HTTP/1.1"
Nov 13 11:42:44 <ip-address> haproxy[30218]: <ip-address>:42563 [13/Nov/2018:11:42:44.645] server_name server_name/ID_23456 0/0/1/47/48 200 14814 - - ---- 13/1/0/1/0 0/0 "POST /path-to/service HTTP/1.1"
Nov 13 11:42:46 <ip-address> haproxy[29329]: <ip-address>:45652 [13/Nov/2018:11:42:46.557] server_name server_name/ID_23456 0/0/1/46/47 200 14466 - - ---- 11/1/0/1/0 0/0 "POST /path-to/service HTTP/1.1"
Nov 13 11:42:48 <ip-address> haproxy[29329]: <ip-address>:48349 [13/Nov/2018:11:42:48.247] server_name server_name/ID_23456 0/0/1/47/48 200 9763 - - ---- 15/1/0/1/0 0/0 "POST /path-to/service HTTP/1.1"
Nov 13 11:42:50 <ip-address> haproxy[29329]: Server server_name/ID_12345 is UP, reason: Layer4 check passed, check duration: 0ms. 1 active and 1 backup servers online. 0 sessions requeued, 0 total in queue.
Nov 13 11:42:51 <ip-address> haproxy[30218]: <ip-address>:53951 [13/Nov/2018:11:42:51.752] server_name server_name/ID_23456 0/0/1/42/43 200 8973 - - ---- 10/1/0/1/0 0/0 "POST /path-to/service HTTP/1.1"
Nov 13 11:42:52 <ip-address> haproxy[30218]: <ip-address>:54762 [13/Nov/2018:11:42:52.246] server_name server_name/ID_23456 0/0/1/29/30 200 7939 - - ---- 9/1/0/1/0 0/0 "POST /path-to/service HTTP/1.1"
Nov 13 11:42:53 <ip-address> haproxy[30218]: <ip-address>:56921 [13/Nov/2018:11:42:53.642] server_name server_name/ID_23456 0/0/1/35/36 200 10316 - - ---- 8/1/0/1/0 0/0 "POST /path-to/service HTTP/1.1"
Nov 13 11:42:56 <ip-address> haproxy[30218]: <ip-address>:33091 [13/Nov/2018:11:42:56.365] server_name server_name/ID_23456 0/0/1/30/31 200 6769 - - ---- 6/1/0/1/0 0/0 "POST /path-to/service HTTP/1.1"
Nov 13 11:43:00 <ip-address> haproxy[30218]: <ip-address>:39712 [13/Nov/2018:11:43:00.513] server_name server_name/ID_23456 0/0/1/31/32 200 6974 - - ---- 10/1/0/1/0 0/0 "POST /path-to/service HTTP/1.1"
Nov 13 11:43:01 <ip-address> haproxy[30218]: <ip-address>:41322 [13/Nov/2018:11:43:01.509] server_name server_name/ID_23456 0/0/1/42/43 200 12216 - - ---- 21/1/0/1/0 0/0 "POST /path-to/service HTTP/1.1"
Nov 13 11:43:02 <ip-address> haproxy[30218]: <ip-address>:42719 [13/Nov/2018:11:43:02.453] server_name server_name/ID_23456 0/0/0/40/40 200 9517 - - ---- 11/1/0/1/0 0/0 "POST /path-to/service HTTP/1.1"
Nov 13 11:43:03 <ip-address> haproxy[30218]: <ip-address>:43670 [13/Nov/2018:11:43:03.028] server_name server_name/ID_23456 0/0/1/41/42 200 11020 - - ---- 11/1/0/1/0 0/0 "POST /path-to/service HTTP/1.1"
Nov 13 11:43:09 <ip-address> haproxy[30218]: <ip-address>:53492 [13/Nov/2018:11:43:09.240] server_name server_name/ID_23456 0/0/1/40/41 200 8786 - - ---- 11/1/0/1/0 0/0 "POST /path-to/service HTTP/1.1"
Nov 13 11:43:11 <ip-address> haproxy[30218]: <ip-address>:57460 [13/Nov/2018:11:43:11.747] server_name server_name/ID_23456 0/0/0/68/68 200 15486 - - ---- 8/1/0/1/0 0/0 "POST /path-to/service HTTP/1.1"
Nov 13 11:43:24 <ip-address> haproxy[30218]: <ip-address>:49279 [13/Nov/2018:11:43:24.381] server_name server_name/ID_23456 0/0/1/65/67 200 27476 - - ---- 14/1/0/1/0 0/0 "POST /path-to/service HTTP/1.1"
Nov 13 11:43:25 <ip-address> haproxy[30218]: <ip-address>:51319 [13/Nov/2018:11:43:25.749] server_name server_name/ID_23456 0/0/1/32/33 200 10350 - - ---- 9/1/0/1/0 0/0 "POST /path-to/service HTTP/1.1"
Nov 13 11:43:32 <ip-address> haproxy[30218]: <ip-address>:34294 [13/Nov/2018:11:43:32.776] server_name server_name/ID_23456 0/0/1/48/49 200 14869 - - ---- 11/1/0/1/0 0/0 "POST /path-to/service HTTP/1.1"
Nov 13 11:43:33 <ip-address> haproxy[30218]: <ip-address>:35563 [13/Nov/2018:11:43:33.582] server_name server_name/ID_23456 0/0/1/37/38 200 8728 - - ---- 10/1/0/1/0 0/0 "POST /path-to/service HTTP/1.1"
Nov 13 11:43:35 <ip-address> haproxy[29329]: <ip-address>:38459 [13/Nov/2018:11:43:35.502] server_name server_name/ID_12345 0/0/0/-1/1 -1 0 - - SD-- 17/5/4/4/0 0/0 "POST /path-to/service HTTP/1.1"
Nov 13 11:43:39 <ip-address> haproxy[29329]: <ip-address>:45486 [13/Nov/2018:11:43:39.979] server_name server_name/ID_12345 0/0/0/-1/1 -1 0 - - SD-- 11/5/4/4/0 0/0 "POST /path-to/service HTTP/1.1"
Nov 13 11:43:40 <ip-address> haproxy[30218]: <ip-address>:45524 [13/Nov/2018:11:43:40.011] server_name server_name/ID_23456 0/0/1/34/35 200 6976 - - ---- 7/1/0/1/0 0/0 "POST /path-to/service HTTP/1.1"
Nov 13 11:43:40 <ip-address> haproxy[29329]: <ip-address>:46572 [13/Nov/2018:11:43:40.784] server_name server_name/ID_12345 0/0/1/-1/1 -1 0 - - SD-- 13/5/4/4/0 0/0 "POST /path-to/service HTTP/1.1"
Nov 13 11:43:54 <ip-address> haproxy[30218]: <ip-address>:39131 [13/Nov/2018:11:43:54.081] server_name server_name/ID_23456 0/0/1/36/37 200 9358 - - ---- 6/1/0/1/0 0/0 "POST /path-to/service HTTP/1.1"
Nov 13 11:43:56 <ip-address> haproxy[29329]: <ip-address>:42944 [13/Nov/2018:11:43:56.645] server_name server_name/ID_12345 0/0/1/-1/18 -1 0 - - SD-- 12/5/4/4/0 0/0 "POST /path-to/service HTTP/1.1"
Nov 13 11:44:00 <ip-address> haproxy[29329]: <ip-address>:49436 [13/Nov/2018:11:44:00.734] server_name server_name/ID_12345 0/0/0/-1/1 -1 0 - - SD-- 15/5/4/4/0 0/0 "POST /path-to/service HTTP/1.1"
Nov 13 11:44:09 <ip-address> haproxy[29329]: <ip-address>:34196 [13/Nov/2018:11:44:09.123] server_name server_name/ID_12345 0/0/0/-1/1 -1 0 - - SD-- 15/5/4/4/0 0/0 "POST /path-to/service HTTP/1.1"
Nov 13 11:44:09 <ip-address> haproxy[29329]: <ip-address>:35345 [13/Nov/2018:11:44:09.849] server_name server_name/ID_12345 0/0/0/-1/1 -1 0 - - SD-- 16/5/4/4/0 0/0 "POST /path-to/service HTTP/1.1"
Nov 13 11:44:10 <ip-address> haproxy[29329]: <ip-address>:55591 [13/Nov/2018:11:43:10.556] server_name server_name/ID_12345 0/0/0/-1/60001 504 194 - - sH-- 8/4/3/3/0 0/0 "POST /path-to/service HTTP/1.1"
Nov 13 11:44:14 <ip-address> haproxy[29329]: <ip-address>:33254 [13/Nov/2018:11:43:14.261] server_name server_name/ID_12345 0/0/0/-1/60002 504 194 - - sH-- 13/3/2/2/0 0/0 "POST /path-to/service HTTP/1.1"
Nov 13 11:44:14 <ip-address> haproxy[30218]: <ip-address>:42456 [13/Nov/2018:11:44:14.366] server_name server_name/ID_23456 0/0/1/49/50 200 14224 - - ---- 4/1/0/1/0 0/0 "POST /path-to/service HTTP/1.1"
Nov 13 11:44:18 <ip-address> haproxy[29329]: <ip-address>:49345 [13/Nov/2018:11:44:18.739] server_name server_name/ID_12345 0/0/0/-1/1 -1 0 - - SD-- 14/3/2/2/0 0/0 "POST /path-to/service HTTP/1.1"
Nov 13 11:44:19 <ip-address> haproxy[30218]: <ip-address>:50151 [13/Nov/2018:11:44:19.255] server_name server_name/ID_23456 0/0/1/36/37 200 10878 - - ---- 10/1/0/1/0 0/0 "POST /path-to/service HTTP/1.1"
Nov 13 11:44:20 <ip-address> haproxy[29329]: <ip-address>:52028 [13/Nov/2018:11:44:20.522] server_name server_name/ID_12345 0/0/4/-1/5 -1 0 - - SD-- 13/3/2/2/0 0/0 "POST /path-to/service HTTP/1.1"
Nov 13 11:44:21 <ip-address> haproxy[29329]: <ip-address>:52971 [13/Nov/2018:11:44:21.050] server_name server_name/ID_12345 0/0/0/-1/1 -1 0 - - SD-- 20/3/2/2/0 0/0 "POST /path-to/service HTTP/1.1"
Nov 13 11:44:24 <ip-address> haproxy[29329]: <ip-address>:59152 [13/Nov/2018:11:44:24.985] server_name server_name/ID_12345 0/0/0/-1/1 -1 0 - - SD-- 17/3/2/2/0 0/0 "POST /path-to/service HTTP/1.1"
Nov 13 11:44:25 <ip-address> haproxy[29329]: <ip-address>:59594 [13/Nov/2018:11:44:25.270] server_name server_name/ID_12345 0/0/0/-1/1 -1 0 - - SD-- 12/3/2/2/0 0/0 "POST /path-to/service HTTP/1.1"
Nov 13 11:44:26 <ip-address> haproxy[29329]: <ip-address>:33382 [13/Nov/2018:11:44:26.653] server_name server_name/ID_12345 0/0/0/-1/1 -1 0 - - SD-- 14/3/2/2/0 0/0 "POST /path-to/service HTTP/1.1"
Nov 13 11:44:26 <ip-address> haproxy[29329]: <ip-address>:34009 [13/Nov/2018:11:44:26.935] server_name server_name/ID_12345 0/0/0/-1/1 -1 0 - - SD-- 17/3/2/2/0 0/0 "POST /path-to/service HTTP/1.1"
Nov 13 11:44:27 <ip-address> haproxy[30218]: <ip-address>:34769 [13/Nov/2018:11:44:27.393] server_name server_name/ID_23456 0/0/1/46/48 200 8453 - - ---- 9/1/0/1/0 0/0 "POST /path-to/service HTTP/1.1"
Nov 13 11:44:54 <ip-address> haproxy[29329]: <ip-address>:58565 [13/Nov/2018:11:42:54.619] server_name server_name/ID_12345 0/0/1/-1/120003 504 219 - - sH-- 15/2/1/1/0 0/0 "POST /path-to/service HTTP/1.1"
Nov 13 11:44:58 <ip-address> haproxy[29329]: <ip-address>:36502 [13/Nov/2018:11:42:58.480] server_name server_name/ID_12345 0/0/1/-1/120003 504 219 - - sH-- 10/1/0/0/0 0/0 "POST /path-to/service HTTP/1.1"
Nov 13 11:45:27 <ip-address> haproxy[30218]: Server server_name/ID_12345 is UP/READY (leaving forced maintenance).

As you can see from the example above, the server with ID_12345 is forced into maintenance with the “disable server” command over the socket, but it goes back UP because the TCP port check is passing. However, the configuration documentation states that forced maintenance stops health checks as well. Can you advise how to fix this situation? Is this a bug?
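For reference, these are the commands we send over the stats socket (/tmp/haproxy, as declared in our global section) to put the server into and out of maintenance:

```
disable server listenrulename/ID_12345
enable server listenrulename/ID_12345
```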
Our current configuration for the listen rule looks like this:


listen listenrulename
bind :
mode http
balance leastconn
server ID_12345 : check inter 3000 rise 2 fall 3 maxconn 50 weight 6
server ID_23456 : check inter 3000 rise 2 fall 3 maxconn 50 weight 5 backup

Regards,
Nikolay

You have 2 haproxy processes running. PID 30218 behaves correctly; PID 29329 is what is causing you the pain. Now, is PID 29329 an old, obsolete process? Then kill it. If you are running an nbproc configuration, you need to send the admin socket command to every single process (each on a dedicated admin socket).
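With nbproc greater than 1, a dedicated admin socket per process can be declared with the process keyword; a sketch (the socket paths are only illustrative):

```
global
    nbproc 2
    stats socket /var/run/haproxy-1.sock process 1 level admin
    stats socket /var/run/haproxy-2.sock process 2 level admin
```

The “disable server” command then has to be sent to each socket in turn.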

We are using the haproxy seamless reload and a grace of 10 min to make sure we drain all the old connections before the process is killed by the next reload (reloads are done every 10 min). It looks like the old process brings the server up even though we disabled it. Any suggestions?

The old process does not bring the server up; the server never goes into maintenance mode there in the first place.

How long the old process stays around depends on a lot of things; the grace period is just a last resort to kill the process. A short keep-alive timeout is really the best way to have an obsolete process quit asap. Of course, if you have long-running transactions like large uploads or downloads, they will keep the process up forever as well, and in that case periodic reloading is not a good idea.
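A short keep-alive timeout would go into the defaults section, something like this (the value is only illustrative):

```
defaults
    timeout http-keep-alive 2s
```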

Can you share more of your configuration, most importantly the timeouts? Do you have long-running transactions or sessions switching to tunnel mode (websockets, for example)?

Indeed, the server goes into maintenance mode on the old process. We bring the remote port down (the process/app is updated), then bring a new remote process up on the same port, and the check kicks in before we re-enable the server. We figured the state of the server is not passed on properly.
The configuration is as follows:

global
log 127.0.0.1 local0
log 127.0.0.1 local1 debug
maxconn 64000
chroot /home/haproxy
user haproxy
group haproxy
nbproc 1
daemon
pidfile /home/haproxy/haproxy.pid
stats socket /tmp/haproxy user haproxy group haproxy expose-fd listeners level admin
tune.bufsize 131070

defaults
log global
mode http
option httplog
option dontlognull
option allbackups
option redispatch
option tcp-smart-accept
option tcp-smart-connect
option splice-auto
option abortonclose
retries 3
maxconn 2000
timeout connect 5000
timeout client 60000
timeout server 60000
timeout client-fin 10000
grace 600s
option accept-invalid-http-request
option httpclose

We use a grace of 10 min and reload haproxy every 10 min because of the dynamic nature of our fleet (servers reconfigured, removed, added, etc.). We noticed that if we remove the grace period, reloads work fine for some time, but after a few days we end up with an old haproxy process (not closed correctly) that starts generating a lot of 503s, because it uses the old configuration and tries to send requests to backend servers that no longer exist or are somehow unreachable for that process. We are currently exploring the state file, but without much luck so far.

No, that’s not what I was saying. The server goes into maintenance mode on the current haproxy instance, while it does not go into maintenance mode on the obsolete haproxy instance.

The grace configuration (I thought you were talking about hard-stop-after) is what is actually causing the issue.

Please remove grace completely from your configuration; it actively causes harm in your case. You don’t have to configure grace for the reload to be graceful. It is intended for an entirely different use case: using different grace values in backends to make a shutdown sequential.
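To illustrate the use case grace was actually designed for, a sketch with made-up proxy names and values: each proxy stays operational for its own grace period after a soft stop, so the shutdown becomes staggered:

```
backend app_frontend
    grace 10s

backend app_workers
    grace 60s
```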

Also, seamless reload and state files are not relevant to your problem.

Any suggestions on how we can fix the rogue process then?
I can apply a keep-alive timeout to make this a bit easier for haproxy.

It’s not a rogue process; it’s an old process that behaves exactly as per the configuration. Like I said, remove the grace keyword. It does not belong in your configuration, and removing it will fix this situation.

You disabled keep-alive with “option httpclose”, so tuning the keep-alive timeout is therefore unnecessary.

If after removing the grace keyword the old process still needs longer to quit due to long-running transactions, that is:

  • not relevant to your configuration, because the old process will not accept any new sessions
  • but you can also limit the time the old process runs with the hard-stop-after keyword, which will kill the process (and therefore its connections) after a certain amount of time
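hard-stop-after goes into the global section; a sketch (the 30-minute value is just an example):

```
global
    hard-stop-after 30m
```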

I will try the “hard-stop-after” suggestion, as this occasional “old process” was really causing problems in our case.
Thank you for the suggestion, I’ll report back on the outcome.

Just understand that hard-stop-after will kill existing sessions with in-flight data, so uploads or downloads will be terminated before they complete.

Hi @lukastribus,

The last suggestion did the trick for us. We are now happily running with “hard-stop-after” added, “option httpclose” in place, and “grace” removed.