Stops checking health

mugnipper · July 15, 2020, 12:24pm

Hi,

I am using HAProxy in Kubernetes to do active/passive balancing across a StatefulSet with 2 nodes. This is the config I am using:

global
  log stdout format raw local0

defaults
  default-server init-addr libc,none
  log global
  mode http
  timeout client 20s
  timeout server 5s
  timeout connect 4s

frontend port_8080
  bind *:8080
  mode http
  default_backend repository

backend repository
  option log-health-checks
  mode http
  option httpchk GET /alfresco/api/-default-/public/alfresco/versions/1/probes/-live-
  default-server inter 10s downinter 5s
  server repo0 "${active}:8080/" check fall 2 rise 4
  server repo1 "${passive}:8080/" check backup fall 5 rise 2

${active} and ${passive} are environment variables pointing to the active and the passive node.

This works well so far: all traffic is routed to the active node and none to the passive node.

When I then kill the active node on purpose, HAProxy notices very quickly and routes all traffic to the passive node. But it then stops checking whether the active node has come back. I don't see any failed checks in the logs; it seems HAProxy stops checking altogether.
When I then kill the passive node (after the active node is back), HAProxy reports that it has no backend servers available and stops working altogether.

How can I configure HAProxy to keep running health checks and fail back to the active node once it is reachable again?

Thanks!

mugnipper · July 16, 2020, 8:53am

Some additional information:

I have two different Kubernetes clusters, one running OpenShift and one set up with kubeadm. On the kubeadm cluster it works just fine; on the OpenShift cluster it does not. The difference between the two seems to be that on OpenShift it's a Layer 4 timeout ("No route to host"), while on kubeadm it's a Layer 7 timeout:

[WARNING] 197/073245 (7) : Health check for backup server repository/repo1 succeeded, reason: Layer7 check passed, code: 200, check duration: 4ms, status: 5/5 UP.
Health check for backup server repository/repo1 succeeded, reason: Layer7 check passed, code: 200, check duration: 4ms, status: 5/5 UP.
Health check for server repository/repo0 succeeded, reason: Layer7 check passed, code: 200, check duration: 7ms, status: 2/2 UP.
[WARNING] 197/073245 (7) : Health check for server repository/repo0 succeeded, reason: Layer7 check passed, code: 200, check duration: 7ms, status: 2/2 UP.
[WARNING] 197/083207 (7) : Health check for server repository/repo0 failed, reason: Layer4 timeout, check duration: 10002ms, status: 1/2 UP.
Health check for server repository/repo0 failed, reason: Layer4 timeout, check duration: 10002ms, status: 1/2 UP.
10.254.4.1:42066 [16/Jul/2020:08:31:56.114] port_8080 repository/repo0 0/0/-1/-1/16009 503 222 - - sC-- 1/1/0/0/3 0/0 "GET /alfresco/s/enterprise/admin HTTP/1.1"
Health check for server repository/repo0 failed, reason: Layer4 timeout, check duration: 10001ms, status: 0/4 DOWN.
[WARNING] 197/083227 (7) : Health check for server repository/repo0 failed, reason: Layer4 timeout, check duration: 10001ms, status: 0/4 DOWN.
[WARNING] 197/083227 (7) : Server repository/repo0 is DOWN. 0 active and 1 backup servers left. Running on backup. 1 sessions active, 0 requeued, 0 remaining in queue.
Server repository/repo0 is DOWN. 0 active and 1 backup servers left. Running on backup. 1 sessions active, 0 requeued, 0 remaining in queue.
10.254.4.1:42066 [16/Jul/2020:08:32:20.248] port_8080 repository/repo0 0/0/-1/-1/13225 503 0 - - sC-- 1/1/0/0/3 0/0 "GET /alfresco/s/enterprise/admin HTTP/1.1"
Health check for server repository/repo0 failed, reason: Layer4 connection problem, info: "No route to host", check duration: 1241ms, status: 0/4 DOWN.
[WARNING] 197/083233 (7) : Health check for server repository/repo0 failed, reason: Layer4 connection problem, info: "No route to host", check duration: 1241ms, status: 0/4 DOWN.

lukastribus · July 16, 2020, 1:51pm

Can you provide the output of haproxy -vv please? What’s the OS?

mugnipper · July 17, 2020, 6:18am

Hi,

I was able to fix the issue with the help of a colleague. The problem was that HAProxy wasn't refreshing its cached DNS result, so the server stayed marked down after the node came back with a new IP address.

By adding this to the haproxy.cfg:

resolvers myresolver
  parse-resolv-conf
  resolve_retries       30
  timeout retry         4s
  hold valid           60s
  hold nx              10s

it works as expected.
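
The snippet above only defines the resolvers section. HAProxy re-resolves a server's hostname at runtime only when the server line references a resolvers section through the resolvers keyword, and the thread does not show that part. A minimal sketch, assuming the backend from the first post was extended as follows (the added "resolvers myresolver" arguments are an assumption, not the poster's confirmed lines):

```
backend repository
  option log-health-checks
  mode http
  option httpchk GET /alfresco/api/-default-/public/alfresco/versions/1/probes/-live-
  default-server inter 10s downinter 5s
  # Assumption: "resolvers myresolver" is appended so each hostname is
  # re-resolved at runtime and a changed pod IP is picked up.
  server repo0 "${active}:8080/" check fall 2 rise 4 resolvers myresolver
  server repo1 "${passive}:8080/" check backup fall 5 rise 2 resolvers myresolver
```

With "hold valid 60s", a resolved address is treated as valid for 60 seconds, so a new IP for the active node should be picked up within roughly that window, after which the health checks target the new address.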
