Stops checking health

Hi,

I am using HAProxy in Kubernetes to do active/passive load balancing on a StatefulSet with 2 nodes. This is the config I am using:

global
  log stdout format raw local0

defaults
  default-server init-addr libc,none
  log global
  mode http
  timeout client 20s
  timeout server 5s
  timeout connect 4s

frontend port_8080
  bind *:8080
  mode http
  default_backend repository

backend repository
  option log-health-checks
  mode http
  option httpchk GET /alfresco/api/-default-/public/alfresco/versions/1/probes/-live-
  default-server inter 10s downinter 5s
  server repo0 "${active}:8080/" check fall 2 rise 4
  server repo1 "${passive}:8080/" check backup fall 5 rise 2

${active} and ${passive} are environment variables pointing to the active and the passive node.

This works well so far: all traffic is routed to the active node, and none to the passive node.

When I then kill the active node on purpose, HAProxy notices very quickly and routes all traffic to the passive node. But it then stops checking whether the active node comes back. I don’t see any failed checks in the logs; it seems HAProxy stops checking altogether.
When I then kill the passive node (after the active node is back), HAProxy reports that it has no backend servers available anymore and stops working altogether.

How can I configure HAProxy to keep running health checks and fail back to the active node once it is reachable again?

Thanks!

Some additional information:

I have two different Kubernetes clusters, one running OpenShift and one set up with kubeadm. On the kubeadm cluster it works just fine; on the OpenShift cluster it does not. The difference between the two seems to be that on OpenShift the check fails at Layer 4 (“No route to host”), while on kubeadm it is a Layer 7 timeout:

[WARNING] 197/073245 (7) : Health check for backup server repository/repo1 succeeded, reason: Layer7 check passed, code: 200, check duration: 4ms, status: 5/5 UP.
Health check for backup server repository/repo1 succeeded, reason: Layer7 check passed, code: 200, check duration: 4ms, status: 5/5 UP.
Health check for server repository/repo0 succeeded, reason: Layer7 check passed, code: 200, check duration: 7ms, status: 2/2 UP.
[WARNING] 197/073245 (7) : Health check for server repository/repo0 succeeded, reason: Layer7 check passed, code: 200, check duration: 7ms, status: 2/2 UP.
[WARNING] 197/083207 (7) : Health check for server repository/repo0 failed, reason: Layer4 timeout, check duration: 10002ms, status: 1/2 UP.
Health check for server repository/repo0 failed, reason: Layer4 timeout, check duration: 10002ms, status: 1/2 UP.
10.254.4.1:42066 [16/Jul/2020:08:31:56.114] port_8080 repository/repo0 0/0/-1/-1/16009 503 222 - - sC-- 1/1/0/0/3 0/0 "GET /alfresco/s/enterprise/admin HTTP/1.1"
Health check for server repository/repo0 failed, reason: Layer4 timeout, check duration: 10001ms, status: 0/4 DOWN.
[WARNING] 197/083227 (7) : Health check for server repository/repo0 failed, reason: Layer4 timeout, check duration: 10001ms, status: 0/4 DOWN.
[WARNING] 197/083227 (7) : Server repository/repo0 is DOWN. 0 active and 1 backup servers left. Running on backup. 1 sessions active, 0 requeued, 0 remaining in queue.
Server repository/repo0 is DOWN. 0 active and 1 backup servers left. Running on backup. 1 sessions active, 0 requeued, 0 remaining in queue.
10.254.4.1:42066 [16/Jul/2020:08:32:20.248] port_8080 repository/repo0 0/0/-1/-1/13225 503 0 - - sC-- 1/1/0/0/3 0/0 "GET /alfresco/s/enterprise/admin HTTP/1.1"
Health check for server repository/repo0 failed, reason: Layer4 connection problem, info: "No route to host", check duration: 1241ms, status: 0/4 DOWN.
[WARNING] 197/083233 (7) : Health check for server repository/repo0 failed, reason: Layer4 connection problem, info: "No route to host", check duration: 1241ms, status: 0/4 DOWN.

Can you provide the output of haproxy -vv please? What’s the OS?

Hi,

I was able to fix the issue with the help of a colleague. The problem was that HAProxy wasn’t clearing its DNS cache, so after the node came back with a new IP address, the server stayed down.

By adding this to the haproxy.cfg:

resolvers myresolver
  parse-resolv-conf
  resolve_retries       30
  timeout retry         4s
  hold valid           60s
  hold nx              10s

it works as expected.
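
A note for anyone reproducing this: as far as I know, a resolvers section only takes effect for servers that reference it with the resolvers keyword. The snippet above does not show that part, so the following is only a sketch of how the server lines might look. It assumes the section name myresolver from above and the server lines from the first post, otherwise unchanged:

```
backend repository
  option log-health-checks
  mode http
  option httpchk GET /alfresco/api/-default-/public/alfresco/versions/1/probes/-live-
  default-server inter 10s downinter 5s
  # Assumption: "resolvers myresolver" is added per server so that the
  # hostname is re-resolved at runtime and a new pod IP is picked up.
  server repo0 "${active}:8080/" check fall 2 rise 4 resolvers myresolver
  server repo1 "${passive}:8080/" check backup fall 5 rise 2 resolvers myresolver
```

With init-addr libc,none already set in the defaults section, HAProxy should also start when a hostname does not resolve yet, and then fill in the address once the resolver returns one.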
