HAProxy community

Server-template still keeps out-of-date servers that are no longer in the SRV record

Hi,

server-template is a great feature for HAProxy users on Kubernetes, but we are running into a problem with haproxy 2.0.14 (the only version we’ve tried) when using server-template for backend server discovery on Kubernetes.

The background: we use haproxy as a load balancer that accepts incoming requests from clients and balances them across our backend app servers, which run in different pods in the same Kubernetes namespace. We also have several such deployments in other Kubernetes namespaces.

We found that server-template still keeps an out-of-date backend pod’s IP after the server is marked down. This happens when we scale down, i.e. reduce the number of backend app pods with the kubectl scale command.

The problem with this behavior is that the deleted pod’s IP is reclaimed by Kubernetes and reassigned to some new pod created later. In some not-rare cases, that new pod can be a similar app pod running in another Kubernetes namespace. This violates how we expect server-template to work: it should only care about what the SRV record says about the endpoints (pod IPs) of the backend app service in the current namespace.

Here is what we are seeing from show servers state:

   $ echo "show servers state" | socat stdio ./admin-1.sock
    # be_id be_name srv_id srv_name srv_addr srv_op_state srv_admin_state srv_uweight srv_iweight srv_time_since_last_change srv_check_status srv_check_result srv_check_health srv_check_state srv_agent_state bk_f_forced_id srv_f_forced_id srv_fqdn srv_port srvrecord
    8 app-perf-us-south-01 1 server-1 172.30.254.86 2 64 1 1 83648 15 3 4 6 0 0 0 172-30-254-86.app.app-perf.svc.cluster.local 5983 _https._tcp.app.app-perf.svc.cluster.local
    8 app-perf-us-south-01 2 server-2 172.30.233.39 2 64 1 1 83643 15 3 4 6 0 0 0 172-30-233-39.app.app-perf.svc.cluster.local 5983 _https._tcp.app.app-perf.svc.cluster.local
    8 app-perf-us-south-01 3 server-3 172.30.47.190 0 64 1 1 34795 7 2 0 6 0 0 0 - 5983 _https._tcp.app.app-perf.svc.cluster.local

So we can see that server-3, with IP 172.30.47.190, is still listed, with an empty srv_fqdn and an srv_op_state of 0.

Meanwhile, a dig on the SRV record at the same time shows no entry resolving to this IP:

$ dig -t SRV _https._tcp.app.app-perf.svc.cluster.local +short
0 4 5983 172-30-130-236.app.app-perf.svc.cluster.local.
0 4 5983 172-30-139-184.app.app-perf.svc.cluster.local.

Only two entries are present.

So once 172.30.47.190 is later reused by another pod, haproxy thinks the server is back up and starts distributing traffic to it, regardless of where the new pod lives (another namespace, or maybe even another kind of pod).
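To catch these stragglers automatically, the state dump can be filtered for servers that are still listed but whose srv_fqdn has been cleared to “-” (a quick sketch, not an official tool; field positions follow the header line of the "show servers state" output above):

```python
def stale_servers(state_dump: str):
    """Return (srv_name, srv_addr) pairs for servers that are still listed
    in a "show servers state" dump but have lost their FQDN (srv_fqdn == "-"),
    i.e. entries no longer backed by an SRV record."""
    stale = []
    for line in state_dump.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and the header line
        fields = line.split()
        # Per the header: index 3 is srv_name, 4 is srv_addr, 17 is srv_fqdn
        srv_name, srv_addr, srv_fqdn = fields[3], fields[4], fields[17]
        if srv_fqdn == "-":
            stale.append((srv_name, srv_addr))
    return stale
```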

So this problem is blocking us from using server-template on Kubernetes. I am wondering if I can get any help on this case, to identify whether I missed some configuration that would make server-template behave as I expect: always keep the server list in sync with what the SRV record resolves to.
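For reference, our setup is roughly as follows (a minimal sketch; the resolver name and nameserver address are placeholders, not our exact production values):

```haproxy
resolvers kube-dns
    nameserver dns1 10.96.0.10:53      # placeholder: cluster DNS service IP
    accepted_payload_size 8192
    hold valid 10s

backend app-perf-us-south-01
    balance roundrobin
    server-template server- 3 _https._tcp.app.app-perf.svc.cluster.local resolvers kube-dns check
```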

Here is the haproxy -vv output:

HA-Proxy version 2.0.14 2020/04/02 - https://haproxy.org/
Build options :
  TARGET  = linux-glibc
  CPU     = generic
  CC      = gcc
  CFLAGS  = -m64 -march=x86-64 -O2 -g -fno-strict-aliasing -Wdeclaration-after-statement -fwrapv -Wno-unused-label -Wno-sign-compare -Wno-unused-parameter -Wno-old-style-declaration -Wno-ignored-qualifiers -Wno-clobbered -Wno-missing-field-initializers -Wtype-limits
  OPTIONS = USE_PCRE2=1 USE_PCRE2_JIT=1 USE_THREAD=1 USE_PTHREAD_PSHARED=1 USE_REGPARM=1 USE_STATIC_PCRE2=1 USE_OPENSSL=1 USE_LUA=1 USE_SLZ=1 USE_TFO=1 USE_SYSTEMD=1

Feature list : +EPOLL -KQUEUE -MY_EPOLL -MY_SPLICE +NETFILTER -PCRE -PCRE_JIT +PCRE2 +PCRE2_JIT +POLL -PRIVATE_CACHE +THREAD +PTHREAD_PSHARED +REGPARM -STATIC_PCRE +STATIC_PCRE2 +TPROXY +LINUX_TPROXY +LINUX_SPLICE +LIBCRYPT +CRYPT_H -VSYSCALL +GETADDRINFO +OPENSSL +LUA +FUTEX +ACCEPT4 -MY_ACCEPT4 -ZLIB +SLZ +CPU_AFFINITY +TFO +NS +DL +RT -DEVICEATLAS -51DEGREES -WURFL +SYSTEMD -OBSOLETE_LINKER +PRCTL +THREAD_DUMP -EVPORTS

Default settings :
  bufsize = 16384, maxrewrite = 1024, maxpollevents = 200

Built with multi-threading support (MAX_THREADS=64, default=32).
Built with OpenSSL version : OpenSSL 1.1.1g  21 Apr 2020
Running on OpenSSL version : OpenSSL 1.1.1g  21 Apr 2020
OpenSSL library supports TLS extensions : yes
OpenSSL library supports SNI : yes
OpenSSL library supports : TLSv1.0 TLSv1.1 TLSv1.2 TLSv1.3
Built with Lua version : Lua 5.3.4
Built with network namespace support.
Built with transparent proxy support using: IP_TRANSPARENT IPV6_TRANSPARENT IP_FREEBIND
Built with libslz for stateless compression.
Compression algorithms supported : identity("identity"), deflate("deflate"), raw-deflate("deflate"), gzip("gzip")
Built with PCRE2 version : 10.30 2017-08-14
PCRE2 library supports JIT : yes
Encrypted password support via crypt(3): yes

Available polling systems :
      epoll : pref=300,  test result OK
       poll : pref=200,  test result OK
     select : pref=150,  test result OK
Total: 3 (3 usable), will use epoll.

Available multiplexer protocols :
(protocols marked as <default> cannot be specified using 'proto' keyword)
              h2 : mode=HTX        side=FE|BE     mux=H2
              h2 : mode=HTTP       side=FE        mux=H2
       <default> : mode=HTX        side=FE|BE     mux=H1
       <default> : mode=TCP|HTTP   side=FE|BE     mux=PASS

Available services : none

Available filters :
	[SPOE] spoe
	[COMP] compression
	[CACHE] cache
	[TRACE] trace

It will be great if I can get help here. Thank you very much.

We’ve observed this too in our environment, which is powered by HashiCorp Consul + Nomad. With a small example service, we might have two containers running on ephemeral ports, then deploy a new version of the app in new containers on different ephemeral ports. When those new containers become healthy, they are put into DNS for haproxy to retrieve, and the old container addresses are removed from DNS so that haproxy will mark them down.

At first this works really well. Assuming example-app1 and example-app2 are currently up, here’s what happens during a deploy of two new containers:

May  1 13:58:23 NOTICE:  b_example-app/example-app3 changed its FQDN from (null) to ansible.node.vagrant.consul by 'SRV record'
May  1 13:58:26 NOTICE:  Server b_example-app/example-app3 ('ansible.node.vagrant.consul') is UP/READY (resolves again).
May  1 13:58:26 NOTICE:  Server b_example-app/example-app3 administratively READY thanks to valid DNS answer.
May  1 13:58:26 NOTICE:  b_example-app/example-app4 changed its FQDN from (null) to ansible.node.vagrant.consul by 'SRV record'
May  1 13:58:29 NOTICE:  Server b_example-app/example-app4 ('ansible.node.vagrant.consul') is UP/READY (resolves again).
May  1 13:58:29 NOTICE:  Server b_example-app/example-app4 administratively READY thanks to valid DNS answer.

^ New containers are placed in DNS and consequently haproxy enables the backend servers. Great. Next, the old containers are removed from DNS, so haproxy marks them down for maintenance:

May  1 13:58:42 ALERT:  Server b_example-app/example-app1 is going DOWN for maintenance (No IP for server ). 3 active and 1 backup servers left. 0 sessions active, 0 requeued, 0 remaining in queue.
May  1 13:58:42 ALERT:  Server b_example-app/example-app2 is going DOWN for maintenance (No IP for server ). 2 active and 1 backup servers left. 0 sessions active, 0 requeued, 0 remaining in queue.

So far so good! This is the behaviour we designed for when we rolled out haproxy. Recently, though, some product teams have been seeing transient 503s when they deploy their apps, and after digging in I’ve determined that we’re seeing the same behaviour @xugy is describing. Sometimes haproxy behaves as we expect, and sometimes this happens instead:

May  1 14:04:04 NOTICE:  b_example-app/example-app3 changed its FQDN from (null) to ansible.node.vagrant.consul by 'SRV record'
May  1 14:04:04 NOTICE:  b_example-app/example-app4 changed its FQDN from (null) to ansible.node.vagrant.consul by 'SRV record'
May  1 14:04:07 NOTICE:  b_example-app/example-app3 changed its IP from  to 127.0.0.1 by consul/consul.
May  1 14:04:07 NOTICE:  b_example-app/example-app4 changed its IP from  to 127.0.0.1 by DNS cache.
May  1 14:04:07 NOTICE:  Server b_example-app/example-app3 is UP, reason: Layer4 check passed, check duration: 0ms. 3 active and 1 backup servers online. 0 sessions requeued, 0 total in queue.
May  1 14:04:07 NOTICE:  Server b_example-app/example-app4 is UP, reason: Layer4 check passed, check duration: 0ms. 4 active and 1 backup servers online. 0 sessions requeued, 0 total in queue. 

^ New containers rolling out as normal.

May  1 14:05:11 ALERT:  Server b_example-app/example-app1 is DOWN, reason: Layer4 connection problem, info: "Connection refused", check duration: 0ms. 3 active and 1 backup servers left. 0 sessions active, 0 requeued, 0 remaining in queue.
May  1 14:05:11 ALERT:  Server b_example-app/example-app2 is DOWN, reason: Layer4 connection problem, info: "Connection refused", check duration: 0ms. 2 active and 1 backup servers left. 0 sessions active, 0 requeued, 0 remaining in queue.

^ Shoot. Records for these backend servers were removed from DNS, but haproxy did not mark them down for maintenance and continued sending traffic to them. The old containers are eventually shut off, which causes haproxy to return 503s when it can’t proxy traffic to them. It eventually marks them down due to health check failures, not due to DNS.

There doesn’t seem to be much rhyme or reason to it, though. Sometimes backend servers are marked down for maintenance on deploy (the desired behaviour) and sometimes they’re not, meaning we get bursts of 503s until haproxy marks them unhealthy. Because haproxy still acts as if the server hasn’t been removed from DNS, I can jump onto the instance where haproxy thinks the container is and run nc -l <port>. Haproxy will then believe the backend server is healthy again and start sending traffic to me, even though that IP:port isn’t in the SRV record anymore.

Our port range is wide enough that we’re less worried about accidentally sending traffic to the wrong app, but there’s still a non-zero chance, and anyway, it’s resulting in these transient 503s.

haproxy -vv output:

HA-Proxy version 2.1.3 2020/02/12 - https://haproxy.org/
Status: stable branch - will stop receiving fixes around Q1 2021.
Known bugs: http://www.haproxy.org/bugs/bugs-2.1.3.html
Build options :
  TARGET  = linux-glibc
  CPU     = generic
  CC      = gcc
  CFLAGS  = -O2 -g -fno-strict-aliasing -Wdeclaration-after-statement -fwrapv -Wno-unused-label -Wno-sign-compare -Wno-unused-parameter -Wno-old-style-declaration -Wno-ignored-qualifiers -Wno-clobbered -Wno-missing-field-initializers -Wtype-limits
  OPTIONS = USE_PCRE=1 USE_PCRE_JIT=1 USE_THREAD=1 USE_REGPARM=1 USE_LINUX_TPROXY=1 USE_OPENSSL=1 USE_ZLIB=1 USE_TFO=1 USE_NS=1 USE_SYSTEMD=1

Feature list : +EPOLL -KQUEUE -MY_EPOLL -MY_SPLICE +NETFILTER +PCRE +PCRE_JIT -PCRE2 -PCRE2_JIT +POLL -PRIVATE_CACHE +THREAD -PTHREAD_PSHARED +REGPARM -STATIC_PCRE -STATIC_PCRE2 +TPROXY +LINUX_TPROXY +LINUX_SPLICE +LIBCRYPT +CRYPT_H -VSYSCALL +GETADDRINFO +OPENSSL -LUA +FUTEX +ACCEPT4 -MY_ACCEPT4 +ZLIB -SLZ +CPU_AFFINITY +TFO +NS +DL +RT -DEVICEATLAS -51DEGREES -WURFL +SYSTEMD -OBSOLETE_LINKER +PRCTL +THREAD_DUMP -EVPORTS

Default settings :
  bufsize = 16384, maxrewrite = 1024, maxpollevents = 200

Built with multi-threading support (MAX_THREADS=64, default=2).
Built with OpenSSL version : OpenSSL 1.0.2k-fips  26 Jan 2017
Running on OpenSSL version : OpenSSL 1.0.2k-fips  26 Jan 2017
OpenSSL library supports TLS extensions : yes
OpenSSL library supports SNI : yes
OpenSSL library supports : SSLv3 TLSv1.0 TLSv1.1 TLSv1.2
Built with network namespace support.
Built with transparent proxy support using: IP_TRANSPARENT IPV6_TRANSPARENT IP_FREEBIND
Built with PCRE version : 8.32 2012-11-30
Running on PCRE version : 8.32 2012-11-30
PCRE library supports JIT : yes
Encrypted password support via crypt(3): yes
Built with zlib version : 1.2.7
Running on zlib version : 1.2.7
Compression algorithms supported : identity("identity"), deflate("deflate"), raw-deflate("deflate"), gzip("gzip")

Available polling systems :
      epoll : pref=300,  test result OK
       poll : pref=200,  test result OK
     select : pref=150,  test result OK
Total: 3 (3 usable), will use epoll.

Available multiplexer protocols :
(protocols marked as <default> cannot be specified using 'proto' keyword)
              h2 : mode=HTTP       side=FE|BE     mux=H2
            fcgi : mode=HTTP       side=BE        mux=FCGI
       <default> : mode=HTTP       side=FE|BE     mux=H1
       <default> : mode=TCP        side=FE|BE     mux=PASS

Available services : none

Available filters :
	[SPOE] spoe
	[CACHE] cache
	[FCGI] fcgi-app
	[TRACE] trace
	[COMP] compression

Hope tagging onto the original post is ok – it appears to be the same behaviour, but I can open a new post if that’s preferred.

I’ve dug a bit deeper but I think I need someone a lot smarter than me to figure out what’s going on. At least in my case (and hopefully @xugy can confirm or deny), the problem appears to be related to the server state file.

My config includes:

global
    ...
    server-state-file /tmp/haproxy.state

And then later:

defaults
    ...
    load-server-state-from-file global

Server state is dumped from the socket to this file prior to reload.

Anyway, when I start haproxy on a fresh box with no state file, DNS resolution appears to function normally: haproxy sets backend servers up and down as the SRV record changes. The IPs of downed servers still appear in “show servers state” output, but this is fine since haproxy considers them down for maintenance.

Things get weird after reloading haproxy with the state file present. When haproxy is reloaded this way, servers in the UP state (and maybe others) no longer respond to DNS changes.

I did the crudest thing possible by adding some logging to the snr_resolution_cb function in server.c, and found that it’s returning early for servers whose state was persisted from the state file:

s = objt_server(requester->owner);
if (!s)
  return 1; /* bails out here for servers whose state came from the state file */

If I stop haproxy and blow away the state file, the function moves on, and I can see it reaching the switch (ret) statement where it evaluates the response.

Anyway long story short: something about reloading from the state file short-circuits the DNS resolver.

I didn’t use server-state-file, but I’m still seeing the problem.

Cheers. I’ve opened an issue report for what I’m seeing, since it sounds like we’re encountering different issues.

For the problem you’re encountering, it sounds like haproxy is resolving only at startup, which would occur if you’re init’ing with libc but there’s a problem with the resolvers section. Are you able to paste the full haproxy configuration, or at least an anonymized/redacted version?
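For context, a runtime-resolving setup would look something like this (a sketch only; the resolver name, nameserver address, and SRV name are placeholders, not your actual config):

```haproxy
resolvers mydns
    nameserver dns1 127.0.0.1:53       # placeholder: your DNS endpoint
    accepted_payload_size 8192
    hold valid 10s

backend b_example-app
    # libc resolution happens only once at startup; the resolvers section
    # is what keeps servers in sync with the SRV record at runtime.
    default-server init-addr last,libc,none
    server-template example-app 4 _example-app._tcp.service.consul resolvers mydns check
```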

Thank you wolfsimple. I am not sure whether these two problems share the same root cause at the moment.

I think you’re right, and these two problems do not share the same root cause.

That said, I’ve plumbed pretty deep into how DNS resolvers work in haproxy, so if you’re able to post your config, I still might be able to help out with the problem you’re experiencing.

Hello community, are there any thoughts on this issue?

Tagging @Baptiste