Server-template stops taking updates from DNS

Hoping you can help. I’m seeing an issue with DNS service discovery on 1.8.14 and also 1.8.16 where HAProxy no longer picks up changes from DNS. I have a server-template with a single slot and point that at DNS. Initially things work, but as reloads using the server-state-file happen and DNS gets updated with new ports, the backend randomly gets stuck on the previous, no-longer-existing host/port combination. We have multiple servers configured the same way and they randomly get stuck like this.

For example, a DNS entry for _testapp_http._tcp.marathon.mesos would point to localhost:24379 at one point in time, then that service would go away and be re-created on localhost:13903, and DNS would be updated. Most of the time HAProxy picks up the change, but occasionally it will stick forever on the old localhost:24379.

A tcpdump of DNS shows the correct new entry being returned:
19:36:37.401865 IP localhost.39124 > localhost.domain: 33907+ [1au] SRV? _testapp_http._tcp.marathon.mesos. (63)
19:36:37.402016 IP localhost.domain > localhost.39124: 33907* 1/0/2 SRV localslave.marathon.mesos.:13903 20 25344 (124)

The ‘show stat’ CLI shows the old port:
health_testapp,testapp_health1,0,0,0,0,64,0,0,0,0,0,0,0,0,DOWN,100,1,0,0,1,138,138,1,8,1,0,2,0,0,L4CON,0,0,0,0,0,0,0,0,0,-1,Connection refused,0,0,0,0,Layer4 connection problem,3,2,0,127.0.0.1:24379,http,
health_testapp,BACKEND,0,0,0,1,200,70,4760,14840,0,0,70,0,0,0,DOWN,0,0,0,1,138,138,1,8,0,0,1,1,1,0,0,0,0,70,0,70,0,0,0,0,0,0,-1,0,0,0,0,http,roundrobin,

The server-state-file shows the old port:
8 health_testapp 1 testapp_health1 127.0.0.1 0 0 100 1 201 8 2 0 6 0 0 0 localslave.marathon.mesos 24379 _testapp_http._tcp.marathon.mesos
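For anyone wanting to check this across many backends, here is a minimal Python sketch that pulls the interesting fields out of a state-file line like the one above. The field positions (backend name, server name, address, then FQDN, port and SRV record as fields 18 to 20) are an assumption based on the sample lines in this thread and the version 1 state-file layout as I understand it; `ServerState` and `parse_state_line` are made-up names for illustration.

```python
from typing import NamedTuple


class ServerState(NamedTuple):
    backend: str
    server: str
    addr: str
    fqdn: str
    port: int
    srv_record: str


def parse_state_line(line: str) -> ServerState:
    """Parse one server line of a state file (assumed 20-field layout)."""
    fields = line.split()
    if len(fields) < 20:
        raise ValueError(f"expected at least 20 fields, got {len(fields)}")
    return ServerState(
        backend=fields[1],
        server=fields[3],
        addr=fields[4],
        fqdn=fields[17],        # assumed position of srv_fqdn
        port=int(fields[18]),   # assumed position of srv_port
        srv_record=fields[19],  # assumed position of srvrecord
    )


# Sample line copied from the state file quoted in this post.
SAMPLE = (
    "8 health_testapp 1 testapp_health1 127.0.0.1 0 0 100 1 201 8 2 0 6 0 0 0 "
    "localslave.marathon.mesos 24379 _testapp_http._tcp.marathon.mesos"
)

state = parse_state_line(SAMPLE)
print(state.backend, state.server, state.port, state.srv_record)
```

The port it reports (24379) is the stale one that keeps being reloaded.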

server-template config is:
server-template testapp_health 1 _testapp_http._tcp.marathon.mesos resolvers localdns resolve-prefer ipv4 maxconn 64 rise 3 fall 2 check inter 10000

I tried using ‘resolve-opts allow-dup-ip’ but it didn’t help. It seems like that slot is permanently stuck for some reason? Some race between the server-state-file reload and DNS updates?

Any workaround or fix would be appreciated.

Thanks,
Steve

Any ideas?

The host always remains localhost and only the port changes, right? Can you provide the configuration and the output of haproxy -vv?

Can you also provide the output of show servers state as well as show stat resolvers?
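If it is easier to script, the sketch below shows one way to collect that output from the stats socket with Python. The socket path is the one from a typical `stats socket` line and is only a default here; `run_cli` is a hypothetical helper name. I am also assuming that the resolver statistics command is spelled `show resolvers` on 1.8 (running `help` on the socket will confirm what your build accepts).

```python
import socket


def run_cli(command: str, path: str = "/var/run/haproxy.sock",
            timeout: float = 5.0) -> str:
    """Send one command to the HAProxy stats socket and return the reply.

    In non-interactive mode HAProxy answers a single command and then
    closes the connection, so we simply read until EOF.
    """
    with socket.socket(socket.AF_UNIX, socket.SOCK_STREAM) as sock:
        sock.settimeout(timeout)
        sock.connect(path)
        sock.sendall(command.encode("ascii") + b"\n")
        chunks = []
        while True:
            data = sock.recv(4096)
            if not data:
                break
            chunks.append(data)
    return b"".join(chunks).decode("utf-8", "replace")


def collect(path: str = "/var/run/haproxy.sock") -> dict:
    """Gather the outputs requested above, keyed by command."""
    commands = ("show servers state", "show resolvers")  # second name is an assumption
    return {cmd: run_cli(cmd, path=path) for cmd in commands}
```

Calling `collect()` on the HAProxy host (as a user allowed to open the socket) returns a dict whose values can be pasted into a reply.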

@Baptiste any idea how we could troubleshoot this?

(For some reason I can’t log in to Discourse, so I’m trying to answer by email.)

Another important piece of information would be the HAProxy “administrative” logs (not the traffic ones), because HAProxy should generate a log line each time it can’t find a good match between the state file and the current configuration.

There must be a moment in time where a mismatch happens (for some reason), and I want to understand how and why.

Baptiste

Thanks for the replies. I’m trying to reproduce the issue again to get the admin logs, but here are the haproxy -vv output and a sample config. You’re correct: in this case DNS will always return a localhost record, though the port may change. We have a couple hundred backends, and at any point backends can be added or removed (which triggers a config reload via the server-state-file) and/or DNS can return different ports. I think I have seen those mismatch lines in the log files and I’m trying to get some examples.

HA-Proxy version 1.8.14-52e4d43 2018/09/20
    Copyright 2000-2018 Willy Tarreau <willy@haproxy.org>

    Build options :
      TARGET  = linux2628
      CPU     = i686
      CC      = gcc
      CFLAGS  = -m64 -march=opteron -mno-3dnow -ggdb -O2 -Wall -I=/workspace/common/include -L=/workspace/common/lib
      OPTIONS = USE_ZLIB=1 USE_OPENSSL=1 USE_STATIC_PCRE=1

    Default settings :
      maxconn = 2000, bufsize = 16384, maxrewrite = 1024, maxpollevents = 200

    Built with OpenSSL version : OpenSSL 1.0.2l  25 May 2017
    Running on OpenSSL version : OpenSSL 1.0.2p-fips  14 Aug 2018 (VERSIONS DIFFER!)
    OpenSSL library supports TLS extensions : yes
    OpenSSL library supports SNI : yes
    OpenSSL library supports : TLSv1.0 TLSv1.1 TLSv1.2
    Built with transparent proxy support using: IP_TRANSPARENT IPV6_TRANSPARENT IP_FREEBIND
    Encrypted password support via crypt(3): yes
    Built with multi-threading support.
    Built with PCRE version : 8.38 2015-11-23
    Running on PCRE version : 8.38 2015-11-23
    PCRE library supports JIT : no (USE_PCRE_JIT not set)
    Built with zlib version : 1.2.8
    Running on zlib version : 1.2.8
    Compression algorithms supported : identity("identity"), deflate("deflate"), raw-deflate("deflate"), gzip("gzip")
    Built with network namespace support.

    Available polling systems :
          epoll : pref=300,  test result OK
           poll : pref=200,  test result OK
         select : pref=150,  test result OK
    Total: 3 (3 usable), will use epoll.

    Available filters :
    	[SPOE] spoe
    	[COMP] compression
    	[TRACE] trace

Example config

global
    maxconn 5000
    nbproc 1
    log 127.0.0.1 local1
    log 127.0.0.1 local2 notice
    pidfile /usr/local/test/etc/test_haproxy.pid
    server-state-file /usr/local/test/test-haproxy/conf/server-state
    stats socket /var/run/haproxy.sock mode 600 level admin
    stats timeout 2m

defaults
    log global
    mode http

    load-server-state-from-file global

    log-format %ci:%cp\ [%t]\ %ft\ %b/%s\ %Tq/%Tw/%Tc/%Tr/%Tt\ %ST\ %B\ %CC\ %CS\ %tsc\ %ac/%fc/%bc/%sc/%rc\ %sq/%bq\ %hr\ %hs\ L/%ID\ [%[src,map(/usr/local/test/etc/name-mapping.lst)]\ %[src,map(/usr/local/test/etc/id-mapping.lst)]]\ %{+Q}r

    retries 2
    timeout connect 4s
    timeout client 10s
    timeout server 180s
    timeout check 8s
    timeout http-keep-alive 91s
    timeout http-request 5s

    option clitcpka
    option abortonclose
    option forwardfor

    balance roundrobin

    option forwardfor except 127.0.0.1
    option log-health-checks

    default-server init-addr last,libc,none

    monitor-uri /this-is-health-check/status

resolvers localdns
    accepted_payload_size 8192
    nameserver dns1 localhost:53
    resolve_retries       3
    timeout resolve       3s
    timeout retry         3s
    hold other           30s
    hold refused         30s
    hold nx              30s
    hold timeout         30s
    hold valid           10s
    hold obsolete        30s

frontend http-in
    bind *:80
    bind *:443 ssl crt /usr/local/test/test-haproxy/conf/secrets/proxysslcert.pem

    acl acl_test-java path_beg /test-java/
    use_backend backend_test-java if acl_test-java

backend backend_test-java
    balance roundrobin
    http-request del-header Proxy
    option http-server-close
    option httpchk GET /test-java/system/ping HTTP/1.1
    server-template test-java_health 1 _test-java_http._tcp.marathon.mesos resolvers localdns resolve-prefer ipv4 maxconn 64 rise 3 fall 2 check inter 10000

In the state my test system is in now, I can restart HAProxy and it remains misconfigured. I don’t see any of the mismatch lines, since it looks happy to apply the old port to the backend it finds. It seems that once it reloads from a server-state file that contains the incorrect port, it doesn’t take any updates. Here is what is in the admin log, show servers state, show stat resolvers, and a tcpdump of DNS. I guess since I can reproduce it after a restart I could add more debug logging to try to figure out why it ignores the result from DNS?

Feb 11 20:36:55 192.168.64.163 haproxy[29451]:  Proxy health_system started.
Feb 11 20:36:56 192.168.64.163 haproxy[29492]:  Stopping backend health_system in 0 ms.
Feb 11 20:36:56 192.168.64.163 haproxy[29492]:  Proxy health_system stopped (FE: 0 conns, BE: 9 conns).
Feb 11 20:36:56 192.168.64.163 haproxy[29888]:  Server health_system/system_health1 is DOWN, changed from server-state after a reload. 0 active and 0 backup servers left. 0 sessions active, 0 requeued, 0 remaining in queue.
Feb 11 20:36:56 192.168.64.163 haproxy[29888]:  backend health_system has no server available!

show servers state

10 health_system 1 system_health1 127.0.0.1 0 0 100 1 12 8 2 0 6 0 0 0 localslave.marathon.mesos 22630 _system_http._tcp.marathon.mesos

show stat resolvers

health_system,system_health1,0,0,0,0,64,0,0,0,,0,,0,0,0,0,DOWN,100,1,0,0,1,11,11,,1,10,1,,0,,2,0,,0,L4CON,,0,0,0,0,0,0,0,,,,,0,0,,,,,-1,Connection refused,,0,0,0,0,,,,Layer4 connection problem,,3,2,0,,,,127.0.0.1:22630,,http,,,,,,,,
health_system,BACKEND,0,0,0,1,200,5,340,1060,0,0,,5,0,0,0,DOWN,0,0,0,,1,11,11,,1,10,0,,0,,1,1,,2,,,,0,0,0,0,5,0,,,,5,0,0,0,0,0,0,-1,,,0,0,0,0,,,,,,,,,,,,,,http,roundrobin,,,,,,,

tcpdump of DNS

20:41:22.786430 IP localhost.33285 > localhost.domain: 54776+ [1au] SRV? _system_http._tcp.marathon.mesos. (63)
20:41:22.786533 IP localhost.domain > localhost.33285: 54776* 1/0/2 SRV localslave.marathon.mesos.:12523 20 25344 (124)
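To make the mismatch explicit, here is a small Python sketch that compares the port in the `show servers state` line with the port in the SRV answer captured by tcpdump. It is only a diagnostic aid: the regular expression is written against the tcpdump output format shown in this thread, the state port is assumed to be the 19th field of the state line, and `srv_port_from_tcpdump` and `is_stale` are hypothetical names.

```python
import re
from typing import Optional

# Matches the answer part of a tcpdump line, e.g.
# "... SRV localslave.marathon.mesos.:12523 20 25344 (124)".
# The query line ("SRV? ...") does not match because of the "?".
SRV_ANSWER = re.compile(r"SRV\s+(?P<target>\S+?)\.?:(?P<port>\d+)\s")


def srv_port_from_tcpdump(line: str) -> Optional[int]:
    """Return the port of the SRV answer in a tcpdump line, or None."""
    match = SRV_ANSWER.search(line)
    return int(match.group("port")) if match else None


def is_stale(state_line: str, dns_line: str) -> bool:
    """True when DNS advertises a port other than the one in the state line."""
    state_port = int(state_line.split()[18])  # assumed position of srv_port
    dns_port = srv_port_from_tcpdump(dns_line)
    return dns_port is not None and dns_port != state_port


# Both samples are copied from the output quoted in this post.
STATE_LINE = (
    "10 health_system 1 system_health1 127.0.0.1 0 0 100 1 12 8 2 0 6 0 0 0 "
    "localslave.marathon.mesos 22630 _system_http._tcp.marathon.mesos"
)
DNS_LINE = (
    "20:41:22.786533 IP localhost.domain > localhost.33285: 54776* 1/0/2 "
    "SRV localslave.marathon.mesos.:12523 20 25344 (124)"
)

print(is_stale(STATE_LINE, DNS_LINE))  # True: state has 22630, DNS says 12523
```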

Did you ever resolve this issue? I’m diagnosing similar issues on HAProxy 2.0.5 with server-template and SRV records.

In my experience, server-template is processed only at config (re)load time, as a means to create the backend servers, and I haven’t seen those servers being updated dynamically.

Disclaimer: not super experienced with HAProxy, so please take this with a pinch of salt

@aitorpazos this may be because you have not enabled “resolvers” to have DNS resolution at runtime