HAProxy community

Server-template stops taking updates from DNS


#1

Hoping you can help. I’m seeing an issue with DNS service discovery on 1.8.14 and also 1.8.16 where HAProxy stops picking up changes from DNS. I have a server-template with a single slot pointed at a DNS SRV record. Initially things work, but as reloads using the server-state-file happen and DNS gets updated with new ports, a backend occasionally gets stuck on the previous host/port combination, which no longer exists. We have multiple servers configured the same way and they get stuck like this at random.

For example, a DNS entry for _testapp_http._tcp.marathon.mesos would point to localhost:24379 at one point in time; then that service would go away, be re-created on localhost:13903, and DNS would be updated. Most of the time HAProxy picks up the change, but occasionally it sticks forever on the old localhost:24379.

A tcpdump of DNS shows the correct new entry being returned:
19:36:37.401865 IP localhost.39124 > localhost.domain: 33907+ [1au] SRV? _testapp_http._tcp.marathon.mesos. (63)
19:36:37.402016 IP localhost.domain > localhost.39124: 33907* 1/0/2 SRV localslave.marathon.mesos.:13903 20 25344 (124)
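
For anyone who wants to script this check, here is a minimal sketch that pulls the SRV target and port out of a tcpdump answer line like the one above. The function name `parse_srv_answer` and the regex are my own, not part of tcpdump or HAProxy, and the sketch assumes the answer is printed as `SRV <target>.:<port>` followed by two more numbers, as in this capture; the trailing numbers are not interpreted.

```python
import re

# Assumption: tcpdump prints an SRV answer as "SRV <target>.:<port> <n> <n>",
# as seen in the capture above. Query lines contain "SRV?" and do not match.
SRV_ANSWER = re.compile(r"SRV (?P<target>\S+?)\.?:(?P<port>\d+)")


def parse_srv_answer(line):
    """Return (target, port) from a tcpdump SRV answer line, or None."""
    match = SRV_ANSWER.search(line)
    if match is None:
        return None
    return match.group("target"), int(match.group("port"))


if __name__ == "__main__":
    answer = (
        "19:36:37.402016 IP localhost.domain > localhost.39124: 33907* 1/0/2 "
        "SRV localslave.marathon.mesos.:13903 20 25344 (124)"
    )
    print(parse_srv_answer(answer))
```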

The ‘show stat’ CLI shows the old port:
health_testapp,testapp_health1,0,0,0,0,64,0,0,0,0,0,0,0,0,DOWN,100,1,0,0,1,138,138,1,8,1,0,2,0,0,L4CON,0,0,0,0,0,0,0,0,0,-1,Connection refused,0,0,0,0,Layer4 connection problem,3,2,0,127.0.0.1:24379,http,
health_testapp,BACKEND,0,0,0,1,200,70,4760,14840,0,0,70,0,0,0,DOWN,0,0,0,1,138,138,1,8,0,0,1,1,1,0,0,0,0,70,0,70,0,0,0,0,0,0,-1,0,0,0,0,http,roundrobin,

The server-state-file shows the old port:
8 health_testapp 1 testapp_health1 127.0.0.1 0 0 100 1 201 8 2 0 6 0 0 0 localslave.marathon.mesos 24379 _testapp_http._tcp.marathon.mesos
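
To make the state-file line easier to read, here is a rough sketch that labels its 20 whitespace-separated columns. The column names are my reading of the version 1 format described for `show servers state` in the management guide; treat them as an assumption and check them against the docs for your version. `parse_state_line` is a hypothetical helper, not something HAProxy ships.

```python
# Assumed column names for server-state format version 1 (HAProxy 1.8),
# taken from my reading of the management guide; verify for your version.
STATE_FIELDS = [
    "be_id", "be_name", "srv_id", "srv_name", "srv_addr",
    "srv_op_state", "srv_admin_state", "srv_uweight", "srv_iweight",
    "srv_time_since_last_change", "srv_check_status", "srv_check_result",
    "srv_check_health", "srv_check_state", "srv_agent_state",
    "bk_f_forced_id", "srv_f_forced_id", "srv_fqdn", "srv_port", "srvrecord",
]


def parse_state_line(line):
    """Map one server-state line to a dict of field name -> string value."""
    parts = line.split()
    if len(parts) != len(STATE_FIELDS):
        raise ValueError(
            "expected %d fields, got %d" % (len(STATE_FIELDS), len(parts))
        )
    return dict(zip(STATE_FIELDS, parts))


if __name__ == "__main__":
    line = (
        "8 health_testapp 1 testapp_health1 127.0.0.1 0 0 100 1 201 8 2 0 6 "
        "0 0 0 localslave.marathon.mesos 24379 _testapp_http._tcp.marathon.mesos"
    )
    record = parse_state_line(line)
    print(record["srv_name"], record["srv_fqdn"], record["srv_port"], record["srvrecord"])
```

Read this way, the last three columns are the interesting ones here: the resolved FQDN, the port (still the old 24379), and the SRV record the slot was created from.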

server-template config is:
server-template testapp_health 1 _testapp_http._tcp.marathon.mesos resolvers localdns resolve-prefer ipv4 maxconn 64 rise 3 fall 2 check inter 10000

I tried using ‘resolve-opts allow-dup-ip’ but it didn’t help. It seems like that slot is permanently stuck for some reason? Some race between the server-state-file reload and DNS updates?

Any workaround or fix would be appreciated.

Thanks,
Steve


#2

Any ideas?


#3

The host always remains localhost and only the port changes, right? Can you provide the configuration and the output of haproxy -vv?

Can you also provide the output of show servers state as well as show stat resolvers?

@Baptiste any idea how we could troubleshoot this?


#4

(For some reason, I can’t log in to Discourse, so I’m trying to answer by email.)

Another important piece of information would be the HAProxy “administrative” logs (not the traffic ones), because HAProxy should generate a log line each time it can’t find a good match between the state file and the current configuration.

There must be a moment in time where a mismatch happens (for some reason), and I want to understand how and why.

Baptiste


#5

Thanks for the replies. I’m trying to reproduce the issue again to get the admin logs, but here are the haproxy -vv output and a sample config. You’re correct: in this case DNS will always return a localhost record, though the port may change. We have a couple hundred backends, and at any point backends can be added or removed (which triggers a config reload using the server-state-file) and/or DNS can return different ports. I think I have seen those mismatch lines in the log files, and I’m trying to get some examples.

HA-Proxy version 1.8.14-52e4d43 2018/09/20
    Copyright 2000-2018 Willy Tarreau <willy@haproxy.org>

    Build options :
      TARGET  = linux2628
      CPU     = i686
      CC      = gcc
      CFLAGS  = -m64 -march=opteron -mno-3dnow -ggdb -O2 -Wall -I=/workspace/common/include -L=/workspace/common/lib
      OPTIONS = USE_ZLIB=1 USE_OPENSSL=1 USE_STATIC_PCRE=1

    Default settings :
      maxconn = 2000, bufsize = 16384, maxrewrite = 1024, maxpollevents = 200

    Built with OpenSSL version : OpenSSL 1.0.2l  25 May 2017
    Running on OpenSSL version : OpenSSL 1.0.2p-fips  14 Aug 2018 (VERSIONS DIFFER!)
    OpenSSL library supports TLS extensions : yes
    OpenSSL library supports SNI : yes
    OpenSSL library supports : TLSv1.0 TLSv1.1 TLSv1.2
    Built with transparent proxy support using: IP_TRANSPARENT IPV6_TRANSPARENT IP_FREEBIND
    Encrypted password support via crypt(3): yes
    Built with multi-threading support.
    Built with PCRE version : 8.38 2015-11-23
    Running on PCRE version : 8.38 2015-11-23
    PCRE library supports JIT : no (USE_PCRE_JIT not set)
    Built with zlib version : 1.2.8
    Running on zlib version : 1.2.8
    Compression algorithms supported : identity("identity"), deflate("deflate"), raw-deflate("deflate"), gzip("gzip")
    Built with network namespace support.

    Available polling systems :
          epoll : pref=300,  test result OK
           poll : pref=200,  test result OK
         select : pref=150,  test result OK
    Total: 3 (3 usable), will use epoll.

    Available filters :
    	[SPOE] spoe
    	[COMP] compression
    	[TRACE] trace

Example config

global
    maxconn 5000
    nbproc 1
    log 127.0.0.1 local1
    log 127.0.0.1 local2 notice
    pidfile /usr/local/test/etc/test_haproxy.pid
    server-state-file /usr/local/test/test-haproxy/conf/server-state
    stats socket /var/run/haproxy.sock mode 600 level admin
    stats timeout 2m

defaults
    log global
    mode http

    load-server-state-from-file global

    log-format %ci:%cp\ [%t]\ %ft\ %b/%s\ %Tq/%Tw/%Tc/%Tr/%Tt\ %ST\ %B\ %CC\ %CS\ %tsc\ %ac/%fc/%bc/%sc/%rc\ %sq/%bq\ %hr\ %hs\ L/%ID\ [%[src,map(/usr/local/test/etc/name-mapping.lst)]\ %[src,map(/usr/local/test/etc/id-mapping.lst)]]\ %{+Q}r

    retries 2
    timeout connect 4s
    timeout client 10s
    timeout server 180s
    timeout check 8s
    timeout http-keep-alive 91s
    timeout http-request 5s

    option clitcpka
    option abortonclose
    option forwardfor

    balance roundrobin

    option forwardfor except 127.0.0.1
    option log-health-checks

    default-server init-addr last,libc,none

    monitor-uri /this-is-health-check/status

resolvers localdns
    accepted_payload_size 8192
    nameserver dns1 localhost:53
    resolve_retries       3
    timeout resolve       3s
    timeout retry         3s
    hold other           30s
    hold refused         30s
    hold nx              30s
    hold timeout         30s
    hold valid           10s
    hold obsolete        30s

frontend http-in
	bind *:80
	bind *:443 ssl crt /usr/local/test/test-haproxy/conf/secrets/proxysslcert.pem

	acl acl_test-java path_beg /test-java/
	use_backend backend_test-java if acl_test-java	
	
	
backend backend_test-java
	balance roundrobin
	http-request del-header Proxy
	option http-server-close
	option httpchk GET /test-java/system/ping HTTP/1.1
	server-template test-java_health 1 _test-java_http._tcp.marathon.mesos resolvers localdns resolve-prefer ipv4 maxconn 64 rise 3 fall 2 check inter 10000

#6

In the state my test system is in now, I can restart HAProxy and it remains misconfigured. I don’t see any of the mismatch lines, since it looks happy to apply the old port to the backend it finds. It seems that once it reloads from a server-state file containing the incorrect port, it doesn’t take any updates. Here is what is in the admin log, along with show servers state, show stat resolvers, and a tcpdump of DNS. Since I can reproduce it after a restart, I guess I could add more debug logging to try to figure out why it ignores the result from DNS?

Feb 11 20:36:55 192.168.64.163 haproxy[29451]:  Proxy health_system started.
Feb 11 20:36:56 192.168.64.163 haproxy[29492]:  Stopping backend health_system in 0 ms.
Feb 11 20:36:56 192.168.64.163 haproxy[29492]:  Proxy health_system stopped (FE: 0 conns, BE: 9 conns).
Feb 11 20:36:56 192.168.64.163 haproxy[29888]:  Server health_system/system_health1 is DOWN, changed from server-state after a reload. 0 active and 0 backup servers left. 0 sessions active, 0 requeued, 0 remaining in queue.
Feb 11 20:36:56 192.168.64.163 haproxy[29888]:  backend health_system has no server available!

show servers state

10 health_system 1 system_health1 127.0.0.1 0 0 100 1 12 8 2 0 6 0 0 0 localslave.marathon.mesos 22630 _system_http._tcp.marathon.mesos

show stat resolvers

health_system,system_health1,0,0,0,0,64,0,0,0,,0,,0,0,0,0,DOWN,100,1,0,0,1,11,11,,1,10,1,,0,,2,0,,0,L4CON,,0,0,0,0,0,0,0,,,,,0,0,,,,,-1,Connection refused,,0,0,0,0,,,,Layer4 connection problem,,3,2,0,,,,127.0.0.1:22630,,http,,,,,,,,
health_system,BACKEND,0,0,0,1,200,5,340,1060,0,0,,5,0,0,0,DOWN,0,0,0,,1,11,11,,1,10,0,,0,,1,1,,2,,,,0,0,0,0,5,0,,,,5,0,0,0,0,0,0,-1,,,0,0,0,0,,,,,,,,,,,,,,http,roundrobin,,,,,,,

tcpdump of DNS

20:41:22.786430 IP localhost.33285 > localhost.domain: 54776+ [1au] SRV? _system_http._tcp.marathon.mesos. (63)
20:41:22.786533 IP localhost.domain > localhost.33285: 54776* 1/0/2 SRV localslave.marathon.mesos.:12523 20 25344 (124)
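
The mismatch in the output above (HAProxy still on port 22630 while DNS answers 12523) could be detected automatically with something like the following. This is only a sketch for spotting a stuck slot, not a fix. The helper names (`stat_port`, `dns_port`, `slot_is_stale`) are made up for illustration, and it assumes the stats CSV row contains exactly one `ip:port` field and that tcpdump prints the SRV answer as `SRV <target>.:<port>`, as in the pasted output.

```python
import re

# Assumption: the server's address appears in the CSV row as a single
# "a.b.c.d:port" field, as in the pasted stats output.
ADDR_FIELD = re.compile(r"^\d{1,3}(?:\.\d{1,3}){3}:(\d+)$")
# Assumption: tcpdump prints SRV answers as "SRV <target>.:<port> ...".
SRV_PORT = re.compile(r"SRV \S+:(\d+)")


def stat_port(csv_row):
    """Return the port from the first ip:port field of a stats CSV row."""
    for field in csv_row.split(","):
        match = ADDR_FIELD.match(field)
        if match:
            return int(match.group(1))
    return None


def dns_port(tcpdump_line):
    """Return the port from a tcpdump SRV answer line, or None."""
    match = SRV_PORT.search(tcpdump_line)
    return int(match.group(1)) if match else None


def slot_is_stale(csv_row, tcpdump_line):
    """True when both ports are known and HAProxy's differs from DNS."""
    have, want = stat_port(csv_row), dns_port(tcpdump_line)
    return have is not None and want is not None and have != want


if __name__ == "__main__":
    row = (
        "health_system,system_health1,0,0,0,0,64,0,0,0,,0,,0,0,0,0,DOWN,100,"
        "1,0,0,1,11,11,,1,10,1,,0,,2,0,,0,L4CON,,0,0,0,0,0,0,0,,,,,0,0,,,,,"
        "-1,Connection refused,,0,0,0,0,,,,Layer4 connection problem,,3,2,0,"
        ",,,127.0.0.1:22630,,http,,,,,,,,"
    )
    answer = (
        "20:41:22.786533 IP localhost.domain > localhost.33285: 54776* 1/0/2 "
        "SRV localslave.marathon.mesos.:12523 20 25344 (124)"
    )
    print(stat_port(row), dns_port(answer), slot_is_stale(row, answer))
```

Comparing only the port is enough here because the host is always localhost in this setup; a more general check would compare the resolved address as well.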