DNS SRV - Long Delay Between Resolving Record and Backend Marked UP/READY

fhda-mrapczynski · July 9, 2019, 3:15pm

Attempting my first use of the DNS SRV features in a non-production environment in AWS.

In general, my configuration works. What I am trying to diagnose now is a delay of about 5 minutes when I redeploy a service. HAProxy picks up the DNS change quickly, but it seems to take several minutes before the change is fully applied to the backend and health checks resume. This delay defeats the purpose of dynamic service discovery.

The log events below show what HAProxy reports after the service has been redeployed.

Initially, health checks fail because the container has restarted or moved to another host.
At 07:51:19, the updated SRV record is identified.
At 07:56:21, HAProxy finally marks the backend up.

The key issue to diagnose is what causes the 5-minute wait.

Also, the DNS records currently have a TTL of 0.

Log Entries:

Jul 09 07:50:59 haproxy-internal-20190703-1136AM-579 haproxy[21084]: Health check for server bcm_test/bcm1 failed, reason: Layer4 connection problem, info: "Connection refused", check duration: 0ms, status: 2/3 UP.
Jul 09 07:51:04 haproxy-internal-20190703-1136AM-579 haproxy[21084]: Health check for server bcm_test/bcm1 succeeded, reason: Layer4 check passed, check duration: 0ms, status: 3/3 UP.

Broadcast message from systemd-journald@haproxy-internal-20190703-1136AM-579 (Tue 2019-07-09 07:51:19 PDT):

haproxy[21084]: backend bcm_test has no server available!

Jul 09 07:51:19 haproxy-internal-20190703-1136AM-579 haproxy[21084]: Server bcm_test/bcm1 is going DOWN for maintenance (DNS NX status). 0 active and 0 backup servers left. 0 sessions active, 0 requeued, 0 remaining in queue.
Jul 09 07:51:19 haproxy-internal-20190703-1136AM-579 haproxy[21084]: backend bcm_test has no server available!
Jul 09 07:51:19 haproxy-internal-20190703-1136AM-579 haproxy[21084]: bcm_test/bcm1 changed its FQDN from (null) to 18571aeca1584999960dff1a17acf787._bcmmb-test._tcp.docker by 'SRV record'
Jul 09 07:56:21 haproxy-internal-20190703-1136AM-579 haproxy[21084]: Server bcm_test/bcm1 ('18571aeca1584999960dff1a17acf787._bcmmb-test._tcp.docker') is UP/READY (resolves again).
Jul 09 07:56:21 haproxy-internal-20190703-1136AM-579 haproxy[21084]: Server bcm_test/bcm1 administratively READY thanks to valid DNS answer.
Jul 09 07:56:24 haproxy-internal-20190703-1136AM-579 haproxy[21084]: Health check for server bcm_test/bcm1 succeeded, reason: Layer4 check passed, check duration: 0ms, status: 3/3 UP.

HAProxy Configuration:

resolvers aws
    nameserver vpc 169.254.169.253:53
    resolve_retries 256
    timeout retry 5s
    timeout resolve 2s
    hold nx 15s
    hold other 15s
    hold refused 15s
    hold timeout 15s
    hold valid 15s
    hold obsolete 15s

frontend bcm_test
    bind *:32001
    timeout client 30m
    timeout client-fin 30s
    default_backend bcm_test

backend bcm_test
    default-server check inter 5s rise 6
    timeout connect 10s
    timeout server 30m
    server-template bcm 1 _bcmmb-test._tcp.docker check resolvers aws resolve-opts allow-dup-ip

Baptiste · July 11, 2019, 6:46am

Hi,
I wonder if your issue comes from your very high resolve_retries.
Try setting it to just 3.
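
A minimal sketch of that suggestion, assuming the resolvers section from the original post is otherwise left unchanged and only resolve_retries is lowered. Whether this actually shortens the delay is not confirmed anywhere in this thread.

```haproxy
resolvers aws
    nameserver vpc 169.254.169.253:53
    # Lowered from 256 to 3, as suggested above; all other values are
    # copied unchanged from the original post.
    resolve_retries 3
    timeout retry 5s
    timeout resolve 2s
    hold nx 15s
    hold other 15s
    hold refused 15s
    hold timeout 15s
    hold valid 15s
    hold obsolete 15s
```

After changing the value, the file can be syntax-checked with `haproxy -c -f <config file>` before reloading.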

eedwards-sk · September 20, 2019, 3:40pm

I’m having the same issues – resolve_retries is the low default of 3, in my case.

They get marked as “MAINT” for “Resolution”, even though a dig on the same machine immediately results in the correct SRV records.

Baptiste · September 24, 2019, 12:52pm

I saw your other long post. Let’s carry on in it.

seb176 · October 5, 2020, 3:57pm

Exact same issue for me, with HAProxy 2.2.1.
