DNS SRV - Long Delay Between Resolving Record and Backend Marked UP/READY

Attempting my first use of the DNS SRV features in a non-production environment in AWS.

In general, my configuration works. What I am trying to diagnose now is a long delay of ~5 minutes when I re-deploy a service. HAProxy picks up the DNS change quickly, but seems to take several minutes before that change is completely applied to the backend, and health checks resume. This delay defeats the purpose of dynamic service discovery.

The log events below show what HAProxy reports after the service has been redeployed.

  1. Initially health checks fail because the container has restarted or moved to another host
  2. At 07:51 the updated SRV record is identified
  3. At 07:56 HAProxy finally marks the backend up

The key issue to diagnose is what causes the 5 minute waiting time.

Also, the DNS records currently have a TTL of 0.

Log Entries:

Jul 09 07:50:59 haproxy-internal-20190703-1136AM-579 haproxy[21084]: Health check for server bcm_test/bcm1 failed, reason: Layer4 connection problem, info: "Connection refused", check duration: 0ms, status: 2/3 UP.
Jul 09 07:51:04 haproxy-internal-20190703-1136AM-579 haproxy[21084]: Health check for server bcm_test/bcm1 succeeded, reason: Layer4 check passed, check duration: 0ms, status: 3/3 UP.

Broadcast message from systemd-journald@haproxy-internal-20190703-1136AM-579 (Tue 2019-07-09 07:51:19 PDT):

haproxy[21084]: backend bcm_test has no server available!

Jul 09 07:51:19 haproxy-internal-20190703-1136AM-579 haproxy[21084]: Server bcm_test/bcm1 is going DOWN for maintenance (DNS NX status). 0 active and 0 backup servers left. 0 sessions active, 0 requeued, 0 remaining in queue.
Jul 09 07:51:19 haproxy-internal-20190703-1136AM-579 haproxy[21084]: backend bcm_test has no server available!
Jul 09 07:51:19 haproxy-internal-20190703-1136AM-579 haproxy[21084]: bcm_test/bcm1 changed its FQDN from (null) to 18571aeca1584999960dff1a17acf787._bcmmb-test._tcp.docker by 'SRV record'
Jul 09 07:56:21 haproxy-internal-20190703-1136AM-579 haproxy[21084]: Server bcm_test/bcm1 ('18571aeca1584999960dff1a17acf787._bcmmb-test._tcp.docker') is UP/READY (resolves again).
Jul 09 07:56:21 haproxy-internal-20190703-1136AM-579 haproxy[21084]: Server bcm_test/bcm1 administratively READY thanks to valid DNS answer.
Jul 09 07:56:24 haproxy-internal-20190703-1136AM-579 haproxy[21084]: Health check for server bcm_test/bcm1 succeeded, reason: Layer4 check passed, check duration: 0ms, status: 3/3 UP.

HAProxy Configuration:

resolvers aws
    nameserver vpc 169.254.169.253:53
    resolve_retries 256
    timeout retry 5s
    timeout resolve 2s
    hold nx 15s
    hold other 15s
    hold refused 15s
    hold timeout 15s
    hold valid 15s
    hold obsolete 15s

frontend bcm_test
    bind *:32001
    timeout client 30m
    timeout client-fin 30s
    default_backend bcm_test

backend bcm_test
    default-server check inter 5s rise 6
    timeout connect 10s
    timeout server 30m
    server-template bcm 1 _bcmmb-test._tcp.docker check resolvers aws resolve-opts allow-dup-ip

Hi,
I wonder if your issue comes from your very high resolve_retries.
Try to set it up to 3 only

I’m having the same issues – resolve_retries is the low default of 3, in my case.

They get marked as “MAINT” for “Resolution”, even though a dig on the same machine immediately results in the correct SRV records.

I saw your other long post. Let’s carry on in it.

Exact same issue for me.
With Haproxy 2.2.1