This is my first attempt at using HAProxy's DNS SRV features, in a non-production environment on AWS.
In general, my configuration works. What I am trying to diagnose now is a delay of about 5 minutes when I redeploy a service. HAProxy picks up the DNS change quickly, but it seems to take several minutes before that change is fully applied to the backend and health checks resume. This delay defeats the purpose of dynamic service discovery.
The log entries below show what HAProxy reports after the service has been redeployed:
- Initially, health checks fail because the container has restarted or moved to another host.
- At 07:51:19, the updated SRV record is picked up and the server's FQDN changes.
- At 07:56:21, HAProxy finally marks the server UP/READY.
The key question is what causes the 5-minute wait between the last two events.
Note that the DNS records currently have a TTL of 0.
Log Entries:
Jul 09 07:50:59 haproxy-internal-20190703-1136AM-579 haproxy[21084]: Health check for server bcm_test/bcm1 failed, reason: Layer4 connection problem, info: "Connection refused", check duration: 0ms, status: 2/3 UP.
Jul 09 07:51:04 haproxy-internal-20190703-1136AM-579 haproxy[21084]: Health check for server bcm_test/bcm1 succeeded, reason: Layer4 check passed, check duration: 0ms, status: 3/3 UP.
Broadcast message from systemd-journald@haproxy-internal-20190703-1136AM-579 (Tue 2019-07-09 07:51:19 PDT):
haproxy[21084]: backend bcm_test has no server available!
Jul 09 07:51:19 haproxy-internal-20190703-1136AM-579 haproxy[21084]: Server bcm_test/bcm1 is going DOWN for maintenance (DNS NX status). 0 active and 0 backup servers left. 0 sessions active, 0 requeued, 0 remaining in queue.
Jul 09 07:51:19 haproxy-internal-20190703-1136AM-579 haproxy[21084]: backend bcm_test has no server available!
Jul 09 07:51:19 haproxy-internal-20190703-1136AM-579 haproxy[21084]: bcm_test/bcm1 changed its FQDN from (null) to 18571aeca1584999960dff1a17acf787._bcmmb-test._tcp.docker by 'SRV record'
Jul 09 07:56:21 haproxy-internal-20190703-1136AM-579 haproxy[21084]: Server bcm_test/bcm1 ('18571aeca1584999960dff1a17acf787._bcmmb-test._tcp.docker') is UP/READY (resolves again).
Jul 09 07:56:21 haproxy-internal-20190703-1136AM-579 haproxy[21084]: Server bcm_test/bcm1 administratively READY thanks to valid DNS answer.
Jul 09 07:56:24 haproxy-internal-20190703-1136AM-579 haproxy[21084]: Health check for server bcm_test/bcm1 succeeded, reason: Layer4 check passed, check duration: 0ms, status: 3/3 UP.
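To put an exact number on the gap, here is a minimal Python sketch that computes the time between the SRV update and the server becoming READY, using the two timestamps from the log entries above. The helper name `gap_seconds` is only for illustration, the year is taken from the journald broadcast line, and the month parsing assumes an English locale.

```python
from datetime import datetime

# Timestamps copied from the log entries above (the syslog prefix omits the year).
SRV_UPDATE = "Jul 09 07:51:19"    # "changed its FQDN ... by 'SRV record'"
SERVER_READY = "Jul 09 07:56:21"  # "is UP/READY (resolves again)"


def gap_seconds(start: str, end: str, year: int = 2019) -> int:
    """Return the number of seconds between two syslog-style timestamps."""
    fmt = "%Y %b %d %H:%M:%S"
    t0 = datetime.strptime(f"{year} {start}", fmt)
    t1 = datetime.strptime(f"{year} {end}", fmt)
    return int((t1 - t0).total_seconds())


delay = gap_seconds(SRV_UPDATE, SERVER_READY)
print(f"{delay} s ({delay // 60} min {delay % 60} s)")  # 302 s (5 min 2 s)
```

So the wait is 302 seconds, far longer than any of the 15s hold values in the resolvers section below.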
HAProxy Configuration:
resolvers aws
    nameserver vpc 169.254.169.253:53
    resolve_retries 256
    timeout retry 5s
    timeout resolve 2s
    hold nx 15s
    hold other 15s
    hold refused 15s
    hold timeout 15s
    hold valid 15s
    hold obsolete 15s

frontend bcm_test
    bind *:32001
    timeout client 30m
    timeout client-fin 30s
    default_backend bcm_test

backend bcm_test
    default-server check inter 5s rise 6
    timeout connect 10s
    timeout server 30m
    server-template bcm 1 _bcmmb-test._tcp.docker check resolvers aws resolve-opts allow-dup-ip