HAProxy 2.0.5 often fails to quickly update SRV records

eedwards-sk · September 20, 2019, 4:23pm

I’m attempting to use HAProxy Resolvers along with SRV Records and server-template to allow services on dynamic ports to register with HAProxy.

I’m using AWS Service Discovery (with Route53, TTL: 10s) and ECS.

It works successfully, given enough time, and any services in the DNS record eventually become available backends.

If I have 2 containers running for a service, with 4 defined using server-template, then the first 2 will be “green” and the second two will be “red”.

During an HA deployment, where the 2 containers are replaced 1 by 1, HAProxy fails to register the updated records in time to prevent an outage.

So e.g. during a deployment, you might have an SRV record with 2 results:

_foo_.my.service:

  - A._foo.my.service
  - B._foo.my.service

As the first container (A) is stopped, the SRV record returns only 1 result:

_foo_.my.service:

  - B._foo.my.service

At this point, I would expect HAProxy to remove the server from the server list, so that it appears “red” like the other servers that were missing when the service started.

Instead, the server ends up marked as “MAINT” (orange) due to “resolution”, and sometimes sits stuck for 5+ minutes, failing to acquire the new IP information.

Meanwhile, the SRV record is updated again as the services are replaced/updated:

_foo_.my.service:

  - B._foo.my.service
  - C._foo.my.service

Then again as B is removed:

_foo_.my.service:

  - C._foo.my.service

And finally D is added:

_foo_.my.service:

  - C._foo.my.service
  - D._foo.my.service

This whole time, performing a dig SRV _foo_.my.service @{DNS_IP} on the haproxy host IMMEDIATELY resolves the correct service IPs and Ports as each of the above deployment steps happens. So the issue isn’t with upstream DNS being up-to-date.
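
To make the expectation concrete, the rollout above can be replayed against four server-template slots in a few lines of Python. This is only a sketch of the behaviour the post expects (free a slot as soon as its target leaves the SRV answer, then fill free slots with new targets). It is not HAProxy's actual resolver logic, and `reconcile`, `replay` and `ROLLOUT` are hypothetical names used for illustration.

```python
from typing import List, Optional

# SRV answers seen during the rolling deployment described above.
ROLLOUT = [["A", "B"], ["B"], ["B", "C"], ["C"], ["C", "D"]]


def reconcile(slots: List[Optional[str]], answer: List[str]) -> List[Optional[str]]:
    """Return the slot assignment expected after one SRV answer."""
    live = set(answer)
    # Free every slot whose target is no longer present in the answer.
    updated = [target if target in live else None for target in slots]
    # Put each new target into the first free slot, if one exists.
    for target in answer:
        if target not in updated and None in updated:
            updated[updated.index(None)] = target
    return updated


def replay(answers: List[List[str]], slot_count: int = 4) -> List[List[Optional[str]]]:
    """Apply each answer in order and record the slots after every step."""
    slots: List[Optional[str]] = [None] * slot_count
    history = []
    for answer in answers:
        slots = reconcile(slots, answer)
        history.append(list(slots))
    return history


if __name__ == "__main__":
    for answer, slots in zip(ROLLOUT, replay(ROLLOUT)):
        print(answer, "->", slots)
```

Under this model a slot is never left holding a target that DNS no longer returns, which is the property that appears to be violated when a server sits in “MAINT”.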

This makes the SRV system basically useless to me currently, as even with a rolling deployment with HA services, I end up with an outage.

I have 2 HAProxy servers and the behavior is not identical between them, either (even though they’re identically configured).

Whether one of the server entries stays in “MAINT” for long seems to vary between them.

Eventually it resolves, but waiting 5+ minutes while the services are completely unavailable (even though they’re up, DNS is updated, and they’re ready to receive traffic) is not adequate for production use.

Here’s a sanitized and trimmed config excerpt:

global
        log /dev/log    local0
        log /dev/log    local1 notice
        chroot /var/lib/haproxy
        stats socket /run/haproxy/admin.sock mode 660 level admin expose-fd listeners
        stats timeout 30s
        user haproxy
        group haproxy
        daemon

        # Default SSL material locations
        ca-base /etc/ssl/certs
        crt-base /etc/ssl/private

        # See: https://ssl-config.mozilla.org/#server=haproxy&server-version=2.0.3&config=intermediate
        ssl-default-bind-ciphers ECDHE-ECDSA-AES128-GCM-SHA256:ECDHE-RSA-AES128-GCM-SHA256:ECDHE-ECDSA-AES256-GCM-SHA384:ECDHE-RSA-AES256-GCM-SHA384:ECDHE-ECDSA-CHACHA20-POLY1305:ECDHE-RSA-CHACHA20-POLY1305:DHE-RSA-AES128-GCM-SHA256:DHE-RSA-AES256-GCM-SHA384
        ssl-default-bind-ciphersuites TLS_AES_128_GCM_SHA256:TLS_AES_256_GCM_SHA384:TLS_CHACHA20_POLY1305_SHA256
        ssl-default-bind-options no-sslv3 no-tlsv10 no-tlsv11 no-tls-tickets

        spread-checks     5

defaults
        log           global
        mode          http
        option        httplog
        option        dontlognull
        timeout       connect 5000
        timeout       client  50000
        timeout       server  50000
        errorfile     400 /etc/haproxy/errors/400.http
        errorfile     403 /etc/haproxy/errors/403.http
        errorfile     408 /etc/haproxy/errors/408.http
        errorfile     500 /etc/haproxy/errors/500.http
        errorfile     502 /etc/haproxy/errors/502.http
        errorfile     503 /etc/haproxy/errors/503.http
        errorfile     504 /etc/haproxy/errors/504.http

        option        httpclose
        monitor-uri   /elb-check

        maxconn       60000
        rate-limit    sessions 100
        backlog       60000

resolvers aws-sd
        accepted_payload_size   8192
        hold valid              5s # keep valid answer for up to 5s
        nameserver aws-sd1      169.254.169.253:53

listen stats
        bind              0.0.0.0:9000
        mode              http
        balance
        stats             enable
        stats             uri /stats
        stats             realm HAProxy\ Statistics

frontend HTTP_IN
        bind              0.0.0.0:80
        capture           request header User-Agent len 200
        capture           request header Host len 54
        capture           request header Origin len 54
        capture           request header X-Forwarded-For len 35
        capture           request header X-Forwarded-Proto len 5
        capture           response header status len 3
        option            http-server-close
        option            forwardfor except #sanitized#
        option            forwardfor except #sanitized#

        # environments
        acl               dev        hdr_beg(host)  #sanitized#. #sanitized#.

        # web-services routes
        acl               locations         path_beg /locations

        # dev backend
        use_backend       DEV_HOME if dev !locations
        use_backend       DEV_LOCATIONS if dev locations

backend DEV_HOME
        balance roundrobin
        option httpchk GET /healthcheck
        http-check expect status 200
        default-server inter 10s downinter 2s fastinter 2s rise 5 fall 2
        server-template web 4 _http._tcp.web-service-home-dev-web.my.service resolvers aws-sd check init-addr none resolve-opts allow-dup-ip resolve-prefer ipv4

backend DEV_LOCATIONS
        balance roundrobin
        option httpchk GET /locations/healthcheck
        http-check expect status 200
        default-server inter 10s downinter 2s fastinter 2s rise 5 fall 2
        server-template web 4 _http._tcp.web-service-locations-dev-web.my.service resolvers aws-sd check init-addr none resolve-opts allow-dup-ip resolve-prefer ipv4

eedwards-sk · September 20, 2019, 5:16pm

Just look at this… stuck 2 minutes in “MAINT”, meanwhile 2 valid DNS SRV records are available on the same host:

; <<>> DiG 9.11.3-1ubuntu1.8-Ubuntu <<>> SRV _http._tcp.web-service-locations-dev-web.my.service @169.254.169.253
;; global options: +cmd
;; Got answer:
;; ->>HEADER<<- opcode: QUERY, status: NOERROR, id: 1390
;; flags: qr rd ra; QUERY: 1, ANSWER: 2, AUTHORITY: 0, ADDITIONAL: 0

;; QUESTION SECTION:
;_http._tcp.web-service-locations-dev-web.my.service. IN SRV

;; ANSWER SECTION:
_http._tcp.web-service-locations-dev-web.my.service. 10 IN SRV 1 1 32781 8d80f255924f4442a2c27a4198b289e4._http._tcp.web-service-locations-dev-web.my.service.
_http._tcp.web-service-locations-dev-web.my.service. 10 IN SRV 1 1 32793 2de69cd515154b02a0dd9d35ac13dd74._http._tcp.web-service-locations-dev-web.my.service.

;; Query time: 1 msec
;; SERVER: 169.254.169.253#53(169.254.169.253)
;; WHEN: Fri Sep 20 17:14:16 UTC 2019
;; MSG SIZE  rcvd: 307
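
For anyone who wants to script this comparison rather than eyeball it, the answer lines of `dig` output like the above can be parsed with a short Python sketch. The `SrvRecord` type and `parse_srv_answers` function are hypothetical helpers, and the parser assumes the usual `<name> <ttl> IN SRV <priority> <weight> <port> <target>` layout shown in the ANSWER SECTION.

```python
from typing import List, NamedTuple

# Two answer lines copied from the dig output above.
SAMPLE = """\
;; ANSWER SECTION:
_http._tcp.web-service-locations-dev-web.my.service. 10 IN SRV 1 1 32781 8d80f255924f4442a2c27a4198b289e4._http._tcp.web-service-locations-dev-web.my.service.
_http._tcp.web-service-locations-dev-web.my.service. 10 IN SRV 1 1 32793 2de69cd515154b02a0dd9d35ac13dd74._http._tcp.web-service-locations-dev-web.my.service.
"""


class SrvRecord(NamedTuple):
    name: str
    ttl: int
    priority: int
    weight: int
    port: int
    target: str


def parse_srv_answers(dig_output: str) -> List[SrvRecord]:
    """Extract SRV answer records from dig's text output."""
    records = []
    for line in dig_output.splitlines():
        if line.startswith(";"):
            continue  # comments, section headers and the question line
        fields = line.split()
        # Expected layout: name ttl IN SRV priority weight port target
        if len(fields) == 8 and fields[2] == "IN" and fields[3] == "SRV":
            records.append(
                SrvRecord(
                    name=fields[0],
                    ttl=int(fields[1]),
                    priority=int(fields[4]),
                    weight=int(fields[5]),
                    port=int(fields[6]),
                    target=fields[7],
                )
            )
    return records


if __name__ == "__main__":
    for record in parse_srv_answers(SAMPLE):
        print(record.port, record.target)
```

Running `dig +noall +answer SRV <name> @169.254.169.253` and feeding its output to `parse_srv_answers` gives the set of ports and targets DNS currently advertises.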

eedwards-sk · September 20, 2019, 5:19pm

5 mins stuck in “MAINT”…

still 2 valid SRV records (deployment finished):

; <<>> DiG 9.11.3-1ubuntu1.8-Ubuntu <<>> SRV _http._tcp.web-service-locations-dev-web.my.service @169.254.169.253
;; global options: +cmd
;; Got answer:
;; ->>HEADER<<- opcode: QUERY, status: NOERROR, id: 8745
;; flags: qr rd ra; QUERY: 1, ANSWER: 2, AUTHORITY: 0, ADDITIONAL: 0

;; QUESTION SECTION:
;_http._tcp.web-service-locations-dev-web.my.service. IN SRV

;; ANSWER SECTION:
_http._tcp.web-service-locations-dev-web.my.service. 10 IN SRV 1 1 32782 9b8c065cb88c4cd494a37e9d8a9038fa._http._tcp.web-service-locations-dev-web.my.service.
_http._tcp.web-service-locations-dev-web.my.service. 10 IN SRV 1 1 32793 2de69cd515154b02a0dd9d35ac13dd74._http._tcp.web-service-locations-dev-web.my.service.

;; Query time: 3 msec
;; SERVER: 169.254.169.253#53(169.254.169.253)
;; WHEN: Fri Sep 20 17:17:45 UTC 2019
;; MSG SIZE  rcvd: 307

Meanwhile, the other HAProxy node is happy, and handled the “MAINT” situation faster, for no apparent reason. It’s still totally non-deterministic.

eedwards-sk · September 20, 2019, 8:39pm

I did another test and confirmed that when the entry for one of the backends disappears from the parent SRV record’s response, HAProxy keeps that entry’s IP and port in its runtime server state for some reason.

echo 'show servers state' | sudo nc -U /run/haproxy/admin.sock:

5 DEV_LOCATIONS 1 web1 172.31.79.188 2 0 1 1 408 1 0 2 0 0 0 0 - 32793 _http._tcp.web-service-locations-dev-web.my.service
5 DEV_LOCATIONS 2 web2 172.31.36.137 2 0 1 1 408 1 0 2 0 0 0 0 9b8c065cb88c4cd494a37e9d8a9038fa._http._tcp.web-service-locations-dev-web.my.service 32782 _http._tcp.web-service-locations-dev-web.my.service
5 DEV_LOCATIONS 3 web3 - 2 0 1 1 408 1 0 2 0 0 0 0 - 0 _http._tcp.web-service-locations-dev-web.my.service
5 DEV_LOCATIONS 4 web4 - 2 0 1 1 408 1 0 2 0 0 0 0 - 0 _http._tcp.web-service-locations-dev-web.my.service

This is with 2 containers, 2 initial SRV entries in the record, server-template 4, and then during a deployment the record starts returning only 1 entry.

Notice how web3 and web4 are as expected – no IP, no Port.

However, notice web1 – when the entry from the SRV record went away, for some reason, it has kept the IP 172.31.79.188 and port 32793.
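
The same check can be automated. The Python sketch below parses `show servers state` lines and flags slots that still hold an address although they no longer have an SRV target name. The column names in `FIELDS` are an assumption based on the header line HAProxy prints for state format version 1 and on the 20 columns visible in the sample above, so verify them against your build. `parse_state` and `stale_srv_slots` are hypothetical helper names.

```python
from typing import Dict, List

# Assumed column order for "show servers state" (format version 1).
FIELDS = [
    "be_id", "be_name", "srv_id", "srv_name", "srv_addr", "srv_op_state",
    "srv_admin_state", "srv_uweight", "srv_iweight",
    "srv_time_since_last_change", "srv_check_status", "srv_check_result",
    "srv_check_health", "srv_check_state", "srv_agent_state",
    "bk_f_forced_id", "srv_f_forced_id", "srv_fqdn", "srv_port", "srvrecord",
]

# The four lines from the output above.
SAMPLE = """\
5 DEV_LOCATIONS 1 web1 172.31.79.188 2 0 1 1 408 1 0 2 0 0 0 0 - 32793 _http._tcp.web-service-locations-dev-web.my.service
5 DEV_LOCATIONS 2 web2 172.31.36.137 2 0 1 1 408 1 0 2 0 0 0 0 9b8c065cb88c4cd494a37e9d8a9038fa._http._tcp.web-service-locations-dev-web.my.service 32782 _http._tcp.web-service-locations-dev-web.my.service
5 DEV_LOCATIONS 3 web3 - 2 0 1 1 408 1 0 2 0 0 0 0 - 0 _http._tcp.web-service-locations-dev-web.my.service
5 DEV_LOCATIONS 4 web4 - 2 0 1 1 408 1 0 2 0 0 0 0 - 0 _http._tcp.web-service-locations-dev-web.my.service
"""


def parse_state(output: str) -> List[Dict[str, str]]:
    """Turn each data line into a dict keyed by the assumed column names."""
    rows = []
    for line in output.splitlines():
        values = line.split()
        # Skip blank lines, the version line, the '#' header and odd rows.
        if len(values) != len(FIELDS) or values[0].startswith("#"):
            continue
        rows.append(dict(zip(FIELDS, values)))
    return rows


def stale_srv_slots(rows: List[Dict[str, str]]) -> List[Dict[str, str]]:
    """Slots tied to an SRV record that keep an address but have no target."""
    return [
        row for row in rows
        if row["srvrecord"] != "-"
        and row["srv_fqdn"] == "-"
        and row["srv_addr"] != "-"
    ]


if __name__ == "__main__":
    for row in stale_srv_slots(parse_state(SAMPLE)):
        print(row["be_name"], row["srv_name"], row["srv_addr"], row["srv_port"])
```

On the sample this flags only web1, matching the observation above, while web2 (still has a target) and web3/web4 (no address) are left alone.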

This is confirmed not present in a DNS result:

; <<>> DiG 9.11.3-1ubuntu1.8-Ubuntu <<>> SRV _http._tcp.web-service-locations-dev-web.my.service
;; global options: +cmd
;; Got answer:
;; ->>HEADER<<- opcode: QUERY, status: NOERROR, id: 29036
;; flags: qr rd ra; QUERY: 1, ANSWER: 1, AUTHORITY: 0, ADDITIONAL: 1

;; OPT PSEUDOSECTION:
; EDNS: version: 0, flags:; udp: 65494
;; QUESTION SECTION:
;_http._tcp.web-service-locations-dev-web.my.service. IN SRV

;; ANSWER SECTION:
_http._tcp.web-service-locations-dev-web.my.service. 7 IN SRV 1 1 32782 9b8c065cb88c4cd494a37e9d8a9038fa._http._tcp.web-service-locations-dev-web.my.service.

;; Query time: 0 msec
;; SERVER: 127.0.0.53#53(127.0.0.53)
;; WHEN: Fri Sep 20 20:29:31 UTC 2019
;; MSG SIZE  rcvd: 143

So why is HAProxy holding onto that IP/port for that server? It seems related to it getting stuck in “MAINT”.

Baptiste · September 24, 2019, 12:48pm

Hi Eedward,

Thanks for your detailed report.
I will try to reproduce your issue and come back to you asap.

eedwards-sk · October 2, 2019, 3:36pm

I ended up having to implement a workaround, and I’m no longer using SRV records due to this issue.

I’d love to be able to go back to SRV records though. Let me know if I can help provide any additional detail.

eedwards-sk · October 24, 2019, 10:54pm

Bumping this to see if anyone else has run into this or solved it?

I’ve found other instances of people reporting this issue, as well, but no solutions or acknowledgment.

seb176 · October 6, 2020, 8:15am

Exact same issue for me with HAProxy 2.2.1.

seb176 · October 13, 2020, 3:09pm

The problem has been solved since September 14, 2020.
The issue came from AWS. I had to re-create my namespace.

Today AWS Cloud Map released new default values for negative DNS caching. Now it is 15 seconds for namespaces with private DNS resolution (instead of 300 seconds) and 60 seconds for namespaces with public DNS resolution (instead of 900 seconds). These new defaults apply to new namespaces, created on or after 09/14/2020 only. We continue working on enabling modification of negative DNS caching settings for all existing namespaces as well.

github.com/aws/aws-app-mesh-roadmap

Bug: Cloud Map DNS negative caching TTL cannot be changed

Opened 26 Jun 2020, 06:26PM UTC by bcelenza · Closed 22 Jul 2021, 11:20PM UTC · Bug, Cloud Map

**Note: This bug affects customers who use Cloud Map with App Mesh. Although it is not an App Mesh-specific issue, the bug has been reported in this repository for visibility and resolution tracking.**

**Summary**

When using a Cloud Map namespace that's associated with a public or private Route 53 hosted zone, the negative caching TTL is not adjustable. For public hosted zones, the TTL is 900 seconds. For private hosted zones with the use of the VPC DNS resolver (IP address ending in .2), the TTL is 300 seconds. See the [Route 53 documentation](https://docs.aws.amazon.com/Route53/latest/DeveloperGuide/SOA-NSrecords.html#SOArecords) for how this TTL is determined.

The negative caching TTL is used whenever a requested DNS name contains no records. When this occurs, the DNS server responds with `NXDOMAIN`, and that value is cached for the time indicated above. In the event that all records are removed for a given DNS name, clients querying DNS will continue to see `NXDOMAIN` responses until the cache has expired.

**Steps to Reproduce**

1. Create a Cloud Map namespace ([docs](https://docs.aws.amazon.com/cloud-map/latest/dg/creating-namespaces.html)) associated with a Route 53 private hosted zone.
2. Create a service entry in the Cloud Map namespace.
3. Register an instance to the service entry and give it an IP address. Set the TTL for this record to 10 seconds.
4. Using the `dig` utility on an EC2 instance within the VPC associated with the private hosted zone, query DNS for the record and assert that you receive the IP address you entered in step 3. (example command: `dig my-service.namespace.cluster.local`)
5. Remove the instance added in step 3 from Cloud Map.
6. Repeat the query from step 4 until the record TTL (10 seconds) expires. Assert that you begin receiving an `NXDOMAIN` response.
7. Re-add the instance to Cloud Map.
8. Repeat the query from step 4, and assert that you continue experiencing the `NXDOMAIN` response until the negative caching TTL (300 seconds for a private hosted zone using the VPC DNS resolver) expires.

**Are you currently working around this issue?**

To mitigate the impact of the negative caching TTL resulting in no records being returned (and a potential outage):

1. Use a Route 53 private hosted zone and the VPC DNS resolver (IP address ending in .2). This ensures your negative caching TTL is 300 seconds (instead of 900 in a public hosted zone).
2. Ensure your services scale out before scaling in during deployments. This mitigates the possibility of all records being removed from the DNS record. See our [best practices documentation](https://docs.aws.amazon.com/app-mesh/latest/userguide/best-practices.html#scale-out) for details on this.
3. If you're only using Cloud Map with App Mesh (and specifying Cloud Map as the service discovery type on your Virtual Nodes), consider increasing the instance record TTL. This will help mitigate a temporal issue in which all records are removed from the DNS name. Since Envoy proxies managed by App Mesh do not use DNS to resolve IP address for Virtual Nodes that use Cloud Map, the increased TTL will help ensure your applications have a resolvable DNS name without impacting which endpoints traffic is routed to.
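
The effect of the negative caching TTL on a rollout can be worked through with simple arithmetic. The Python sketch below assumes a deliberately simplified model: a single caching resolver that stores an `NXDOMAIN` answer at one moment and serves it until the negative TTL runs out. The function names and the example timings (`NXDOMAIN` cached at t=10s, record re-added at t=20s) are hypothetical; the TTL values are the old and new defaults quoted in this thread.

```python
def earliest_visible(nx_cached_at: float, readded_at: float, negative_ttl: float) -> float:
    """Earliest time a client behind the caching resolver can see the record again."""
    # The cached NXDOMAIN hides the record until it expires, and the record
    # cannot be seen before it has actually been re-added.
    return max(readded_at, nx_cached_at + negative_ttl)


def hidden_for(nx_cached_at: float, readded_at: float, negative_ttl: float) -> float:
    """Seconds during which the re-added record exists but is still masked."""
    return earliest_visible(nx_cached_at, readded_at, negative_ttl) - readded_at


# Negative caching TTLs in seconds, as quoted in this thread.
DEFAULTS = {
    "private, before 2020-09-14": 300,
    "private, new namespaces": 15,
    "public, before 2020-09-14": 900,
    "public, new namespaces": 60,
}

if __name__ == "__main__":
    # Hypothetical timeline: NXDOMAIN cached at t=10s, instance re-added at t=20s.
    for label, ttl in DEFAULTS.items():
        print(f"{label}: masked for {hidden_for(10, 20, ttl):.0f}s after re-add")
```

With these example timings, the old private-namespace default masks a healthy instance for several minutes, which is consistent with the 5+ minute delays reported earlier in the thread, while the new default shrinks that window to a few seconds.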
