HAProxy 2.0.5 often fails to quickly update SRV records

I’m attempting to use HAProxy Resolvers along with SRV Records and server-template to allow services on dynamic ports to register with HAProxy.

I’m using AWS Service Discovery (with Route53, TTL: 10s) and ECS.

It works, given enough time: any services in the DNS record eventually become available backends.


If I have 2 containers running for a service, with 4 slots defined using server-template, then the first 2 will be “green” and the remaining 2 will be “red”.
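
For context, server-template pre-allocates a fixed number of named server slots (web1 through web4 here) that the resolver then fills from the SRV answers; with only 2 SRV entries, 2 slots stay empty. A minimal sketch using the placeholder record name from below (the backend name FOO is made up):

backend FOO
        # Creates 4 slots named web1..web4, all sharing these settings.
        # Each slot is filled with a target/port from the SRV answers for
        # _foo_.my.service; slots with no matching answer stay address-less,
        # which is what shows up “red” on the stats page.
        server-template web 4 _foo_.my.service resolvers aws-sd check init-addr none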

During an HA deployment, where the 2 containers are replaced 1 by 1, HAProxy fails to register the updated records in time to prevent an outage.

So e.g. during a deployment, you might have an SRV record with 2 results:

_foo_.my.service:

  - A._foo.my.service
  - B._foo.my.service

As the first container (A) is stopped, the SRV record returns only 1 result:

_foo_.my.service:

  - B._foo.my.service

At this point, I would expect HAProxy to remove the server from the server list, so it would appear “red”, just like the other servers that were missing when the service started.

Instead, the server ends up marked as “MAINT” (orange) due to “resolution”, and will sometimes sit “stuck” for 5+ minutes, failing to acquire the new IP information.

Meanwhile, the SRV record is updated again as the services are replaced/updated:

_foo_.my.service:

  - B._foo.my.service
  - C._foo.my.service

Then again as B is removed:

_foo_.my.service:

  - C._foo.my.service

And finally, D is added:

_foo_.my.service:

  - C._foo.my.service
  - D._foo.my.service

This whole time, performing a dig SRV _foo_.my.service @{DNS_IP} on the HAProxy host IMMEDIATELY resolves the correct service IPs and ports as each of the above deployment steps happens. So the issue is not stale upstream DNS.
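
Concretely, a loop like this on the HAProxy host (a sketch; {DNS_IP} is the VPC resolver 169.254.169.253 shown in the dig output further down) shows the record tracking each deployment step:

# Poll the SRV record once a second during the deployment; the answers
# follow the containers being replaced within the 10s TTL.
watch -n 1 'dig +short SRV _foo_.my.service @169.254.169.253'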

This makes the SRV system basically useless to me right now, as even with a rolling deployment of HA services, I end up with an outage.

I have 2 HAProxy servers, and the behavior is not even identical between them (even though they’re identically configured).

How long a server entry stays in “MAINT” seems to vary between them.

Eventually it resolves, but waiting 5+ minutes with the services completely unavailable (even though they’re up, DNS is updated, and they’re ready to receive traffic) is not adequate for production use.
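
As an aside, while a slot sits in “MAINT” the resolver’s own counters can be dumped over the stats socket, which at least shows whether answers are arriving (a sketch using the socket path from the config below; I haven’t pasted its output here):

# Dump per-nameserver counters (valid answers, errors, outdated responses, ...)
# for the aws-sd resolvers section while a server is stuck.
echo 'show resolvers aws-sd' | sudo nc -U /run/haproxy/admin.sock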


Here’s a sanitized and trimmed config excerpt (a note on the resolvers timing options follows it):

global
        log /dev/log    local0
        log /dev/log    local1 notice
        chroot /var/lib/haproxy
        stats socket /run/haproxy/admin.sock mode 660 level admin expose-fd listeners
        stats timeout 30s
        user haproxy
        group haproxy
        daemon

        # Default SSL material locations
        ca-base /etc/ssl/certs
        crt-base /etc/ssl/private

        # See: https://ssl-config.mozilla.org/#server=haproxy&server-version=2.0.3&config=intermediate
        ssl-default-bind-ciphers ECDHE-ECDSA-AES128-GCM-SHA256:ECDHE-RSA-AES128-GCM-SHA256:ECDHE-ECDSA-AES256-GCM-SHA384:ECDHE-RSA-AES256-GCM-SHA384:ECDHE-ECDSA-CHACHA20-POLY1305:ECDHE-RSA-CHACHA20-POLY1305:DHE-RSA-AES128-GCM-SHA256:DHE-RSA-AES256-GCM-SHA384
        ssl-default-bind-ciphersuites TLS_AES_128_GCM_SHA256:TLS_AES_256_GCM_SHA384:TLS_CHACHA20_POLY1305_SHA256
        ssl-default-bind-options no-sslv3 no-tlsv10 no-tlsv11 no-tls-tickets

        spread-checks     5

defaults
        log           global
        mode          http
        option        httplog
        option        dontlognull
        timeout       connect 5000
        timeout       client  50000
        timeout       server  50000
        errorfile     400 /etc/haproxy/errors/400.http
        errorfile     403 /etc/haproxy/errors/403.http
        errorfile     408 /etc/haproxy/errors/408.http
        errorfile     500 /etc/haproxy/errors/500.http
        errorfile     502 /etc/haproxy/errors/502.http
        errorfile     503 /etc/haproxy/errors/503.http
        errorfile     504 /etc/haproxy/errors/504.http

        option        httpclose
        monitor-uri   /elb-check

        maxconn       60000
        rate-limit    sessions 100
        backlog       60000

resolvers aws-sd
        accepted_payload_size   8192
        hold valid              5s # keep valid answer for up to 5s
        nameserver aws-sd1      169.254.169.253:53

listen stats
        bind              0.0.0.0:9000
        mode              http
        balance
        stats             enable
        stats             uri /stats
        stats             realm HAProxy\ Statistics

frontend HTTP_IN
        bind              0.0.0.0:80
        capture           request header User-Agent len 200
        capture           request header Host len 54
        capture           request header Origin len 54
        capture           request header X-Forwarded-For len 35
        capture           request header X-Forwarded-Proto len 5
        capture           response header status len 3
        option            http-server-close
        option            forwardfor except #sanitized#
        option            forwardfor except #sanitized#

        # environments
        acl               dev        hdr_beg(host)  #sanitized#. #sanitized#.

        # web-services routes
        acl               locations         path_beg /locations

        # dev backend
        use_backend       DEV_HOME if dev !locations
        use_backend       DEV_LOCATIONS if dev locations

backend DEV_HOME
        balance roundrobin
        option httpchk GET /healthcheck
        http-check expect status 200
        default-server inter 10s downinter 2s fastinter 2s rise 5 fall 2
        server-template web 4 _http._tcp.web-service-home-dev-web.my.service resolvers aws-sd check init-addr none resolve-opts allow-dup-ip resolve-prefer ipv4

backend DEV_LOCATIONS
        balance roundrobin
        option httpchk GET /locations/healthcheck
        http-check expect status 200
        default-server inter 10s downinter 2s fastinter 2s rise 5 fall 2
        server-template web 4 _http._tcp.web-service-locations-dev-web.my.service resolvers aws-sd check init-addr none resolve-opts allow-dup-ip resolve-prefer ipv4
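
As noted above, a word on the resolvers section: mine only sets accepted_payload_size, the nameserver, and hold valid. HAProxy 2.0 also exposes retry and hold settings for the other resolution outcomes; a fuller section would look roughly like the sketch below. The values are illustrative only, and I have not verified that tuning them changes the “MAINT” behavior:

resolvers aws-sd
        accepted_payload_size   8192
        nameserver aws-sd1      169.254.169.253:53
        resolve_retries         3   # queries to send before giving up
        timeout resolve         5s  # how often to trigger a resolution
        timeout retry           1s  # wait between two retries
        hold valid              5s  # keep a valid answer for up to 5s
        hold obsolete           10s # keep a server whose record disappeared
        hold nx                 5s  # on NXDOMAIN
        hold refused            5s  # on REFUSED
        hold timeout            5s  # on query timeout
        hold other              5s  # on any other error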

Just look at this… stuck for 2 minutes in “MAINT”, while 2 valid DNS SRV records are available on the same host:

; <<>> DiG 9.11.3-1ubuntu1.8-Ubuntu <<>> SRV _http._tcp.web-service-locations-dev-web.my.service @169.254.169.253
;; global options: +cmd
;; Got answer:
;; ->>HEADER<<- opcode: QUERY, status: NOERROR, id: 1390
;; flags: qr rd ra; QUERY: 1, ANSWER: 2, AUTHORITY: 0, ADDITIONAL: 0

;; QUESTION SECTION:
;_http._tcp.web-service-locations-dev-web.my.service. IN SRV

;; ANSWER SECTION:
_http._tcp.web-service-locations-dev-web.my.service. 10 IN SRV 1 1 32781 8d80f255924f4442a2c27a4198b289e4._http._tcp.web-service-locations-dev-web.my.service.
_http._tcp.web-service-locations-dev-web.my.service. 10 IN SRV 1 1 32793 2de69cd515154b02a0dd9d35ac13dd74._http._tcp.web-service-locations-dev-web.my.service.

;; Query time: 1 msec
;; SERVER: 169.254.169.253#53(169.254.169.253)
;; WHEN: Fri Sep 20 17:14:16 UTC 2019
;; MSG SIZE  rcvd: 307

5 mins stuck in “MAINT”…

Still 2 valid SRV records (deployment finished):

; <<>> DiG 9.11.3-1ubuntu1.8-Ubuntu <<>> SRV _http._tcp.web-service-locations-dev-web.my.service @169.254.169.253
;; global options: +cmd
;; Got answer:
;; ->>HEADER<<- opcode: QUERY, status: NOERROR, id: 8745
;; flags: qr rd ra; QUERY: 1, ANSWER: 2, AUTHORITY: 0, ADDITIONAL: 0

;; QUESTION SECTION:
;_http._tcp.web-service-locations-dev-web.my.service. IN SRV

;; ANSWER SECTION:
_http._tcp.web-service-locations-dev-web.my.service. 10 IN SRV 1 1 32782 9b8c065cb88c4cd494a37e9d8a9038fa._http._tcp.web-service-locations-dev-web.my.service.
_http._tcp.web-service-locations-dev-web.my.service. 10 IN SRV 1 1 32793 2de69cd515154b02a0dd9d35ac13dd74._http._tcp.web-service-locations-dev-web.my.service.

;; Query time: 3 msec
;; SERVER: 169.254.169.253#53(169.254.169.253)
;; WHEN: Fri Sep 20 17:17:45 UTC 2019
;; MSG SIZE  rcvd: 307

Meanwhile, the other HAProxy node is happy and handled the “MAINT” situation faster, for no apparent reason. The behavior is completely non-deterministic.

I did another test and confirmed that when the entry for one of the backends disappears from the response for the parent SRV record, HAProxy retains the IP and port for that server for some reason.

echo 'show servers state' | sudo nc -U /run/haproxy/admin.sock:

5 DEV_LOCATIONS 1 web1 172.31.79.188 2 0 1 1 408 1 0 2 0 0 0 0 - 32793 _http._tcp.web-service-locations-dev-web.my.service
5 DEV_LOCATIONS 2 web2 172.31.36.137 2 0 1 1 408 1 0 2 0 0 0 0 9b8c065cb88c4cd494a37e9d8a9038fa._http._tcp.web-service-locations-dev-web.my.service 32782 _http._tcp.web-service-locations-dev-web.my.service
5 DEV_LOCATIONS 3 web3 - 2 0 1 1 408 1 0 2 0 0 0 0 - 0 _http._tcp.web-service-locations-dev-web.my.service
5 DEV_LOCATIONS 4 web4 - 2 0 1 1 408 1 0 2 0 0 0 0 - 0 _http._tcp.web-service-locations-dev-web.my.service

This is with 2 containers, 2 initial SRV entries in the record, server-template 4, and then during a deployment the record starts returning only 1 entry.

Notice how web3 and web4 are as expected: no IP, no port.

However, notice web1: when its entry disappeared from the SRV record, it kept the IP 172.31.79.188 and port 32793 for some reason.

That entry is confirmed to be absent from the DNS response:

; <<>> DiG 9.11.3-1ubuntu1.8-Ubuntu <<>> SRV _http._tcp.web-service-locations-dev-web.my.service
;; global options: +cmd
;; Got answer:
;; ->>HEADER<<- opcode: QUERY, status: NOERROR, id: 29036
;; flags: qr rd ra; QUERY: 1, ANSWER: 1, AUTHORITY: 0, ADDITIONAL: 1

;; OPT PSEUDOSECTION:
; EDNS: version: 0, flags:; udp: 65494
;; QUESTION SECTION:
;_http._tcp.web-service-locations-dev-web.my.service. IN SRV

;; ANSWER SECTION:
_http._tcp.web-service-locations-dev-web.my.service. 7 IN SRV 1 1 32782 9b8c065cb88c4cd494a37e9d8a9038fa._http._tcp.web-service-locations-dev-web.my.service.

;; Query time: 0 msec
;; SERVER: 127.0.0.53#53(127.0.0.53)
;; WHEN: Fri Sep 20 20:29:31 UTC 2019
;; MSG SIZE  rcvd: 143

So why is HAProxy holding onto that IP/port for that server? It seems related to the server getting stuck in “MAINT”.
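
For anyone trying to read those show servers state lines: the full output starts with a commented header naming each column (names along the lines of srv_addr, srv_fqdn, srv_port), so the address, FQDN and port fields can be matched up, e.g.:

# Print the state-format version line plus the commented column header;
# restricting the dump to one backend is optional.
echo 'show servers state DEV_LOCATIONS' | sudo nc -U /run/haproxy/admin.sock | head -n 2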

Hi Eedward,

Thanks for your detailed report.
I will try to reproduce your issue and get back to you ASAP.

I ended up having to implement a workaround, and I’m no longer using SRV records due to this issue.

I’d love to be able to go back to SRV records though. Let me know if I can help provide any additional detail.
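
For anyone hitting the same thing, a hypothetical sketch (not necessarily what I did) of one way to sidestep HAProxy’s own SRV resolution: resolve the record out-of-band and push the result over the runtime API. Backend and slot names below match the config above:

#!/bin/sh
# Hypothetical sketch only: resolve the first SRV answer ourselves and push
# it into one server-template slot via the HAProxy runtime API.
SRV_NAME="_http._tcp.web-service-locations-dev-web.my.service"
NS="169.254.169.253"
SOCK="/run/haproxy/admin.sock"

# dig +short SRV prints: <priority> <weight> <port> <target>
REC=$(dig +short SRV "$SRV_NAME" @"$NS" | head -n 1)
PORT=$(echo "$REC" | awk '{print $3}')
TARGET=$(echo "$REC" | awk '{print $4}')
IP=$(dig +short A "$TARGET" @"$NS" | head -n 1)

# Update slot web1 in backend DEV_LOCATIONS and take it out of MAINT.
echo "set server DEV_LOCATIONS/web1 addr $IP port $PORT" | sudo nc -U "$SOCK"
echo "set server DEV_LOCATIONS/web1 state ready" | sudo nc -U "$SOCK"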

Bumping this to see if anyone else has run into this or solved it?

I’ve found other instances of people reporting this issue, as well, but no solutions or acknowledgment.

Exact same issue for me with HAProxy 2.2.1.

Problem solved as of September 14th, 2020.
The issue came from AWS; I had to re-create my namespace. From AWS’s announcement:

Today AWS Cloud Map released new default values for negative DNS caching. Now it is 15 seconds for namespaces with private DNS resolution (instead of 300 seconds) and 60 seconds for namespaces with public DNS resolution (instead of 900 seconds). These new defaults apply to new namespaces, created on or after 09/14/2020 only. We continue working on enabling modification of negative DNS caching settings for all existing namespaces as well.
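
For anyone who wants to check what negative-caching TTL their namespace actually hands out: per RFC 2308 it comes from the SOA record returned with a negative answer, so querying a name that should not exist (the name below is made up) and looking at the AUTHORITY section shows it. For example, against the VPC resolver used above:

# The SOA in the AUTHORITY section carries the negative-caching TTL
# (the minimum of its last field and the SOA's own TTL, per RFC 2308).
dig SRV _does-not-exist._tcp.my.service @169.254.169.253 +noall +comments +authority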