I’m attempting to use HAProxy Resolvers along with SRV Records and server-template to allow services on dynamic ports to register with HAProxy.
I’m using AWS Service Discovery (with Route53, TTL: 10s) and ECS.
It works successfully, given enough time, and any services in the DNS record eventually become available backends.
If I have 2 containers running for a service, with 4 defined using server-template, then the first 2 will be “green” and the second two will be “red”.
During an HA deployment, where the 2 containers are replaced 1 by 1, HAProxy fails to register the updated records in time to prevent an outage.
For example, during a deployment the SRV record might initially return 2 results:
- A._foo.my.service
- B._foo.my.service
As the first container (A) is stopped, the SRV record returns only 1 result:
- B._foo.my.service
At this point, I would expect HAProxy to remove the server from the server list, so that it appears “red” like the other servers that were missing when the service started.
Instead, the server ends up marked as “MAINT” (orange) due to “resolution”, and sometimes sits “stuck” for 5+ minutes, failing to acquire the new IP information.
Meanwhile, the SRV record is updated again as the services are replaced/updated:
- B._foo.my.service
- C._foo.my.service
Then again as B is removed:
- C._foo.my.service
And finally D is added:
- C._foo.my.service
- D._foo.my.service
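To make the sequence above easier to follow, here is a minimal Python sketch that lists what is added and removed between consecutive SRV answers. It only models the DNS side of the rollout described in this post; `STEPS` and `diff_steps` are illustrative names, not part of HAProxy or AWS. It also makes the expectation explicit: every answer contains at least one target, so in principle at least one backend should always be routable.

```python
# Successive SRV answers during the rolling deployment described above.
STEPS = [
    {"A._foo.my.service", "B._foo.my.service"},
    {"B._foo.my.service"},
    {"B._foo.my.service", "C._foo.my.service"},
    {"C._foo.my.service"},
    {"C._foo.my.service", "D._foo.my.service"},
]


def diff_steps(steps):
    """Yield (added, removed) target lists between consecutive SRV answers."""
    for prev, curr in zip(steps, steps[1:]):
        yield sorted(curr - prev), sorted(prev - curr)


if __name__ == "__main__":
    for i, (added, removed) in enumerate(diff_steps(STEPS), start=1):
        print(f"step {i}: added={added} removed={removed}")
```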
This whole time, running dig SRV _foo.my.service @{DNS_IP} on the HAProxy host IMMEDIATELY resolves the correct service IPs and ports as each of the above deployment steps happens, so the issue isn’t that upstream DNS is out of date.
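For anyone wanting to reproduce the dig check in a loop, a rough Python sketch follows. It assumes `dig` is installed and that `dig +short SRV` prints one `priority weight port target` line per record. The names `SrvRecord`, `parse_dig_short_srv`, and `query_srv` are hypothetical helpers, and the record name and DNS IP passed to `query_srv` are placeholders to fill in.

```python
import subprocess
from typing import List, NamedTuple


class SrvRecord(NamedTuple):
    priority: int
    weight: int
    port: int
    target: str


def parse_dig_short_srv(output: str) -> List[SrvRecord]:
    """Parse `dig +short SRV` output: one 'priority weight port target' per line."""
    records = []
    for line in output.splitlines():
        parts = line.split()
        # Skip blank lines, dig comments (';; ...') and anything malformed.
        if len(parts) != 4 or not all(p.isdigit() for p in parts[:3]):
            continue
        priority, weight, port, target = parts
        records.append(
            SrvRecord(int(priority), int(weight), int(port), target.rstrip("."))
        )
    return records


def query_srv(name: str, dns_ip: str) -> List[SrvRecord]:
    """Run dig against a specific nameserver and return the parsed SRV records."""
    result = subprocess.run(
        ["dig", "+short", "SRV", name, f"@{dns_ip}"],
        capture_output=True,
        text=True,
        check=True,
        timeout=5,
    )
    return parse_dig_short_srv(result.stdout)
```

Calling `query_srv` once per second during a deployment and logging the returned targets with a timestamp gives a timeline to compare against what the HAProxy stats page shows.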
This makes the SRV system basically useless to me currently, as even with a rolling deployment with HA services, I end up with an outage.
I have 2 HAProxy servers and the behavior is not identical between them, either (even though they’re identically configured).
Whether one of the server entries stays in “MAINT” for long seems to vary between them.
Eventually, it ends up resolving, but having to wait 5+ minutes while the services are completely unavailable (even though they’re up, DNS is updated, and they’re ready to receive traffic) is not adequate for production usage.
Here’s a sanitized and trimmed config excerpt:
global
log /dev/log local0
log /dev/log local1 notice
chroot /var/lib/haproxy
stats socket /run/haproxy/admin.sock mode 660 level admin expose-fd listeners
stats timeout 30s
user haproxy
group haproxy
# Default SSL material locations
ca-base /etc/ssl/certs
crt-base /etc/ssl/private
# See: https://ssl-config.mozilla.org/#server=haproxy&server-version=2.0.3&config=intermediate
ssl-default-bind-ciphersuites TLS_AES_128_GCM_SHA256:TLS_AES_256_GCM_SHA384:TLS_CHACHA20_POLY1305_SHA256
ssl-default-bind-options no-sslv3 no-tlsv10 no-tlsv11 no-tls-tickets
spread-checks 5
defaults
log global
mode http
option httplog
option dontlognull
timeout connect 5000
timeout client 50000
timeout server 50000
errorfile 400 /etc/haproxy/errors/400.http
errorfile 403 /etc/haproxy/errors/403.http
errorfile 408 /etc/haproxy/errors/408.http
errorfile 500 /etc/haproxy/errors/500.http
errorfile 502 /etc/haproxy/errors/502.http
errorfile 503 /etc/haproxy/errors/503.http
errorfile 504 /etc/haproxy/errors/504.http
option httpclose
monitor-uri /elb-check
maxconn 60000
rate-limit sessions 100
backlog 60000
resolvers aws-sd
accepted_payload_size 8192
hold valid 5s # keep valid answer for up to 5s
nameserver aws-sd1
listen stats
mode http
stats enable
stats uri /stats
stats realm HAProxy\ Statistics
frontend HTTP_IN
capture request header User-Agent len 200
capture request header Host len 54
capture request header Origin len 54
capture request header X-Forwarded-For len 35
capture request header X-Forwarded-Proto len 5
capture response header status len 3
option http-server-close
option forwardfor except #sanitized#
option forwardfor except #sanitized#
# environments
acl dev hdr_beg(host) #sanitized#. #sanitized#.
# web-services routes
acl locations path_beg /locations
# dev backend
use_backend DEV_HOME if dev !locations
use_backend DEV_LOCATIONS if dev locations
backend DEV_HOME
balance roundrobin
option httpchk GET /healthcheck
http-check expect status 200
default-server inter 10s downinter 2s fastinter 2s rise 5 fall 2
server-template web 4 _http._tcp.web-service-home-dev-web.my.service resolvers aws-sd check init-addr none resolve-opts allow-dup-ip resolve-prefer ipv4
backend DEV_LOCATIONS
balance roundrobin
option httpchk GET /locations/healthcheck
http-check expect status 200
default-server inter 10s downinter 2s fastinter 2s rise 5 fall 2
server-template web 4 _http._tcp.web-service-locations-dev-web.my.service resolvers aws-sd check init-addr none resolve-opts allow-dup-ip resolve-prefer ipv4