When a server’s IP changes during runtime, HAProxy does not resolve the hostname again when using external Python health check scripts. It will hold on to the old IP forever.
Here’s a timeline of events that triggers an issue for us:
HAProxy starts up and resolves the name of server.com to the IP 123.123.123.123. Backend is marked as UP.
The backend runs a Python external health check script for the server every 10 seconds. Python resolves server.com by itself, runs its tests against the server and keeps it UP.
The IP of server.com changes in DNS.
Python health checks resolve to the new IP, run their tests and everything works fine. Server kept UP.
A request we want to route to server.com comes from a client to HAProxy. HAProxy still has the old 123.123.123.123 IP configured. The request is routed there and we get a 404 as our expected service is no longer there (but it’s still a valid IP with a response). The 404 is returned to the client.
Python health check runs again, resolves to the new IP again and passes the checks. Server kept UP.
We haven’t found a way to force HAProxy to resolve names again at set intervals. Instead it will hold on to valid IPs until either restarted or reloaded.
Do you know of any approach we could use to make HAProxy re-resolve a hostname and switch to the new IP even if the old IP is still functional?
Running our own custom health check scripts is a hard requirement.
Here are some of our related HAProxy configuration snippets. We are using HAProxy 2.9.
resolvers default
    parse-resolv-conf
    hold other 15s
    hold refused 15s
    hold nx 15s
    hold timeout 15s
    hold valid 10s
    hold obsolete 15s

backend server.com
    option external-check
    external-check command /health-check.py
    server server.com server.com:443 init-addr libc,none
Thanks for the quick reply @lukastribus. I was under the assumption that a resolvers section named “default” would be enforced “by default” on the servers. It seems that was not the case.
I have now added resolvers default to the server line, together with init-addr none, and I do see the following in the logs:
[WARNING] (52) : server.com/server.com changed its IP from (none) to 103.18.17.221 by default/127.0.0.11.
[WARNING] (52) : Server server.com/server.com ('server.com') is UP/READY (resolves again).
[WARNING] (52) : Server server.com/server.com administratively READY thanks to valid DNS answer.
[WARNING] (52) : server.com/server.com changed its IP from (none) to 103.18.17.221 by DNS cache.
[WARNING] (52) : Server server.com/server.com ('server.com') is UP/READY (resolves again).
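In config terms, that change amounts to roughly the following. This is a sketch based on the backend snippet from my first post; the comments are only my reading of what the two keywords do, and everything else is unchanged.

```haproxy
backend server.com
    option external-check
    external-check command /health-check.py
    # resolvers default: resolve server.com at runtime through the
    #                    "default" resolvers section
    # init-addr none:    start without an address and wait for the resolvers
    server server.com server.com:443 resolvers default init-addr none
```

This matches the log lines above: the server starts with no address ("from (none)") and becomes UP once the resolvers section returns a valid answer.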
However, once HAProxy is running, I test this by manually overriding the name in the /etc/hosts file:
127.0.0.1 server.com
After saving the file, HAProxy never updates its IP for server.com to be 127.0.0.1. It remains the original 103.18.17.221 no matter how long I wait. If I keep sending requests to HAProxy, they all land on 103.18.17.221 as well.
If I restart HAProxy I get the following line:
server.com/server.com changed its IP from (none) to 103.18.17.221 by DNS cache.
I would like the cache to expire, or some other mechanism to trigger a re-resolve. Any suggestions?
Thanks @lukastribus. We figured out that /etc/hosts was not being taken into account, so we ran our own fake DNS server. With that we were able to change the IP the name resolves to, and we saw that HAProxy does pick up the changes.
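For anyone repeating the test, pointing the resolvers section at a test DNS server can be sketched like this. The nameserver label fakedns and the address 127.0.0.1:5353 are placeholders for illustration, not the values we actually used.

```haproxy
resolvers default
    # Hypothetical test DNS server: label and address are placeholders.
    # This replaces parse-resolv-conf for the duration of the test.
    nameserver fakedns 127.0.0.1:5353
    # Keep a valid answer for 10s, as in our original section.
    hold valid 10s
```

Changing the record served by the test DNS server, rather than editing /etc/hosts, is what let us see HAProxy log an IP change at runtime.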
The keys were removing libc from init-addr (we initially thought it only applied at startup, after which HAProxy would switch to the default resolvers) and adding resolvers default to the default-server line so that it is actually used (we thought a resolvers section named default would be used by default).
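Put together, the working setup looks roughly like this. It is a sketch assembled from the snippets in this thread, not a copy of our full config; check-related options are left out.

```haproxy
resolvers default
    parse-resolv-conf
    hold valid 10s

backend server.com
    option external-check
    external-check command /health-check.py
    # No libc in init-addr: the address can only come from the resolvers
    # section, so a broken resolver setup is visible at startup.
    # default-server must appear before the server lines it applies to.
    default-server resolvers default init-addr none
    server server.com server.com:443
```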
The issue with mixing libc resolution with resolvers is that you don’t notice when something is broken with the resolver configuration during startup. You only notice because changes are not picked up.
By removing libc you make sure that either resolvers are correctly configured, or nothing would work at all.