I want to configure my backend so that any server returning a 503 (or, more generally, a 5xx) error is taken offline until it passes the health check again. This should happen immediately, the first time the server returns such a code, and not only when the health check runs.
The reason behind this is that I have servers that return 503 during restart / maintenance. Because this 503 is served in under 1 ms, haproxy directs dozens of calls to such a server before the health check fails.
I don’t think this is possible, and it’s also a terrible idea.
Suppose an unimportant part of your application (say, a new HTTP endpoint) starts generating 503 errors.
Do you really want servers taken offline one after another, until your entire server farm is down, because of a simple, unimportant bug in one part of the application?
How you really solve the problem you are facing:
You make your application return failing codes to health checks for a grace period, so that haproxy's health checks have enough time to detect the change and in-flight requests can complete; only then do you actually perform intrusive changes on your server.
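How long that grace period needs to be depends on your check timing. As a rough sketch (the endpoint path, timings, and server address below are assumptions, not taken from the question):

```
backend app
    mode http
    # health checks probe a dedicated endpoint (path is hypothetical)
    option httpchk GET /health
    # with inter 2s and fall 3, haproxy can need up to ~6s of consecutive
    # failed checks to mark a server DOWN, so the application should keep
    # failing /health for at least that long before doing anything intrusive
    default-server inter 2s fall 3 rise 2
    server s1 192.0.2.10:8080 check
```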
Thanks for the comments. That would be a possibility if I had control over the source of the servers. These are legacy single-threaded servers, and haproxy is an "in between" solution that simulates multi-threading by load balancing between multiple single-threaded servers with maxconn 1.
I assume I will need to put more logic on the client side to handle these use cases.
Perhaps a more suitable alternative would be to configure haproxy to retry on a different server in certain cases, such as 503 errors. I suggest reading more about retry-on and option redispatch in the docs.
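A minimal sketch of what that could look like; the backend name, server addresses, and retry count are assumptions:

```
backend app
    mode http
    # retry a request up to 3 times when a server answers 503
    # or the connection fails (retry-on needs HAProxy 2.0+)
    retries 3
    retry-on 503 conn-failure
    # allow a retried request to be redispatched to a different server
    option redispatch 1
    server s1 192.0.2.10:8080 maxconn 1 check
    server s2 192.0.2.11:8080 maxconn 1 check
```

With maxconn 1 servers this fits the use case well: a 503 from one busy or restarting server is transparently retried against another instead of being returned to the client.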
If you can configure your servers to expose a health check endpoint, and have them return a 404 on that endpoint during restarts and maintenance, then a health check with that option will probably do what you want.
Note, however, that you probably want to wait at least inter ms between starting to return a 404 and actually taking the service down for maintenance, to make sure haproxy has had time to run a health check since the status changed.
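Assuming the option in question is http-check disable-on-404, a sketch (the endpoint path and server address are hypothetical):

```
backend app
    mode http
    option httpchk GET /health
    # a 404 from the health endpoint puts the server in the NOLB
    # (drain) state: no new connections, in-flight requests finish
    http-check disable-on-404
    # checks run every 2s, so keep returning 404 for at least that long
    # before actually stopping the service
    default-server inter 2s fall 2 rise 2
    server s1 192.0.2.10:8080 maxconn 1 check
```

This drains the server gracefully during maintenance instead of hard-failing it, which is gentler than waiting for full health check failure.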