Upon switching from 1.6 to 1.8.14 we’ve been made aware that one of our backends has been redirecting to our maintenance page, and the redirects correlate with config reloads.
We believe this is the portion of our haproxy config that we are falling into:
backend sslservice
acl NOT_ENOUGH_CAPACITY nbsrv(sslservice) le 0
redirect location {{ maintenance_url }} if NOT_ENOUGH_CAPACITY
We currently poll our autoscaling groups for any new/removed machines and update our haproxy config using a python script that runs every 4 minutes via a cron job. This is how we’ve been updating our haproxy config since before I joined.
As an immediate need, we are looking for the best way to stop these maintenance pages during reloads, so we tested out HAProxy Hitless Reloads, but we still seem to have no available backends on config reloads. I’m not 100% sure that we’ve configured it properly, but the following are excerpts from our configs.
Stats Socket:
stats socket /var/lib/haproxy/stats expose-fd listeners
Enabling master-worker:
chroot /var/lib/haproxy
pidfile /var/run/haproxy.pid
maxconn 1000000
user haproxy
group haproxy
daemon
master-worker
I assume this is simply caused by health-checks seeing the server as down initially. You can confirm that by taking a look at your logs.
Health-check configuration would be important to know here, but I assume that you will have to use the load-server-state-from-file feature, so that the previous health-check state is used between the reload and the first time health checks run.
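In case it helps, a rough sketch of how that feature is usually wired up (the state-file path below is just an example, and the dump command would need to run from whatever wrapper performs your reloads):
global
    # file that the server-state dump is persisted to between reloads
    server-state-file /var/lib/haproxy/server-state
defaults
    # on startup, seed each server's check state from the global state file
    load-server-state-from-file global
# before each reload, dump the current state over the stats socket, for example:
#   echo "show servers state" | socat stdio /var/lib/haproxy/stats > /var/lib/haproxy/server-state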
Not sure what changed between 1.6 and 1.8 that is causing the difference in behavior at this point. Logs and full configuration would be needed to analyze that.
option httpchk GET /healthcheck
http-check expect rstring (\*\*\*OK\*\*\*ENABLED=1)
default-server inter 30s fastinter 15s fall 3 rise 2
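# inter 30s: check interval while a server's state is settled
# fastinter 15s: check interval while a server is transitioning between states
# fall 3 / rise 2: three consecutive failed checks mark a server down, two successful checks bring it back up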
We are using load-server-state-from-file. There are no logs stating that there are no backends available when we do a reload. It will take me some time to strip out sensitive data.
All the backends are up before a reload. Some of our pools have ~20+ nodes, and it’d be hard for me to believe that we went from 50 to 0 in less than 15s. Would adding the stats socket listener be a better option than the state file?
I can work on stripping out sensitive data from our configs and logs.
Our 1.6 instances have seemingly been working without reload issues for a while now. We’ve only recently had to scale up our instances due to high CPU usage. We wanted to cut down on the number of health checks, so we decided to upgrade and take advantage of the multithreading. We initially went to a 36-core CPU with nbproc 1 and nbthread 36, but had issues, either with not leaving a CPU for the OS or with haproxy spinning up 36 threads per process during a reload, as we saw 3 processes running and the CPU at 100%. We dropped it down to 16 threads and have been stable as of late, but have had some issues with latency.
So the only changes we made were adding multithreading and then trying to enable seamless reloads using the expose-fd listeners, but we haven’t had much luck.
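For reference, the global-section layout for that threading setup looks roughly like this (the cpu-map line is illustrative and would need to match the actual core numbering on our boxes):
global
    # single process, 16 threads, each thread pinned to its own core
    nbproc 1
    nbthread 16
    cpu-map auto:1/1-16 0-15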
After some load testing it appears that 1.6 has no issues with reloads, but 1.8 definitely does. A bunch of calls get disconnected at the time of reload on 1.8, with no changes other than the version; we used the same config and did the same cpu mapping and nbproc definitions.
We started testing and it looks like this behavior is exhibited as far back as 1.6.14. We used 1.8.0, 1.7.11 and then the last 1.6 release, and they all exhibited the same behavior on reloads. Have you had any success in replicating this?
We were really hoping to use 1.8 to reduce our health checks and have a centralized stats setup, but this is causing inaccurate blips that lead to our front end throwing errors. The article HAProxy Hitless Reloads sounds like a great solution, but are there any requirements for it to properly function? i.e., a certain OS, systemd vs init.d, etc.
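For reference, these are the pieces we believe the seamless reload needs, with the reload invocation noted as a comment (the config path and the -x form are taken from the seamless-reloads write-up and are our assumptions, not something we have confirmed in our setup):
global
    master-worker
    # expose-fd listeners is what lets a new process pick up the bound sockets
    stats socket /var/lib/haproxy/stats expose-fd listeners
# with master-worker enabled, a reload is a SIGUSR2 to the master process
# (e.g. what "systemctl reload haproxy" does with the 1.8 systemd unit);
# without master-worker, the equivalent is starting a new process that fetches
# the listening FDs over the stats socket via -x before the old PIDs are asked
# to finish up via -sf:
#   haproxy -f /etc/haproxy/haproxy.cfg -p /var/run/haproxy.pid \
#           -x /var/lib/haproxy/stats -sf $(cat /var/run/haproxy.pid)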
diff --git a/include/proto/backend.h b/include/proto/backend.h
index 69ee31c..d82015b 100644
--- a/include/proto/backend.h
+++ b/include/proto/backend.h
@@ -47,9 +47,7 @@ int be_lastsession(const struct proxy *be);
/* Returns number of usable servers in backend */
static inline int be_usable_srv(struct proxy *be)
{
- if (be->state == PR_STSTOPPED)
- return 0;
- else if (be->srv_act)
+ if (be->srv_act)
return be->srv_act;
else if (be->lbprm.fbck)
return 1;
This needs more troubleshooting, but this should revert a logic change that was introduced in 1.6.11.
Also, I’m gonna need the complete configuration (not the hostnames and IP addresses of course, but the rest of the configuration nonetheless). I’m not sure why this change would impact nbsrv on reload; idle HTTP sessions are supposed to be closed immediately after the reload, so a backend shutting down should theoretically not have any impact.
We applied the patch and it now functions as expected and we no longer see any disconnects or 302s. This is awesome news! How long until we see this as part of an official patch?
My suggestion for this would be to use the stopping boolean to restrict the redirect to the maintenance page, avoiding the redirect in the old process while reloading.
So:
acl NOT_ENOUGH_CAPACITY nbsrv(sslservice) le 0
acl STOPPING stopping
redirect location {{ maintenance_url }} if NOT_ENOUGH_CAPACITY !STOPPING
or anonymous:
acl NOT_ENOUGH_CAPACITY nbsrv(sslservice) le 0
redirect location {{ maintenance_url }} if NOT_ENOUGH_CAPACITY !{ stopping }
When a request hits haproxy on the old process while it is stopping, but there really are no servers available in the backend, the redirect will not happen this way, though.
Thanks @lukastribus for the deeper dive. Using the stopping bool does fix our use case. We are still testing actual lack of capacity as well. I’ll update with our results.
Is this the best practice for our use case?
Also, just want to confirm: do you still need those global config settings?
No, I don’t need the config settings, as I can reproduce it. I’d like to hear what Willy thinks about this, but I’m pretty sure the stopping boolean is the right way to go here.
Hi Lukas, I’m terribly sorry for having missed your message. Given that servers are not checked during stopping, I think we’d rather continue to report the existing nbsrv_act as you did in your first patch, so that nbsrv reports the last known number of servers, which also matches what is visible on the stats page, and possibly what the visitor expects at this moment (i.e. finish his session on the same server and be done).
I also see quite some value in your proposed STOPPING ACL, because likely some people will want to use it in some of their rules, but I think it’s a separate point, a nice-to-have; your first patch is the real fix. Care to send me a patch?
I didn’t notice Marcin’s patch :-/ I see the point; I think the problem he tries to address is the case where the proxy was disabled by hand. I hadn’t thought about this case.
So it might be more complicated, because basically we have the same state for two different ones. In the meantime, your boolean option is probably the only solution that doesn’t risk breaking anything.
Yeah, one flag for 2 real states … that’s why we are in this situation.
I too think it’s best to use the stopping boolean in this situation. I will check in the next few days if there is something in the docs that can be improved to provide this hint.
@Gris13 can you confirm your use-case works correctly with the stopping flag?