HAProxy 1.8 Config reload issues

Upon switching from 1.6 to 1.8.14 we’ve been made aware that one of our backends has been redirecting to our maintenance page, and the redirects correlate with config reloads.

We believe this is the portion of our haproxy config that we are falling into:

backend sslservice
acl NOT_ENOUGH_CAPACITY nbsrv(sslservice) le 0
redirect location {{ maintenance_url }} if NOT_ENOUGH_CAPACITY

We currently poll our autoscaling groups for any new/removed machines and update our haproxy config using a python script that runs every 4 minutes via a cron job. This is the method by which we’ve been updating our haproxy config since before I joined.

As an immediate need we are looking for the best way to stop these maintenance pages during reloads, so we tested out HAProxy Hitless Reloads, but we still have the issue of what seems to be no available backends on config reloads. I’m not 100% sure that we’ve properly configured it, but the following are excerpts from our configs.
Stats Socket:

stats socket /var/lib/haproxy/stats expose-fd listeners
Enabling master-worker:
chroot /var/lib/haproxy
pidfile /var/run/haproxy.pid
maxconn 1000000
user haproxy
group haproxy
daemon
master-worker

Am I missing something here?

I assume this is simply caused by health-checks seeing the server as down initially. You can confirm that by taking a look at your logs.

Health-check configuration would be important to know here, but I assume that you will have to use the load-server-state-from-file feature, so that the previous health-check state is used during the window between the reload and the first completed health checks.
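
A minimal sketch of how that feature is usually wired up (paths are placeholders; the state has to be dumped from the running process via the stats socket, e.g. with "show servers state", right before each reload):

global
    # file the new process reads the previous server states from
    server-state-file /var/lib/haproxy/state/global

defaults
    # apply the saved state to all backends
    load-server-state-from-file global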

Not sure what changed between 1.6 and 1.8 that is causing the difference in behavior at this point. Logs and full configuration would be needed to analyze that.

Our health checks are all:

option      httpchk     GET    /healthcheck
http-check expect rstring (\*\*\*OK\*\*\*ENABLED=1)
default-server  inter 30s   fastinter 15s    fall 3  rise 2
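
For reference, a rough reading of what those timings imply (assuming no other per-server check settings override them):

# fall 3: a server is only marked DOWN after 3 consecutive failed checks,
#         i.e. at least ~30s after the first failure even at the faster
#         transition interval (fastinter 15s)
# rise 2: a DOWN server needs 2 consecutive successful checks to come back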

We are using load-server-state-from-file. There are no logs stating that there are no backends available when we do a reload. It will take me some time to strip out sensitive data.

All the backends are up before a reload. Some of our pools have ~20+ nodes and it’d be hard for me to believe that we went from 50 to 0 in less than 15s. Would adding the stats socket listener be a better option than the state file?

I can work on stripping out sensitive data from our configs and logs.

Are you positive that this worked fine in 1.6? Did you enable any new features in 1.8 that were not there in 1.6?

I will try to reproduce this; it sounds to me like nbsrv is confused initially…

Our 1.6 instances have seemingly been working for a while now without reload issues. We’ve only recently had to scale up our instances due to high CPU usage. We wanted to cut down on the number of health checks, so we decided to upgrade and take advantage of the multithreading. We initially went to a 36-core CPU with nbproc 1 and nbthread 36, but had issues with either not leaving a CPU for the OS or haproxy tying up all 36 threads during a reload; we saw 3 processes running and the CPU at 100%. We dropped it down to 16 threads and have been stable as of late, but have had some issues with latency.
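
For reference, a minimal sketch of the kind of 1.8 thread layout described above (thread and core counts are placeholders, and whether pinning helps with the reload-time CPU spike is a separate question):

global
    nbproc  1
    nbthread 16
    # pin the 16 threads of process 1 to dedicated cores,
    # leaving the remaining cores to the OS
    cpu-map auto:1/1-16 0-15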

So the only changes we made were adding multithreading and then trying to enable seamless reloads using the expose-fd listeners, but we haven’t had much luck.

After some load testing it appears that 1.6 has no issues with reloads, but 1.8 definitely does. We see a bunch of calls disconnected at the time of reload on 1.8, and this is with no change other than the version: we used the same config and did the same cpu mapping and nbproc definitions.

Nov  6 16:33:30 1-primary haproxy[18279]: xxx.xxx.xxx.xxx:26485 [06/Nov/2018:16:33:30.760] dummy.com_ssl~ api_ssl_v3/<NOSRV> 0/-1/-1/-1/1 302 137 - - LRNN 1/1/0/0/3 0/0 "GET /v3/ServerInfo HTTP/1.1"
Nov  6 16:33:30 1-primary haproxy[18272]: xxx.xxx.xxx.xxx:45887 [06/Nov/2018:16:33:30.874] dummy.com_ssl~ api_ssl/<NOSRV> 0/-1/-1/-1/0 302 137 - - LRNN 36/36/0/0/3 0/0 "POST /rest/thingymajig.aspx?op=complete&zoneid=id&zuid=id HTTP/1.1"
Nov  6 16:33:31 1-primary haproxy[18278]: xxx.xxx.xxx.xxx:63032 [06/Nov/2018:16:33:31.262] dummy.com_ssl~ api_ssl_v3/<NOSRV> 0/-1/-1/-1/0 302 137 - - LRNN 8/8/0/0/3 0/0 "GET /v3/ServerInfo HTTP/1.1"
Nov  6 16:33:31 1-primary haproxy[23468]: xxx.xxx.xxx.xxx:27220 [06/Nov/2018:16:33:31.364] dummy.com_http dummy.com_http/<NOSRV> 0/-1/-1/-1/0 302 145 - - LR-- 1/1/0/0/0 0/0 "GET /haproxy_status HTTP/1.1"
Nov  6 16:33:31 1-primary haproxy[18271]: xxx.xxx.xxx.xxx:36765 [06/Nov/2018:16:33:31.644] dummy.com_ssl~ api_ssl/<NOSRV> 0/-1/-1/-1/0 302 137 - - LRNN 177/177/0/0/3 0/0 "POST /rest/thingymajig.aspx?op=complete HTTP/1.1"
Nov  6 16:33:31 1-primary haproxy[18276]: xxx.xxx.xxx.xxx:14350 [06/Nov/2018:16:33:31.763] dummy.com_ssl~ api_ssl_v3/<NOSRV> 0/-1/-1/-1/0 302 137 - - LRNN 2/2/0/0/3 0/0 "GET /v3/ServerInfo HTTP/1.1"
Nov  6 16:33:32 1-primary haproxy[18271]: xxx.xxx.xxx.xxx:55263 [06/Nov/2018:16:33:32.161] dummy.com_ssl~ api_ssl/<NOSRV> 0/-1/-1/-1/0 302 137 - - LRNN 161/161/0/0/3 0/0 "POST /rest/thingymajig.aspx?op=complete HTTP/1.1"
Nov  6 16:33:32 1-primary haproxy[18271]: xxx.xxx.xxx.xxx:17557 [06/Nov/2018:16:33:32.380] dummy.com_ssl~ api_ssl/<NOSRV> 0/-1/-1/-1/0 302 137 - - LRNN 154/154/0/0/3 0/0 "POST /rest/thingymajig.aspx?op=complete HTTP/1.1"
Nov  6 16:33:32 1-primary haproxy[18271]: xxx.xxx.xxx.xxx:63088 [06/Nov/2018:16:33:32.755] dummy.com_ssl~ api_ssl_v3/<NOSRV> 0/-1/-1/-1/0 302 137 - - LRNN 147/147/0/0/3 0/0 "POST /v3/Items(id)/Folder?overwrite=True&passthrough=False HTTP/1.1"
Nov  6 16:33:33 1-primary haproxy[18271]: xxx.xxx.xxx.xxx:37502 [06/Nov/2018:16:33:33.377] dummy.com_ssl~ api_ssl/<NOSRV> 0/-1/-1/-1/0 302 137 - - LRNN 137/137/0/0/3 0/0 "POST /rest/thingymajig.aspx?op=complete&id=id&zuid=id HTTP/1.1"
Nov  6 16:33:34 1-primary haproxy[18271]: xxx.xxx.xxx.xxx:18932 [06/Nov/2018:16:33:34.755] dummy.com_ssl~ api_ssl_v3/<NOSRV> 0/-1/-1/-1/0 302 137 - - LRNN 100/100/0/0/3 0/0 "GET /v3/ServerInfo HTTP/1.1"
Nov  6 16:33:35 1-primary haproxy[18271]: xxx.xxx.xxx.xxx:14248 [06/Nov/2018:16:33:35.754] dummy.com_ssl~ api_ssl_v3/<NOSRV> 0/-1/-1/-1/1 302 137 - - LRNN 80/80/0/0/3 0/0 "GET /v3/ServerInfo HTTP/1.1"
Nov  6 16:33:36 1-primary haproxy[18271]: xxx.xxx.xxx.xxx:29571 [06/Nov/2018:16:33:36.755] dummy.com_ssl~ api_ssl_v3/<NOSRV> 0/-1/-1/-1/0 302 137 - - LRNN 68/68/0/0/3 0/0 "GET /v3/ServerInfo HTTP/1.1"
Nov  6 16:33:37 1-primary haproxy[18271]: xxx.xxx.xxx.xxx:56552 [06/Nov/2018:16:33:37.059] dummy.com_ssl~ api_ssl/<NOSRV> 0/-1/-1/-1/1 302 137 - - LRNN 60/60/0/0/3 0/0 "POST /rest/thingymajig.aspx?op=complete HTTP/1.1"

This matches up with our reload time.

Nov  6 16:33:29 1-primary haproxy[23467]: Proxy healthcheck started.
Nov  6 16:33:29 1-primary haproxy[23467]: Proxy dummy.com_http started.
Nov  6 16:33:29 1-primary haproxy[18271]: Stopping frontend healthcheck in 0 ms.

I will try to reproduce this.

How exactly are you reloading haproxy, and which script do you use (systemd unit file, old init.d script, etc.)? Please share details about that.

We use init.d scripts to handle reloads:

#!/bin/sh
#
# chkconfig: - 85 15
# description: HA-Proxy is a TCP/HTTP reverse proxy which is particularly suited \
#              for high availability environments.
# processname: haproxy
# config: /etc/haproxy/haproxy.cfg
# pidfile: /var/run/haproxy.pid

# Script Author: Simon Matter <simon.matter@invoca.ch>
# Version: 2004060600

# Source function library.
if [ -f /etc/init.d/functions ]; then
  . /etc/init.d/functions
elif [ -f /etc/rc.d/init.d/functions ] ; then
  . /etc/rc.d/init.d/functions
else
  exit 0
fi

# Source networking configuration.
. /etc/sysconfig/network

# Check that networking is up.
[ ${NETWORKING} = "no" ] && exit 0

# This is our service name
BASENAME=`basename $0`
if [ -L $0 ]; then
  BASENAME=`find $0 -name $BASENAME -printf %l`
  BASENAME=`basename $BASENAME`
fi

BIN={{ haproxy_bin_path }}/$BASENAME

CFG=/etc/$BASENAME/$BASENAME.cfg
[ -f $CFG ] || exit 1

PIDFILE=/var/run/$BASENAME.pid
LOCKFILE=/var/lock/subsys/$BASENAME

RETVAL=0

start() {
  quiet_check
  if [ $? -ne 0 ]; then
    echo "Errors found in configuration file, check it with '$BASENAME check'."
    return 1
  fi

  echo -n "Starting $BASENAME: "
  daemon $BIN -D -f $CFG -p $PIDFILE
  RETVAL=$?
  echo
  [ $RETVAL -eq 0 ] && touch $LOCKFILE
  return $RETVAL
}

stop() {
  echo -n "Shutting down $BASENAME: "
  killproc $BASENAME -USR1
  RETVAL=$?
  echo
  [ $RETVAL -eq 0 ] && rm -f $LOCKFILE
  [ $RETVAL -eq 0 ] && rm -f $PIDFILE
  return $RETVAL
}

restart() {
  quiet_check
  if [ $? -ne 0 ]; then
    echo "Errors found in configuration file, check it with '$BASENAME check'."
    return 1
  fi
  stop
  start
}

reload() {
  if ! [ -s $PIDFILE ]; then
    return 0
  fi

  quiet_check
  if [ $? -ne 0 ]; then
    echo "Errors found in configuration file, check it with '$BASENAME check'."
    return 1
  fi

  echo "show servers state" | socat stdio /var/lib/haproxy/stats > /var/lib/haproxy/state/global

  $BIN -D -f $CFG -p $PIDFILE -sf $(cat $PIDFILE)
}
check() {
  $BIN -c -q -V -f $CFG
}

quiet_check() {
  $BIN -c -q -f $CFG
}

rhstatus() {
  status $BASENAME
}

condrestart() {
  [ -e $LOCKFILE ] && restart || :
}

# See how we were called.
case "$1" in
  start)
    start
    ;;
  stop)
    stop
    ;;
  restart)
    restart
    ;;
  reload)
    reload
    ;;
  condrestart)
    condrestart
    ;;
  status)
    rhstatus
    ;;
  check)
    check
    ;;
  *)
    echo $"Usage: $BASENAME {start|stop|restart|reload|condrestart|status|check}"
    exit 1
esac

and we run reloads when our configs change with:

sudo /sbin/service haproxy reload
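
For what it’s worth, a sketch of what the reload() invocation above might look like if the expose-fd socket is actually meant to hand over the listening sockets (same variables and paths as in the script; untested here, and how it interacts with the master-worker setup is a separate question):

  # -x makes the new process fetch the listening sockets from the old one
  # over the stats socket (which is what "expose-fd listeners" allows)
  $BIN -D -f $CFG -p $PIDFILE -sf $(cat $PIDFILE) -x /var/lib/haproxy/stats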

We started testing and it looks like this behavior is exhibited as far back as 1.6.14. We used 1.8.0, 1.7.11, and then the last 1.6 release, and they all exhibited the same behavior on reloads. Have you had any success in replicating this?

We were really hoping to use 1.8 to reduce our health checks and have a centralized stats setup, but these inaccurate blips are causing our front end to throw errors. The article HAProxy Hitless Reloads sounds like a great solution, but are there any requirements for it to function properly? i.e., a certain OS, systemd vs. init.d, etc.

We stepped through each release and it looks to have been introduced in 1.6.11.

Great, can you try the following patch (on 1.8):

diff --git a/include/proto/backend.h b/include/proto/backend.h
index 69ee31c..d82015b 100644
--- a/include/proto/backend.h
+++ b/include/proto/backend.h
@@ -47,9 +47,7 @@ int be_lastsession(const struct proxy *be);
 /* Returns number of usable servers in backend */
 static inline int be_usable_srv(struct proxy *be)
 {
-        if (be->state == PR_STSTOPPED)
-                return 0;
-        else if (be->srv_act)
+        if (be->srv_act)
                 return be->srv_act;
         else if (be->lbprm.fbck)
                 return 1;

This needs more troubleshooting, but this should revert a logic change that was introduced in 1.6.11.

Also, I’m gonna need the complete configuration, not the hostname and the IP addresses of course, but the rest of the configuration nonetheless. I’m not sure why this change would impact nbsrv on reload; idle HTTP sessions are supposed to be closed immediately after the reload, so a backend shutting down should theoretically not have any impact.

We applied the patch and it now functions as expected and we no longer see any disconnects or 302s. This is awesome news! How long until we see this as part of an official patch?

This just reverts the behavior change, I still don’t know the root cause. I need to dig into this some more.

Can you provide your default/global settings, including timeout and keep-alive settings please?

I can reproduce it. This is about the open connections in the old process, not about those in the new one. Because the backend is marked as stopped, nbsrv returns 0 here since 1.6.11 - 57b87714 (“BUG/MINOR: backend: nbsrv() should return 0 if backend is disabled”).

While the impact you faced was certainly not predicted there, the change does fix a legitimate problem and makes nbsrv behavior consistent.

My suggestion for this would be to use the stopping boolean to restrict the redirect to the maintenance page, avoiding the redirect in the old process while reloading.

So:

acl NOT_ENOUGH_CAPACITY nbsrv(sslservice) le 0
acl STOPPING stopping
redirect location {{ maintenance_url }} if NOT_ENOUGH_CAPACITY ! STOPPING

or anonymous:

acl NOT_ENOUGH_CAPACITY nbsrv(sslservice) le 0
redirect location {{ maintenance_url }} if NOT_ENOUGH_CAPACITY ! { stopping }
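
Merged into the backend from the first post, that would look something like this:

backend sslservice
    acl NOT_ENOUGH_CAPACITY nbsrv(sslservice) le 0
    # skip the maintenance redirect while the (old) process is shutting down
    redirect location {{ maintenance_url }} if NOT_ENOUGH_CAPACITY ! { stopping }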

When a user hits haproxy on the old process with a request while the process is stopping, but there really are no servers available in the backend, a redirect will not happen this way, though.

@willy what do you think about this?

Thanks @lukastribus for the deeper dive. Using the stopping bool does fix our use case. We are still testing actual lack of capacity as well. I’ll update with our results.

Is this the best practice for our use case?
Also, I just want to confirm whether you still need those global config settings?

No, I don’t need the config settings, as I can reproduce it. I’d like to hear what Willy thinks about this, but I’m pretty sure the stopping boolean is the right way to go here.

Hi Lukas, I’m terribly sorry for having missed your message. Given that servers are not checked during stopping, I think we’d rather continue to report the existing nbsrv_act as you did in your first patch, so that nbsrv reports the last known number of servers, which also matches what is visible on the stats page, and possibly what the visitor expects at this moment (i.e. finish his session on the same server and be done).

I also see quite some value in your proposed STOPPING ACL, because likely some people will want to use it in some of their rules, but I think it’s a separate point, a nice-to-have; your first patch is the real fix. Care to send me a patch?

Thanks, and sorry again for the delay.

I didn’t notice Marcin’s patch :-/ I see the point, I think the problem he tries to address is the case where the proxy was disabled by hand. I hadn’t thought about this case :frowning:

So it might be more complicated, because basically we have the same state for two different ones. So probably, in the meantime, your boolean option is the only solution that doesn’t risk breaking anything.

Yeah, one flag for 2 real states … that’s why we are in this situation.

I too think it’s best to use the stopping boolean in this situation. I will check in the next few days if there is something in the docs that can be improved to provide this hint.

@Gris13 can you confirm your use-case works correctly with the stopping flag?

I just rediscovered that I already implemented this stopping sample-fetch 4 years ago; I thought we’d have to develop it right now :slight_smile:
