Never mark backend as failed?

Hi. We have a backend with a single server:

defaults
    retries 3
    timeout connect 5000
    timeout client 100000
    timeout server 100000

backend bk_foo
    mode tcp
    no option http-server-close
    log global
    option tcplog
    timeout server 1m
    timeout connect 5s
    server foo smtp.example.com:587 check

The problem is that if smtp.example.com becomes unreachable due to network problems, it is marked as down/failed by haproxy and never comes back up even when the server does, until haproxy is reloaded manually. Is there any way to never mark it as down? Thanks.
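
For what it’s worth, I understand that simply dropping the check keyword would stop haproxy from ever marking the server DOWN, since no health checks would run at all, but then a real outage would only surface as failed client connections:

backend bk_foo
    mode tcp
    log global
    option tcplog
    timeout server 1m
    timeout connect 5s
    # no "check" keyword: this server is never health-checked,
    # so it can never be marked DOWN by a failed check
    server foo smtp.example.com:587

I’d rather keep the health checks and just have the server recover properly.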

Share the output of haproxy -vv and enable logging. Then share the logs: you should see the server DOWN event and, usually, the corresponding UP event as well.
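
If logging isn’t already enabled, a minimal setup, assuming a syslog daemon listening on localhost UDP port 514, would be something like:

global
    # send events to the local syslog daemon, facility local0
    log 127.0.0.1:514 local0

defaults
    # inherit the global log target in every proxy
    log global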

No idea why this happens, though; I’ve never heard of such an issue.

Sorry, I can’t run those at the moment, but it’s haproxy 1.7.5 on FreeBSD 10.3. Please see this post for why this might be an issue: https://serverfault.com/questions/666600/haproxy-does-not-recover-after-failed-check?answertab=votes#tab-top

“And once a backend is marked as down it doesn’t go back up (this is not documented, I came to this conclusion based on my experience).”

Somehow no corresponding UP event was logged until haproxy was manually restarted a few hours later.

Apr 19 17:41:16 foo haproxy[76287]: Server bk_foo/foo is DOWN, reason: Layer4 connection problem, info: "Connection refused", check duration: 14ms. 0 active and 0 backup servers left. 0 sessions active, 0 requeued, 0 remaining in queue.
Apr 19 17:41:16 foo haproxy[76287]: backend bk_foo has no server available!

Apr 20 02:44:40 foo haproxy[21656]: Proxy foo started.
Apr 20 02:44:40 foo haproxy[21656]: Proxy bk_foo started.
Apr 20 02:44:41 foo haproxy[76287]: Stopping proxy foo in 0 ms.
Apr 20 02:44:41 foo haproxy[76287]: Stopping backend bk_foo in 0 ms.
Apr 20 02:44:41 foo haproxy[76287]: Proxy foo stopped (FE: 90305 conns, BE: 0 conns).
Apr 20 02:44:41 foo haproxy[76287]: Proxy bk_foo stopped (FE: 0 conns, BE: 90305 conns).

This was haproxy 1.7.5. I’ve just upgraded to 1.7.10 in case this was a bug or something.

Hi Rihad, I have the exact same issue. Did you manage to find a solution?

Thanks for your help!

It looks like the health check does not recover in those cases.

Capturing the health check traffic, capturing the syscalls (via strace -tt) and, very importantly, providing the output of haproxy -vv are required to troubleshoot further.
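
For example, something along these lines; the PID, host, port and file names are placeholders to adapt:

# capture the health check traffic to the backend server
tcpdump -i any -w healthchecks.pcap host smtp.example.com and port 587

# attach to the running haproxy process and record timestamped syscalls
strace -tt -f -p <haproxy_pid> -o haproxy.strace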

Thanks for the answer. Here is the output of haproxy -vv:

HA-Proxy version 1.8.4-1deb90d 2018/02/08
Copyright 2000-2018 Willy Tarreau willy@haproxy.org

Build options :
TARGET = linux2628
CPU = generic
CC = gcc
CFLAGS = -O2 -g -fno-strict-aliasing -Wdeclaration-after-statement -fwrapv -Wno-unused-label
OPTIONS = USE_LIBCRYPT=1 USE_ZLIB=1 USE_OPENSSL=1 USE_PCRE=1

Default settings :
maxconn = 2000, bufsize = 16384, maxrewrite = 1024, maxpollevents = 200

Built with OpenSSL version : OpenSSL 1.0.1e-fips 11 Feb 2013
Running on OpenSSL version : OpenSSL 1.0.1e-fips 11 Feb 2013
OpenSSL library supports TLS extensions : yes
OpenSSL library supports SNI : yes
OpenSSL library supports : SSLv3 TLSv1.0 TLSv1.1 TLSv1.2
Built with transparent proxy support using: IP_TRANSPARENT IPV6_TRANSPARENT IP_FREEBIND
Encrypted password support via crypt(3): yes
Built with multi-threading support.
Built with PCRE version : 8.32 2012-11-30
Running on PCRE version : 8.32 2012-11-30
PCRE library supports JIT : no (USE_PCRE_JIT not set)
Built with zlib version : 1.2.7
Running on zlib version : 1.2.7
Compression algorithms supported : identity("identity"), deflate("deflate"), raw-deflate("deflate"), gzip("gzip")
Built with network namespace support.

Available polling systems :
epoll : pref=300, test result OK
poll : pref=200, test result OK
select : pref=150, test result OK
Total: 3 (3 usable), will use epoll.

Available filters :
[SPOE] spoe
[COMP] compression
[TRACE] trace

1.8.4 is pretty old at this point and has a large number of subsequently fixed bugs, but I don’t know whether that is the issue here.

Try upgrading to a recent stable version of haproxy. If that doesn’t help, you will have to provide the additional information requested earlier (a capture of the health check traffic and of the syscalls via strace -tt, as shown above).

Thank you. The challenge is that we haven’t managed to reproduce the issue; it happens for no apparent reason after a few days of operation. I will provide more details when I have them.