Balance uri / consistent hashing / redispatch 3, not redispatching?

Hi!

We have an issue where requests don't seem to be rebalanced to another host when we do a graceful shutdown. We have "retries 20" and "option redispatch 3" combined with "balance uri" and "hash-type consistent djb2", and when we close the listening socket of the service, haproxy seems to spend all 20 retries on the same host and then return a 503, even though it should retry on another host at the third reconnect attempt because of the redispatch.

The wanted behaviour is: if the connection is refused, pick another server. It doesn't matter if the choice is random, just pick some other server that works.

A couple of log lines showing this behaviour follow after the config.

Config:

global
log /dev/log len 65535 local0 info alert
log /dev/log len 65535 local1 notice alert
user haproxy
group haproxy
nbproc 1
nbthread 1
maxconn 100000
hard-stop-after 600s # After 10 minutes (clean soft-stop on reloads for mostly tcp connections)
stats socket /var/run/haproxy.sock mode 660 level admin
# server-state-file /var/lib/haproxy/server-state
stats timeout 2m # Wait up to 2 minutes for input
master-worker # Launches one master process and a number of additional worker processes

defaults
log global
mode http
option httplog
timeout connect 3s
timeout client 30s
timeout server 30s
timeout http-request 30s
timeout http-keep-alive 60s
timeout queue 120s
timeout check 10s
retries 20 # Max retry attempts on a single server during connect failures
option redispatch 3 # Allow the redispatch to another server on every Xth retry
option forwardfor # Forward request headers from the original client to the backend
# load-server-state-from-file global
default-server init-addr last,none fastinter 1s rise 2 downinter 1s fall 2 on-error fastinter # Skip DNS resolution on startup (lazy resolution) and aggressive health checking
no option http-server-close # keep backend connections alive
option tcp-smart-connect
option tcp-smart-accept
option splice-auto
errorfile 400 /etc/haproxy/errors/400.http
errorfile 403 /etc/haproxy/errors/403.http
errorfile 408 /etc/haproxy/errors/408.http
errorfile 500 /etc/haproxy/errors/500.http
errorfile 502 /etc/haproxy/errors/502.http
errorfile 503 /etc/haproxy/errors/503.http
errorfile 504 /etc/haproxy/errors/504.http

frontend http-in
bind *:80
log-tag haproxy.requests
maxconn 100000
capture request header User-Agent len 30
capture request header X-Request-ID len 36
capture request header Host len 32
log-format "{"message_type":"HTTP","request_time":"%t","host":"%H","protocol":"http","http_status":%ST,"user_agent":%{+Q}[capture.req.hdr(0)],"unique_id":%{+Q}[capture.req.hdr(1)],"headers":"%hr","endpoint":"%HP","backend":"%b","backend_name":%{+Q}[capture.req.hdr(2)],"http_method":"%HM","upstream_response_time":%Tr,"upstream_connect_time":%Tc,"bytes_read":%B,"sconn":"%sc","bconn":"%bc","fconn":"%fc","upstream_addr":"%si","upstream_port":"%sp","server_name":"%s","source_addr":"%bi","source_port":"%sp","retries":"%rc","bytes_uploaded":%U,"session_duration":%Tt,"termination_state":"%ts","http_query_params":"%HQ","accept_time":%Th,"idle_time":%Ti,"client_time":%TR,"wait_time":%Tw,"download_time":%Td,"active_time":%Ta}"

use_backend configserver if { hdr(Host) -i configserver }

backend configserver
mode http
option allbackups
balance uri
hash-type consistent djb2
hash-balance-factor 150
server configserver-eu-west-1a-1 10.14.66.188:17914 maxconn 200 check backup
server configserver-eu-west-1a-2 10.14.66.188:17978 maxconn 200 check backup
server configserver-eu-west-1a-3 10.14.66.188:17987 maxconn 200 check backup
server configserver-eu-west-1a-4 10.14.75.245:17961 maxconn 200 check backup
server configserver-eu-west-1a-5 10.14.75.245:18000 maxconn 200 check backup
server configserver-eu-west-1b-6 10.14.80.211:16616 maxconn 200 check
server configserver-eu-west-1b-7 10.14.80.211:16625 maxconn 200 check
server configserver-eu-west-1b-8 10.14.92.90:16854 maxconn 200 check
server configserver-eu-west-1b-9 10.14.92.90:16859 maxconn 200 check

Logs:

message_type:HTTP backend:configserver request_time:21/Dec/2018:03:00:52.624 host:i-04785de9a52f8c57f protocol:http http_status:503 user_agent: unique_id: headers:{||configserver} endpoint:/path1 backend_name:configserver http_method:GET upstream_response_time:-1 upstream_connect_time:-1 bytes_read:213 sconn:0 bconn:0 fconn:1 upstream_addr:10.14.80.211 upstream_port:16636 server_name:configserver-eu-west-1b-9 source_addr:10.14.80.211 source_port:16636 retries:20 bytes_uploaded:98 session_duration:26,951 termination_state:SC http_query_params: accept_time:0 idle_time:6,924 client_time:0 wait_time:18,025 download_time:-1 active_time:20,027 environment_type:prod local_ip:10.14.80.211 cluster:media system_timestamp:December 21st 2018, 04:01:19.000 tags.service:configserver tags.host:i-04785de9a52f8c57f tags.cluster:media tags.local_ip:10.14.80.211 logcount:1 @timestamp:December 21st 2018, 04:01:19.574 _id:zhq1zmcBsOLL-aj_9W0r _type:fluentd _index:haproxy-2018.12.21 _score: -

message_type:HTTP backend:configserver request_time:21/Dec/2018:03:00:50.562 host:i-02d86a4420a5ebf1f protocol:http http_status:503 user_agent: unique_id: headers:{||configserver} endpoint:/path2 backend_name:configserver http_method:GET upstream_response_time:-1 upstream_connect_time:-1 bytes_read:213 sconn:0 bconn:0 fconn:1 upstream_addr:10.14.92.90 upstream_port:16867 server_name:configserver-eu-west-1b-12 source_addr:10.14.92.90 source_port:16867 retries:20 bytes_uploaded:97 session_duration:26,027 termination_state:SC http_query_params: accept_time:0 idle_time:6,001 client_time:0 wait_time:18,023 download_time:-1 active_time:20,026 environment_type:prod local_ip:10.14.92.90 cluster:media system_timestamp:December 21st 2018, 04:01:16.000 tags.service:configserver tags.host:i-02d86a4420a5ebf1f tags.cluster:media tags.local_ip:10.14.92.90 logcount:1 @timestamp:December 21st 2018, 04:01:16.590 _id:shW1zmcBiH4YVdlV0fl7 _type:fluentd _index:haproxy-2018.12.21 _score: -

message_type:HTTP backend:configserver request_time:21/Dec/2018:02:56:55.415 host:i-02d86a4420a5ebf1f protocol:http http_status:503 user_agent: unique_id: headers:{||configserver} endpoint:/path3 backend_name:configserver http_method:GET upstream_response_time:-1 upstream_connect_time:-1 bytes_read:213 sconn:0 bconn:0 fconn:1 upstream_addr:10.14.92.90 upstream_port:16831 server_name:configserver-eu-west-1b-7 source_addr:10.14.92.90 source_port:16831 retries:20 bytes_uploaded:95 session_duration:23,147 termination_state:SC http_query_params: accept_time:0 idle_time:3,117 client_time:0 wait_time:18,026 download_time:-1 active_time:20,030 environment_type:prod local_ip:10.14.92.90 cluster:media system_timestamp:December 21st 2018, 03:57:18.000 tags.service:configserver tags.host:i-02d86a4420a5ebf1f tags.cluster:media tags.local_ip:10.14.92.90 logcount:1 @timestamp:December 21st 2018, 03:57:18.603 _id:KhmyzmcBsOLL-aj_TpV- _type:fluentd _index:haproxy-2018.12.21 _score: -

message_type:HTTP backend:configserver request_time:21/Dec/2018:02:56:58.225 host:i-02d86a4420a5ebf1f protocol:http http_status:503 user_agent: unique_id: headers:{||configserver} endpoint:/path4 backend_name:configserver http_method:GET upstream_response_time:-1 upstream_connect_time:-1 bytes_read:213 sconn:0 bconn:0 fconn:1 upstream_addr:10.14.92.90 upstream_port:16831 server_name:configserver-eu-west-1b-8 source_addr:10.14.92.90 source_port:16831 retries:20 bytes_uploaded:99 session_duration:20,041 termination_state:SC http_query_params: accept_time:0 idle_time:1 client_time:0 wait_time:18,038 download_time:-1 active_time:20,040 environment_type:prod local_ip:10.14.92.90 cluster:media system_timestamp:December 21st 2018, 03:57:18.000 tags.service:configserver tags.host:i-02d86a4420a5ebf1f tags.cluster:media tags.local_ip:10.14.92.90 logcount:1 @timestamp:December 21st 2018, 03:57:18.303 _id:2xmyzmcBsOLL-aj_TpR- _type:fluentd _index:haproxy-2018.12.21 _score: -

Which release is this and can you provide the output of haproxy -vv?

Hi!

It's 1.8.14-1ppa1~xenial from ppa:vbernat/haproxy-1.8 for Ubuntu Xenial.

haproxy -vv
HA-Proxy version 1.8.14-1ppa1~xenial 2018/09/23
Copyright 2000-2018 Willy Tarreau willy@haproxy.org

Build options :
TARGET = linux2628
CPU = generic
CC = gcc
CFLAGS = -g -O2 -fPIE -fstack-protector-strong -Wformat -Werror=format-security -Wdate-time -D_FORTIFY_SOURCE=2
OPTIONS = USE_GETADDRINFO=1 USE_ZLIB=1 USE_REGPARM=1 USE_OPENSSL=1 USE_LUA=1 USE_SYSTEMD=1 USE_PCRE=1 USE_PCRE_JIT=1 USE_NS=1

Default settings :
maxconn = 2000, bufsize = 16384, maxrewrite = 1024, maxpollevents = 200

Built with OpenSSL version : OpenSSL 1.0.2g 1 Mar 2016
Running on OpenSSL version : OpenSSL 1.0.2g 1 Mar 2016
OpenSSL library supports TLS extensions : yes
OpenSSL library supports SNI : yes
OpenSSL library supports : TLSv1.0 TLSv1.1 TLSv1.2
Built with Lua version : Lua 5.3.1
Built with transparent proxy support using: IP_TRANSPARENT IPV6_TRANSPARENT IP_FREEBIND
Encrypted password support via crypt(3): yes
Built with multi-threading support.
Built with PCRE version : 8.38 2015-11-23
Running on PCRE version : 8.38 2015-11-23
PCRE library supports JIT : yes
Built with zlib version : 1.2.8
Running on zlib version : 1.2.8
Compression algorithms supported : identity("identity"), deflate("deflate"), raw-deflate("deflate"), gzip("gzip")
Built with network namespace support.

Available polling systems :
epoll : pref=300, test result OK
poll : pref=200, test result OK
select : pref=150, test result OK
Total: 3 (3 usable), will use epoll.

Available filters :
[SPOE] spoe
[COMP] compression
[TRACE] trace

The currently implemented behavior is that a redispatch to a backup server only occurs if health checking considers all non-backup servers dead.

I don’t know how you simulate the failure, but I’d assume that’s the problem here.

For more information see this thread from 2008:

http://www.formilux.org/archives/haproxy/0812/1571.html

Can you elaborate on what your use-case is? Maybe there is an alternative way to reach your objective.

Redispatch should retry on another healthy non-backup server before even considering backups, right?

I think the problem here is that the consistent hashing locks the request plus all of its retries to the same backend, while we'd prefer it to just pick any other backend when there are connect errors.

The failure is easily simulated by doing a "graceful shutdown" of a single backend server (we do this very often, since we scale the number of backends up and down depending on load).
Whenever a backend service is terminating (we're scaling down), it first shuts down its listening socket (i.e. it stops accepting new connections) and then finishes processing the requests it has in flight before finally terminating completely.

The use-case is a service that caches content. We want to do consistent hashing over these backend instances, but at the same time the backend instances are autoscaled, so they may come and go at any time. If a certain backend is down for some reason, we just want haproxy to pick another backend instead of the current, very simple behaviour of hammering the same (non-existent) backend with a new connect request every second until all retries have been consumed, and then returning a 503.

Ideally haproxy should do consistent hashing, and if a connection is refused it should pick another backend and retry the request to the new backend asap.
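Conceptually, once a refused connection can actually be redispatched under hash-based balancing, we would want to run something like the sketch below (hypothetical numbers, only to illustrate the intent: fail over after a couple of quick connect attempts instead of after 20):

defaults
    timeout connect 3s
    retries 3               # give up on the hashed server quickly
    option redispatch 1     # allow a redispatch to another server on every retry

backend configserver
    balance uri
    hash-type consistent djb2
    # same server lines as in the config above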

Yes, absolutely.

So what you are saying is that you shut down just a single server, and there are a lot of non-backup servers working just fine, but redispatch doesn't work anyway?

I didn’t see that in my repro, but I will check again with your configuration.

I've seen in your configuration that you have "backend servers" with the same IP on different ports. Can you clarify whether you shut down an entire IP address (and thus multiple servers, from haproxy's point of view) or just a single service (one backend server, from haproxy's point of view)?

You are right, redispatch doesn’t do anything with hash-based load-balancing algorithms, like balance source or balance uri.

Could you try the following patch:

diff --git a/src/backend.c b/src/backend.c
index b3fd6c6..7b0c933 100644
--- a/src/backend.c
+++ b/src/backend.c
@@ -614,7 +614,7 @@ int assign_server(struct stream *s)
 
 		case BE_LB_LKUP_CHTREE:
 		case BE_LB_LKUP_MAP:
-			if ((s->be->lbprm.algo & BE_LB_KIND) == BE_LB_KIND_RR) {
+			if ((s->be->lbprm.algo & BE_LB_KIND) == BE_LB_KIND_RR || (s->be->lbprm.algo & BE_LB_KIND) == BE_LB_KIND_HI) {
 				if (s->be->lbprm.algo & BE_LB_LKUP_CHTREE)
 					srv = chash_get_next_server(s->be, prev_srv);
 				else

I've seen in your configuration that you have "backend servers" with the same IP on different ports. Can you clarify whether you shut down an entire IP address (and thus multiple servers, from haproxy's point of view) or just a single service (one backend server, from haproxy's point of view)?

Yes, the container scheduler may put multiple instances of the cache on the same server. When scaling, we just bring down one instance of a container at a time (i.e. we terminate the haproxy backend representing a single port on a host at a time).
If a physical host dies, we may lose multiple instances of the cache at the same time, but then we expect some kind of error anyway, since the in-flight requests will also die. The issue we're focusing on now is just to avoid the errors we see when we scale our environment up/down to handle the daily load fluctuations.

You are right, redispatch doesn't do anything with hash-based load-balancing algorithms, like balance source or balance uri.

Great! Then my "hunch" wasn't wrong, and I'm not going crazy. :slight_smile:
I’ll try to patch our build the upcoming days and report back how it works.

Thank you so much!

Please try this patch instead, my patch is not correct:

https://pastebin.com/e56isUTa

Hi!

I've tried this patch: I pulled the latest 1.9 source package from vbernat's PPA, added the patch, and pushed it to my own PPA: https://launchpad.net/~cetex/+archive/ubuntu/haproxy

It seemed alright in our devel and test environments, but when I tried to roll it out in production we hit major issues and got segfaults everywhere:

kernel: [ 121.702750] haproxy[8328]: segfault at 20000000008 ip 0000020000000008 sp 00007ffe91c49608 error 14 in haproxy[55b1f3b4f000+1d6000]
kernel: [ 123.023240] haproxy[8367]: segfault at 0 ip (null) sp 00007fff83f9cda8 error 14 in haproxy[56184e1b5000+1d6000]
systemd[1]: haproxy.service: Main process exited, code=exited, status=139

I haven't seen, or been able to recreate, this in our test or devel environments, either before or after I tried it in production.

Any ideas where it might be failing?

A patch to fix this issue was already committed to master.

I don't recommend using 1.9.0 at this time; the crash is probably related to an issue that has just been fixed, but it will only be backported to the 1.9 tree next week (when 1.9.1 is released).

If you want to test this patch, I suggest you stick with 1.8 for now, or wait for the releases next week (which will also contain the fix for the consistent hash redispatch issue).

1.8.17 and 1.9.1 have been released with the fix for this bug.

Alright. I tried applying it to 1.8.16 (from vbernat/haproxy-1.8 PPA) and got some rejects:
https://pastebin.com/8i0G5hiH

This was the reason I moved to 1.9 in the last test: I wanted to do the same as you seemed to be doing, and the patch only applied cleanly on 1.9. :slight_smile:

But I've now applied it to 1.8.16, fixed a few other things not covered by the patch so it matches 1.8.16, made a build, and this time I was a bit more careful when testing it in our environment before rolling it out fully in prod.

I've had it running on a subset of the prod environment for 1.5 days and finally deployed it fully in prod this morning. No more 503 errors as far as I can see, although it's still too early to tell whether this solves the issue completely, since most of the scale-out / scale-in events happen during the evening (in around 7-13 hours).

I will report back when we have more data.

Please use an unpatched 1.8.17 though. The patch has been revised multiple times since my initial proposal and ended up being completely different from what I first posted.

Hi!

I will try out an unpatched 1.8.17 after we've verified that this is better than before. But it seems like the patch you posted above (the one on pastebin) is the same as the one included in 1.8.17.

https://pastebin.com/e56isUTa - this is the patch I've applied on top of 1.8.16 and rolled out.
http://git.haproxy.org/?p=haproxy-1.8.git;a=commit;h=5f768a2eab35e7ac16f49cd2c0b495e3daae2e81 - the pastebin patch looks like the same patch as this one, included in 1.8.17.

The patch seems to have a positive effect: we're seeing fewer HTTP 503s, although there are still some 503s where retries hit the max limit of 20 when there's more scaling activity going on.

The current theory is that sometimes a few (3-4) backends out of the ~20-30 are terminated almost simultaneously, for different reasons (for example, scaling down the number of containers while also shuffling some of them around to get a more even CPU load).

I also think this patch only supports switching servers once?
For example: if server A is the server to use according to the consistent hashing and it refuses new connections (it's down), this patch will make haproxy pick server B instead; but if B is also down, it won't try server C, it will instead try A (or B) again?

There is a slight difference in src/lb_chash.c:

-       while (p->lbprm.chash.balance_factor && !chash_server_is_eligible(nsrv)) {
+       while (nsrv == avoid || (p->lbprm.chash.balance_factor && !chash_server_is_eligible(nsrv))) {

But this won’t make a difference unless you remove hash-balance-factor from the configuration.
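To illustrate, a sketch of your backend with that directive dropped (untested, only to show which line I mean):

backend configserver
    mode http
    option allbackups
    balance uri
    hash-type consistent djb2
    # hash-balance-factor 150 removed here; per the above, the lb_chash.c change only matters without it
    server configserver-eu-west-1b-6 10.14.80.211:16616 maxconn 200 check
    # ... remaining server lines unchanged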

I'm just saying that ultimately you should switch to an official release, that's all.

I don’t think redispatch will try a third server, no.

Alright. Thanks for that.

We’ve been running this custom patch for some time now and it’s working as intended. We also upgraded to mainline 1.8.17 a few days ago from vbernat’s PPA and have had no issues.

Thanks a lot for the help!