Config reload with dynamic service discovery via DNS

For me it doesn’t work on 1.9 either: we get 503s as soon as the master notifies the workers to respawn… If I remove the srv_prepare_for_resolution section, then it seems to be fine…

Hi Francis,

What do you mean by removing srv_prepare_for_resolution?

Can you tell me exactly what you did, and share the state file, the logs, and the output of HAProxy in debug mode?

I mean, the code works on my laptop with consul as a DNS server.

Looking forward to fixing it!

Baptiste

Note: I am using 1.2.0-dev (I noticed that consul 1.2 seems to have IPv6 errors…)

This is what I actually removed from server.c:

/*
// prepare DNS resolution for this server (but hasn't this already been done by the server-template function?)
res = srv_prepare_for_resolution(srv, fqdn);
if (res == -1) {
	ha_alert("could not allocate memory for DNS REsolution for server … '%s'\n", srv->id);
	chunk_appendf(msg, ", can't allocate memory for DNS resolution for server '%s'", srv->id);
	HA_SPIN_UNLOCK(SERVER_LOCK, &srv->lock);
	goto out;
}
*/

I also had to make some modifications in src/proxy.c:

https://github.com/ACenterA/haproxy/commit/1ec245208976366960ff62d25000985801b93e46#diff-70645453d998e55219270ded2f5b1b25
https://github.com/ACenterA/haproxy/commit/d99f3ee0644ad827f5fe9d10067223e62839bd2f

I know the state file changes might have some other impacts, but that is the only way I could get everything “working”.

In short, I launch “ab -c 10 -n 100 https://myhostname” and force a reload by sending SIGUSR2 (kill -SIGUSR2) to HAProxy, which restarts the workers. Without these fixes on my end, I then get 503s…

With the fixes I implemented, the service is stable upon reload. I don’t think they are the right fixes, though.

I’m using consul 1.1.0, but I don’t think the problem is related to it.

I don’t really understand the changes you made in proxy.c. Could you show me the final version of the file?
Also, could you show me an output of the state file?

There is an easier way to test if this all works.

Start HAProxy, wait a bit (1 minute), save the server state, then stop HAProxy.

Then start HAProxy in debug mode (comment the master-worker statement in your config file).

Here is an example with my configuration:

./haproxy -d -db -f ./srv-records_server-state.cfg

Note: setting global.maxconn to 2000.
Available polling systems :
  epoll : pref=300, test result OK
  poll : pref=200, test result OK
  select : pref=150, test result FAILED
Total: 3 (2 usable), will use epoll.

Available filters :
  [SPOE] spoe
  [COMP] compression
  [TRACE] trace
Using epoll() as the polling mechanism.
[WARNING] 217/164140 (22976) : Server www/srv5 is DOWN, changed from server-state after a reload. 9 active and 0 backup servers left. 0 sessions active, 0 requeued, 0 remaining in queue.
[WARNING] 217/164140 (22976) : Server www/srv6 is DOWN, changed from server-state after a reload. 8 active and 0 backup servers left. 0 sessions active, 0 requeued, 0 remaining in queue.
[WARNING] 217/164140 (22976) : Server www/srv7 is DOWN, changed from server-state after a reload. 7 active and 0 backup servers left. 0 sessions active, 0 requeued, 0 remaining in queue.
[WARNING] 217/164140 (22976) : Server www/srv8 is DOWN, changed from server-state after a reload. 6 active and 0 backup servers left. 0 sessions active, 0 requeued, 0 remaining in queue.
[WARNING] 217/164140 (22976) : Server www/srv9 is DOWN, changed from server-state after a reload. 5 active and 0 backup servers left. 0 sessions active, 0 requeued, 0 remaining in queue.
[WARNING] 217/164140 (22976) : Server www/srv10 is DOWN, changed from server-state after a reload. 4 active and 0 backup servers left. 0 sessions active, 0 requeued, 0 remaining in queue.

==> servers 1 to 4 have not been configured, because their state was fully loaded by the state file below:

1
# be_id be_name srv_id srv_name srv_addr srv_op_state srv_admin_state srv_uweight srv_iweight srv_time_since_last_change srv_check_status srv_check_result srv_check_health srv_check_state srv_agent_state bk_f_forced_id srv_f_forced_id srv_fqdn srv_port srvrecord
3 www 1 srv1 192.168.0.1 0 0 2 1 64 7 0 0 7 0 0 0 A1.tld 80 _http._tcp.be1.tld
3 www 2 srv2 192.168.0.4 0 0 2 1 63 7 2 0 6 0 0 0 A4.tld 80 _http._tcp.be1.tld
3 www 3 srv3 192.168.0.2 0 0 2 1 63 7 2 0 6 0 0 0 A2.tld 80 _http._tcp.be1.tld
3 www 4 srv4 192.168.0.3 0 0 2 1 63 7 2 0 6 0 0 0 A3.tld 80 _http._tcp.be1.tld
3 www 5 srv5 - 0 0 1 1 63 5 2 0 6 0 0 0 - 0 _http._tcp.be1.tld
3 www 6 srv6 - 0 0 1 1 63 5 2 0 6 0 0 0 - 0 _http._tcp.be1.tld
3 www 7 srv7 - 0 0 1 1 62 5 2 0 6 0 0 0 - 0 _http._tcp.be1.tld
3 www 8 srv8 - 0 0 1 1 62 5 2 0 6 0 0 0 - 0 _http._tcp.be1.tld
3 www 9 srv9 - 0 0 1 1 62 5 2 0 6 0 0 0 - 0 _http._tcp.be1.tld
3 www 10 srv10 - 0 0 1 1 62 5 2 0 6 0 0 0 - 0 _http._tcp.be1.tld

I can try to send you a patch with a lot of verbose messages, but it would be easier if I could get access to one of your boxes where this code is installed.
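
When reading a dump like the one above, it helps to remember that each state-file line carries 20 space-separated fields in the order given by the header, and that slots 5 to 10 have "-" as both srv_addr and srv_fqdn, meaning no address was restored for them. The following is only a sketch of splitting such a line; parse_state_line and struct state_slot are made-up names for illustration and are not HAProxy functions.

```c
#include <stdio.h>
#include <string.h>

#define STATE_FIELDS 20 /* number of columns in the header shown above */

/* Hypothetical container for the few columns of interest here. */
struct state_slot {
    char srv_name[64];
    char srv_addr[64];
    char srv_fqdn[256];
    char srvrecord[256];
};

/* Splits one state-file line on whitespace.
 * Returns 1 if the slot has an address, 0 if srv_addr is "-",
 * and -1 if the line does not have exactly STATE_FIELDS fields. */
int parse_state_line(const char *line, struct state_slot *out)
{
    char buf[1024];
    char *fields[STATE_FIELDS];
    char *tok;
    int n = 0;

    if (strlen(line) >= sizeof(buf))
        return -1;
    strcpy(buf, line);

    for (tok = strtok(buf, " \t\r\n"); tok; tok = strtok(NULL, " \t\r\n")) {
        if (n == STATE_FIELDS)
            return -1; /* too many fields, e.g. the "# be_id ..." header */
        fields[n++] = tok;
    }
    if (n != STATE_FIELDS)
        return -1;

    /* Column positions follow the header: 4th, 5th, 18th and 20th field. */
    snprintf(out->srv_name, sizeof(out->srv_name), "%s", fields[3]);
    snprintf(out->srv_addr, sizeof(out->srv_addr), "%s", fields[4]);
    snprintf(out->srv_fqdn, sizeof(out->srv_fqdn), "%s", fields[17]);
    snprintf(out->srvrecord, sizeof(out->srvrecord), "%s", fields[19]);

    return strcmp(out->srv_addr, "-") != 0;
}
```

Feeding it the srv1 line gives 1 with srv_fqdn set to A1.tld, while the srv5 line gives 0; both keep the same srvrecord value.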

My HAProxy state looks like this for the Discovery

112 defaultback_failsaife 1 varnish1 10.100.20.78 2 0 1 1 203910 15 3 4 6 0 0 0 ip-10-100-20-78.node.aws-us-east-1.consul 4294934537 _tcp_.varnish.service.consul

Indeed, the proxy.c changes were mostly to investigate the DNS issue… (That 4294934537 is actually 32777 on my server… I guess it’s an unsigned value.)

Note: we can check in two weeks, if that’s OK, whether I can arrange something to get you access to one of my VMs. I believe I sent you an email on your Gmail a few weeks back…
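
The port value can be checked by hand. Assuming the port was at some point treated as a signed 16-bit number and then widened to 32 bits (a guess that fits the numbers, not something confirmed in the code), 32777 becomes -32759, which reads as 4294934537 when printed unsigned. A small sketch of that arithmetic follows; all three helper names are hypothetical.

```c
#include <stdint.h>

/* Keep only the low 16 bits of the value found in the state file. */
uint16_t port_from_state_value(uint32_t raw)
{
    return (uint16_t)(raw & 0xFFFFu);
}

/* Reproduce the wrapped value: sign-extend a 16-bit port to 32 bits.
 * Ports below 32768 are unchanged; higher ones get the top 16 bits set. */
uint32_t sign_extend_port(uint16_t port)
{
    return (port & 0x8000u) ? (0xFFFF0000u | port) : port;
}

/* The same port read as a signed 16-bit number. */
int32_t signed16_view(uint16_t port)
{
    return (port & 0x8000u) ? (int32_t)port - 65536 : (int32_t)port;
}
```

With these, port_from_state_value(4294934537) gives back 32777 and sign_extend_port(32777) gives 4294934537. Ports below 32768 would not be affected under this assumption, which would explain why 80 and 443 look fine in the other dumps.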

After patching the latest 1.9 dev branch I get 503s for all requests after a reload. The state file seems ok but I need to delete the server state file and restart to get requests to work again. I can’t get you access to the host but could run a more verbose patch.

While things are working after fresh startup (./haproxy -f haproxy-dns.cfg -D -st $(cat /tmp/pulsar_haproxy.pid))

echo "show servers state" | sudo socat stdio /var/run/haproxy.sock
1
# be_id be_name srv_id srv_name srv_addr srv_op_state srv_admin_state srv_uweight srv_iweight srv_time_since_last_change srv_check_status srv_check_result srv_check_health srv_check_state srv_agent_state bk_f_forced_id srv_f_forced_id srv_fqdn srv_port srvrecord
6 backend_app 1 app1 192.168.0.146 2 0 1 1 39 15 3 4 6 0 0 0 docker01.marathon.mesos 443 _app._tcp.marathon.mesos
6 backend_app 2 app2 - 0 0 1 1 61 5 2 0 6 0 0 0 - 0 _app._tcp.marathon.mesos

Successful request:
Aug 13 11:30:17 localhost haproxy[15553]: 127.0.0.1:48942 [13/Aug/2018:11:30:16.899] http-in~ backend_app/app1 0/0/85/45/131 200 251 - - ---- 1/1/0/0/0 0/0 "GET /app/ping HTTP/1.1"

After reconfig, 503s permanently (socat /var/run/haproxy.sock - <<< "show servers state" > /var/lib/haproxy/server-state && ./haproxy -D -f haproxy-dns.cfg -x /var/run/haproxy.sock -sf $(cat /tmp/pulsar_haproxy.pid)):

Seems like it was happy to configure slot 1 from the state file
Aug 13 11:30:50 localhost haproxy[15770]: Server backend_app/app2 is DOWN, changed from server-state after a reload. 1 active and 0 backup servers left. 0 sessions active, 0 requeued, 0 remaining in queue.
Aug 13 11:30:50 localhost haproxy[15770]: Server backend_app/app2 is DOWN, changed from server-state after a reload. 1 active and 0 backup servers left. 0 sessions active, 0 requeued, 0 remaining in queue.

State seems ok?

echo "show servers state" | sudo socat stdio /var/run/haproxy.sock
1
# be_id be_name srv_id srv_name srv_addr srv_op_state srv_admin_state srv_uweight srv_iweight srv_time_since_last_change srv_check_status srv_check_result srv_check_health srv_check_state srv_agent_state bk_f_forced_id srv_f_forced_id srv_fqdn srv_port srvrecord
6 backend_app 1 app1 192.168.0.146 2 0 1 1 58 15 3 4 6 0 0 0 docker01.marathon.mesos 443 _app._tcp.marathon.mesos
6 backend_app 2 app2 - 0 0 1 1 7 5 2 0 6 0 0 0 - 0 _app._tcp.marathon.mesos

Unsuccessful requests from then on:
Aug 13 11:31:04 localhost haproxy[15770]: 127.0.0.1:48980 [13/Aug/2018:11:31:02.048] http-in~ backend_app/app1 0/0/-1/-1/2132 503 212 - - SC-- 1/1/0/0/2 0/0 "GET /app/ping HTTP/1.1"

Could you try compiling with the srv_prepare_for_resolution section in server.c (quoted above) commented out?

I am not sure why, but that resolved it for me.

Yes, that seems to work well for me too.

I added some ugly print statements, but it looks like srv->hostname_dn is not set to the right value in srv_prepare_for_resolution. During clean init it is set to something like _testapps._tcp.marathon.mesos, but srv_prepare_for_resolution sets it to something like docker01.marathon.mesos, and then nothing works. Pardon my ugly prints; hostname_dn is binary, so it doesn’t print well.

[WARNING] 226/134740 (23446) : new_dns_srvrq init srv->name: '_testapps._tcp.marathon.mesos' srv->hostname_dn: '	_testapps_tcmarathonmesos' srv-hostname_dn_len: '30'
[WARNING] 226/134740 (23446) : server-state application loading 'backend_testapps/testapps1'443 _testapps._tcp.marathon.mesos
[WARNING] 226/134740 (23446) : before srv_prepare_for_resolution(), srv->hostname '(null)' srv->hostname_dn '(null)' srv->hostname_dn_len '0'
[WARNING] 226/134740 (23446) : after srv_prepare_for_resolution(), srv->hostname 'docker01.marathon.mesos' srv->hostname_dn docker0marathonmesos' srv->hostname_dn_len '24'
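
The lengths in these prints match the DNS wire format, where each label is preceded by a one-byte length instead of being separated by dots: 30 bytes for _testapps._tcp.marathon.mesos and 24 for docker01.marathon.mesos. The length bytes are control characters (9 is a tab, 8 is a backspace), which is probably why the prints lose characters such as the p of _tcp and the 1 of docker01 on a terminal. Below is a minimal sketch of that encoding plus a printer that escapes the length bytes; both helper names are hypothetical and this is not HAProxy's own conversion code.

```c
#include <stdio.h>
#include <string.h>

/* Encode "a.b.c" as length-prefixed labels followed by a 0 byte.
 * Returns the number of bytes written, not counting the final 0,
 * or -1 on an empty label, a label over 63 bytes, or a short buffer. */
int name_to_labels(const char *name, unsigned char *out, size_t outsz)
{
    size_t len = strlen(name);
    size_t o = 0;
    const char *p = name;

    if (len == 0 || len + 2 > outsz)
        return -1;

    while (*p) {
        const char *dot = strchr(p, '.');
        size_t l = dot ? (size_t)(dot - p) : strlen(p);

        if (l == 0 || l > 63)
            return -1;
        out[o++] = (unsigned char)l; /* length byte replaces the dot */
        memcpy(out + o, p, l);
        o += l;
        p += l;
        if (*p == '.')
            p++;
    }
    out[o] = 0;
    return (int)o;
}

/* Render the label bytes as text, writing non-printable bytes as \xNN
 * so that a terminal does not interpret them. */
void labels_to_printable(const unsigned char *dn, int dn_len,
                         char *out, size_t outsz)
{
    size_t o = 0;
    int i;

    if (outsz == 0)
        return;
    for (i = 0; i < dn_len && o + 5 <= outsz; i++) {
        if (dn[i] >= 0x20 && dn[i] < 0x7f)
            out[o++] = (char)dn[i];
        else
            o += (size_t)snprintf(out + o, outsz - o, "\\x%02x", dn[i]);
    }
    out[o] = '\0';
}
```

Printing hostname_dn through something like labels_to_printable in the debug messages would show \x08docker01\x08marathon\x05mesos instead of a string with missing characters, which makes it easier to compare the value set at init with the one set by srv_prepare_for_resolution.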

@Baptiste This line must not be right?

http://git.haproxy.org/?p=haproxy.git;a=blob;f=src/server.c;h=1d7a5a771e435f8654dda66da9d881dc0e6f8c39;hb=HEAD#l1489

@Baptiste and @scarey

Just for your information, on my setup I also had to comment out the following lines:

// if (port > USHRT_MAX) {
// 	chunk_appendf(msg, ", invalid srv_port value '%s'", port_str);
// 	port_str = NULL;
// }

I know these are not really the right fixes, but since the state file is quite controlled for my use case, it is sufficient.
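
For reference, the check being commented out here flags any srv_port above USHRT_MAX (65535) as invalid, which a value like 4294934537 would trigger. A standalone sketch of that kind of range check is shown below, assuming the port arrives as a decimal string; parse_port_field is a hypothetical name, not the function HAProxy uses.

```c
#include <errno.h>
#include <limits.h>
#include <stdlib.h>

/* Returns the port (0..65535), or -1 if the field is not a plain
 * decimal number or falls outside the 16-bit range. */
long parse_port_field(const char *port_str)
{
    char *end;
    unsigned long v;

    /* Reject NULL, empty strings, signs and placeholders such as "-". */
    if (!port_str || *port_str < '0' || *port_str > '9')
        return -1;

    errno = 0;
    v = strtoul(port_str, &end, 10);
    if (errno != 0 || *end != '\0' || v > USHRT_MAX)
        return -1;

    return (long)v;
}
```

With this, "443" is accepted and "4294934537" is rejected, so disabling the check only hides the fact that the value written to the state file is already out of range.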


Is there anything else we can do to help get this resolved?

Hi,

Sorry for the long delay in my answer. I spent some time troubleshooting a last bug on this one.

I pushed the relevant patch on the mailing list this morning, for both HAProxy dev and 1.8.

Could you please give it a try?

(It has been running in a test environment at Jude’s for one day, with many reloads and no crashes, no issues, and server information correctly replicated from the old process to the new one.)

Baptiste


Thanks. I tested the 1.9 patch and it looks good.

I forgot to ask previously…does the SIGUSR2 reconfig work with DNS or will I need to switch to the server state file like I used here for testing?

If you need frequent reloads, it’s safer to use the server state file as well. The reload will be more consistent.

Hi @Baptiste

Thanks a lot. This works for me too after testing it two days in my dev setup.

And thanks a ton for spending your time on helping resolve this issue.

For anybody wanting the link to the patch, this is the one I used:
https://www.mail-archive.com/haproxy@formilux.org/msg31155.html

Hi @Baptiste ,

Our configuration

server-template srv 4 _testsrv._tcp.service.consul inter 1s resolvers consul resolve-prefer ipv4 resolve-opts allow-dup-ip check

“show servers state” Result:
# be_id be_name srv_id srv_name srv_addr srv_op_state srv_admin_state srv_uweight srv_iweight srv_time_since_last_change srv_check_status srv_check_result srv_check_health srv_check_state srv_agent_state bk_f_forced_id srv_f_forced_id srv_fqdn srv_port srvrecord
4 testsrv.configuration.abc.com 1 srv1 10.90.21.103 0 96 1 1 20373 15 3 0 14 0 0 0 - 4294934534 _testsrv._tcp.service.consul
4 testsrv.configuration.abc.com 2 srv2 10.90.21.107 0 96 1 1 20365 15 3 0 14 0 0 0 - 4294934530 _testsrv._tcp.service.consul
4 testsrv.configuration.abc.com 3 srv3 10.90.21.103 0 96 1 1 20409 15 3 0 14 0 0 0 - 4294934533 _testsrv._tcp.service.consul
4 testsrv.configuration.abc.com 4 srv4 10.90.21.103 2 0 1 1 20554 15 3 4 6 0 0 0 consul-node-10-90-21-103.node.ayt.consul 4294934535 _testsrv._tcp.service.consul

The HAProxy stats UI shows 10.90.21.103:-32761 (negative; I think it is the real port minus 65536).

The IP address is OK, but the port is wrong: it should be 32775.

I tried with 1.8.14 and 1.9-dev8; the result is the same.

Any idea?

Ping @Baptiste

Hi @Baptiste are these fixes on Docker Hub? I’m currently experiencing the same issue using the state file and Docker image v1.9.2 (SIGHUP or SIGUSR2).

Guys,
For some reason, I can’t reconnect to Discourse :)

In the meantime, and to make it clearer, could you open a new Discourse thread where you explain your problem and share your config and all your troubleshooting steps?

I mean, the issue in this thread is supposed to be solved already, and trying to find the info for your own issue here is a pain :)
