Config reload with dynamic service discovery via DNS


#21

For me it doesn't work either: on 1.9 we get 503s as soon as the master notifies the worker to respawn… If I remove the srv_prepare_for_resolution section then it seems to be fine…
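
For context, the setup under discussion pairs a resolvers section with server-template, so backend servers are filled in from DNS SRV records at runtime. A minimal sketch of such a configuration, written as a shell here-doc (all names and addresses are placeholders, and Consul's default DNS port 8600 is assumed):

cat > haproxy-dns.cfg <<'EOF'
resolvers consul
    # Consul's DNS interface (assumed local, default port)
    nameserver consul 127.0.0.1:8600

backend backend_app
    # allocate 10 server slots, filled from the SRV record at runtime
    server-template app 10 _app._tcp.service.consul resolvers consul check
EOF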


#22

Hi Francis,

What do you mean by removing srv_prepare_for_resolution?

Can you tell me exactly what you did, and share the state file, logs, and the output of HAProxy in debug mode?

I mean, the code works on my laptop with Consul as a DNS server.

Looking forward to fixing it!

Baptiste


#23

Note: I am using Consul 1.2.0-dev (I noticed that Consul 1.2 seems to have IPv6 errors…)

This is exactly what I removed from server.c:

/*
// prepare DNS resolution for this server (hasn't this already been done by the server-template function?)
res = srv_prepare_for_resolution(srv, fqdn);
if (res == -1) {
    ha_alert("could not allocate memory for DNS resolution for server … '%s'\n", srv->id);
    chunk_appendf(msg, ", can't allocate memory for DNS resolution for server '%s'", srv->id);
    HA_SPIN_UNLOCK(SERVER_LOCK, &srv->lock);
    goto out;
}
*/

I also had to make some modifications in src/proxy.c; see these two commits:

https://github.com/ACenterA/haproxy/commit/1ec245208976366960ff62d25000985801b93e46#diff-70645453d998e55219270ded2f5b1b25
https://github.com/ACenterA/haproxy/commit/d99f3ee0644ad827f5fe9d10067223e62839bd2f

I know the state file changes might have some other impacts, but that is the only way I could get everything “working”.

In short, I launch "ab -c 10 -n 100 https://myhostname/" and force a reload by sending a kill -SIGUSR2 to the HAProxy master process, which respawns the workers and then serves 503s without these fixes on my end…
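
A rough sketch of that reproduction, assuming master-worker mode with the master's PID written to /tmp/haproxy.pid (the path and hostname are placeholders):

# keep some load running in the background (hypothetical hostname)
ab -c 10 -n 100 https://myhostname/ &

# in master-worker mode, SIGUSR2 makes the master re-exec itself,
# reload the configuration, and respawn the workers
kill -SIGUSR2 "$(cat /tmp/haproxy.pid)"

# without the patch, requests served by the respawned workers return 503
wait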

With the fixes I implemented, the service is stable across reloads. I don't think they are the right fixes, though.


#24

I'm using Consul 1.1.0, but I don't think the problem is related to it.

I don't really understand the changes you made in proxy.c. Could you show me the final version of the file?
Also, could you show me the output of the state file?

There is an easier way to test if this all works.

Start HAProxy, wait a bit (1 minute), save the server state, then stop HAProxy.

Then start HAProxy in debug mode (comment out the master-worker statement in your config file).
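
As a sketch, the whole sequence looks like this, assuming the stats socket is at /var/run/haproxy.sock and the config's server-state-file points to /var/lib/haproxy/server-state (both paths are placeholders):

# start in the background and give DNS resolution time to populate the servers
./haproxy -f ./srv-records_server-state.cfg -D -p /var/run/haproxy.pid
sleep 60

# dump the current server state to the file referenced by server-state-file
echo "show servers state" | socat stdio /var/run/haproxy.sock > /var/lib/haproxy/server-state

# stop the running instance
kill "$(cat /var/run/haproxy.pid)"

# start again in foreground/debug mode; the state is loaded back from the file
./haproxy -d -db -f ./srv-records_server-state.cfg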

Here is an example with my configuration:

./haproxy -d -db -f ./srv-records_server-state.cfg

Note: setting global.maxconn to 2000.

Available polling systems :

epoll : pref=300, test result OK

poll : pref=200, test result OK

select : pref=150, test result FAILED

Total: 3 (2 usable), will use epoll.

Available filters :

[SPOE] spoe

[COMP] compression

[TRACE] trace

Using epoll() as the polling mechanism.

[WARNING] 217/164140 (22976) : Server www/srv5 is DOWN, changed from server-state after a reload. 9 active and 0 backup servers left. 0 sessions active, 0 requeued, 0 remaining in queue.

[WARNING] 217/164140 (22976) : Server www/srv6 is DOWN, changed from server-state after a reload. 8 active and 0 backup servers left. 0 sessions active, 0 requeued, 0 remaining in queue.

[WARNING] 217/164140 (22976) : Server www/srv7 is DOWN, changed from server-state after a reload. 7 active and 0 backup servers left. 0 sessions active, 0 requeued, 0 remaining in queue.

[WARNING] 217/164140 (22976) : Server www/srv8 is DOWN, changed from server-state after a reload. 6 active and 0 backup servers left. 0 sessions active, 0 requeued, 0 remaining in queue.

[WARNING] 217/164140 (22976) : Server www/srv9 is DOWN, changed from server-state after a reload. 5 active and 0 backup servers left. 0 sessions active, 0 requeued, 0 remaining in queue.

[WARNING] 217/164140 (22976) : Server www/srv10 is DOWN, changed from server-state after a reload. 4 active and 0 backup servers left. 0 sessions active, 0 requeued, 0 remaining in queue.

==> servers 1 to 4 do not appear in the warnings, because their state was fully loaded from the state file below:

1

be_id be_name srv_id srv_name srv_addr srv_op_state srv_admin_state srv_uweight srv_iweight srv_time_since_last_change srv_check_status srv_check_result srv_check_health srv_check_state srv_agent_state bk_f_forced_id srv_f_forced_id srv_fqdn srv_port srvrecord

3 www 1 srv1 192.168.0.1 0 0 2 1 64 7 0 0 7 0 0 0 A1.tld 80 _http._tcp.be1.tld

3 www 2 srv2 192.168.0.4 0 0 2 1 63 7 2 0 6 0 0 0 A4.tld 80 _http._tcp.be1.tld

3 www 3 srv3 192.168.0.2 0 0 2 1 63 7 2 0 6 0 0 0 A2.tld 80 _http._tcp.be1.tld

3 www 4 srv4 192.168.0.3 0 0 2 1 63 7 2 0 6 0 0 0 A3.tld 80 _http._tcp.be1.tld

3 www 5 srv5 - 0 0 1 1 63 5 2 0 6 0 0 0 - 0 _http._tcp.be1.tld

3 www 6 srv6 - 0 0 1 1 63 5 2 0 6 0 0 0 - 0 _http._tcp.be1.tld

3 www 7 srv7 - 0 0 1 1 62 5 2 0 6 0 0 0 - 0 _http._tcp.be1.tld

3 www 8 srv8 - 0 0 1 1 62 5 2 0 6 0 0 0 - 0 _http._tcp.be1.tld

3 www 9 srv9 - 0 0 1 1 62 5 2 0 6 0 0 0 - 0 _http._tcp.be1.tld

3 www 10 srv10 - 0 0 1 1 62 5 2 0 6 0 0 0 - 0 _http._tcp.be1.tld

I can try to send you a patch with a lot of verbose messages, but it would be easier if I could access one of your boxes where this code is installed.


#25

My HAProxy state looks like this for the discovery:

112 defaultback_failsaife 1 varnish1 10.100.20.78 2 0 1 1 203910 15 3 4 6 0 0 0 ip-10-100-20-78.node.aws-us-east-1.consul 4294934537 _tcp_.varnish.service.consul

Indeed, the proxy.c changes were mostly to investigate the DNS issue… (That 4294934537 is actually 32777 on my server… I guess it's an unsigned value.)
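
For what it's worth, that value is consistent with a signed 16-bit overflow rather than just an unsigned print: 32777 does not fit in a signed 16-bit integer, so it wraps to 32777 - 65536 = -32759, and sign-extending that to 32 bits and printing it as unsigned gives exactly 4294934537:

# 32777 wrapped through a signed 16-bit value, printed as an unsigned 32-bit value
echo $(( 32777 - 65536 + 2**32 ))    # -> 4294934537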

Note: if that's OK, we can check in two weeks; I could set something up to give you access to one of my VMs. I believe I sent you an email at your Gmail address a few weeks back…


#26

After patching the latest 1.9 dev branch, I get 503s for all requests after a reload. The state file seems OK, but I need to delete the server-state file and restart to get requests working again. I can't give you access to the host, but I could run a more verbose patch.

While things are working after a fresh startup (./haproxy -f haproxy-dns.cfg -D -st $(cat /tmp/pulsar_haproxy.pid)):

sudo echo "show servers state" | socat stdio /var/run/haproxy.sock 
1
# be_id be_name srv_id srv_name srv_addr srv_op_state srv_admin_state srv_uweight srv_iweight srv_time_since_last_change srv_check_status srv_check_result srv_check_health srv_check_state srv_agent_state bk_f_forced_id srv_f_forced_id srv_fqdn srv_port srvrecord
6 backend_app 1 app1 192.168.0.146 2 0 1 1 39 15 3 4 6 0 0 0 docker01.marathon.mesos 443 _app._tcp.marathon.mesos
6 backend_app 2 app2 - 0 0 1 1 61 5 2 0 6 0 0 0 - 0 _app._tcp.marathon.mesos

Successful request:
Aug 13 11:30:17 localhost haproxy[15553]: 127.0.0.1:48942 [13/Aug/2018:11:30:16.899] http-in~ backend_app/app1 0/0/85/45/131 200 251 - - ---- 1/1/0/0/0 0/0 "GET /app/ping HTTP/1.1"

After a reconfig, 503s permanently (socat /var/run/haproxy.sock - <<< "show servers state" > /var/lib/haproxy/server-state && ./haproxy -D -f haproxy-dns.cfg -x /var/run/haproxy.sock -sf $(cat /tmp/pulsar_haproxy.pid))
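
Unpacked, that reload sequence is (the same commands as above, just split out):

# save the current server state to the file referenced by server-state-file
socat /var/run/haproxy.sock - <<< "show servers state" > /var/lib/haproxy/server-state

# start the new process: -x retrieves the listening sockets from the old process
# over its stats socket, and -sf tells the old process to finish serving and exit
./haproxy -D -f haproxy-dns.cfg -x /var/run/haproxy.sock -sf "$(cat /tmp/pulsar_haproxy.pid)"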

Seems like it was happy to configure slot 1 from the state file:
Aug 13 11:30:50 localhost haproxy[15770]: Server backend_app/app2 is DOWN, changed from server-state after a reload. 1 active and 0 backup servers left. 0 sessions active, 0 requeued, 0 remaining in queue.
Aug 13 11:30:50 localhost haproxy[15770]: Server backend_app/app2 is DOWN, changed from server-state after a reload. 1 active and 0 backup servers left. 0 sessions active, 0 requeued, 0 remaining in queue.

State seems ok?

sudo echo "show servers state" | socat stdio /var/run/haproxy.sock 
1
# be_id be_name srv_id srv_name srv_addr srv_op_state srv_admin_state srv_uweight srv_iweight srv_time_since_last_change srv_check_status srv_check_result srv_check_health srv_check_state srv_agent_state bk_f_forced_id srv_f_forced_id srv_fqdn srv_port srvrecord
6 backend_app 1 app1 192.168.0.146 2 0 1 1 58 15 3 4 6 0 0 0 docker01.marathon.mesos 443 _app._tcp.marathon.mesos
6 backend_app 2 app2 - 0 0 1 1 7 5 2 0 6 0 0 0 - 0 _app._tcp.marathon.mesos

Unsuccessful requests from then on:
Aug 13 11:31:04 localhost haproxy[15770]: 127.0.0.1:48980 [13/Aug/2018:11:31:02.048] http-in~ backend_app/app1 0/0/-1/-1/2132 503 212 - - SC-- 1/1/0/0/2 0/0 "GET /app/ping HTTP/1.1"


#27

Could you try to compile with the section of server.c I quoted above (in #23) commented out?

I am not sure why, but that resolved it for me.


#28

Yes, that seems to work well for me too.