Config reload with dynamic service discovery via DNS


#1

Hi,

I’m noticing that configuration reloads aren’t seamless when using dynamic service discovery via DNS. My simple test makes a request every second or so; when I trigger a reload somewhere in the middle, there are about 8 seconds where services are unavailable. Is that the behavior you would expect? Is there some configuration I can use to minimize the downtime? Any help would be appreciated. If I run the same test with a more static configuration, the reload works very well.

Test output:

scarey@static:~/dev/data$ ./test.sh 
{"time":1529005368803}Thu Jun 14 19:42:48 UTC 2018
{"time":1529005370141}Thu Jun 14 19:42:50 UTC 2018
{"time":1529005371504}Thu Jun 14 19:42:51 UTC 2018
{"time":1529005372839}Thu Jun 14 19:42:52 UTC 2018
{"time":1529005374175}Thu Jun 14 19:42:54 UTC 2018
{"time":1529005375533}Thu Jun 14 19:42:55 UTC 2018
<html><body><h1>500 Internal Server Error</h1>
An internal server error occured.
</body></html>
Thu Jun 14 19:42:56 UTC 2018
<html><body><h1>500 Internal Server Error</h1>
An internal server error occured.
</body></html>
Thu Jun 14 19:42:58 UTC 2018
<html><body><h1>500 Internal Server Error</h1>
An internal server error occured.
</body></html>
Thu Jun 14 19:42:59 UTC 2018
{"time":1529005380854}Thu Jun 14 19:43:00 UTC 2018
<html><body><h1>500 Internal Server Error</h1>
An internal server error occured.
</body></html>
Thu Jun 14 19:43:02 UTC 2018
{"time":1529005383481}Thu Jun 14 19:43:03 UTC 2018
{"time":1529005384835}Thu Jun 14 19:43:04 UTC 2018
{"time":1529005386178}Thu Jun 14 19:43:06 UTC 2018
{"time":1529005387524}Thu Jun 14 19:43:07 UTC 2018
{"time":1529005388881}Thu Jun 14 19:43:08 UTC 2018
{"time":1529005390241}Thu Jun 14 19:43:10 UTC 2018

HAProxy logs:

Jun 14 19:42:57 192.168.64.146 haproxy[9525]:  Proxy haproxy-statistics started.
Jun 14 19:42:57 192.168.64.146 haproxy[9525]:  Proxy http-in started.
Jun 14 19:42:57 192.168.64.146 haproxy[9525]:  Proxy backend_secure_test-app6 started.
Jun 14 19:42:57 192.168.64.146 haproxy[9525]:  Proxy backend_secure_test-log4j2 started.
Jun 14 19:42:57 192.168.64.146 haproxy[11831]:  Stopping proxy haproxy-statistics in 0 ms.
Jun 14 19:42:57 192.168.64.146 haproxy[11831]:  Stopping frontend http-in in 0 ms.
Jun 14 19:42:57 192.168.64.146 haproxy[11831]:  Stopping backend backend_secure_test-app6 in 0 ms.
Jun 14 19:42:57 192.168.64.146 haproxy[11831]:  Stopping backend backend_secure_test-log4j2 in 0 ms.
Jun 14 19:42:57 192.168.64.146 haproxy[11831]:  Proxy haproxy-statistics stopped (FE: 0 conns, BE: 0 conns).
Jun 14 19:42:57 192.168.64.146 haproxy[11831]:  Proxy http-in stopped (FE: 65 conns, BE: 0 conns).
Jun 14 19:42:57 192.168.64.146 haproxy[11831]:  Proxy backend_secure_test-app6 stopped (FE: 0 conns, BE: 0 conns).
Jun 14 19:42:57 192.168.64.146 haproxy[11831]:  Proxy backend_secure_test-log4j2 stopped (FE: 0 conns, BE: 65 conns).
Jun 14 19:42:57 192.168.64.146 haproxy[19038]:  backend_secure_test-app6/test-app6_secure_secure1 changed its FQDN from (null) to sqainfra-docker04.marathon.mesos by 'SRV record'
Jun 14 19:42:57 192.168.64.146 haproxy[19038]:  backend_secure_test-log4j2/test-log4j2_secure_secure1 changed its FQDN from (null) to sqainfra-docker03.marathon.mesos by 'SRV record'
Jun 14 19:42:57 192.168.64.146 haproxy[19038]:  backend_secure_test-log4j2/test-log4j2_secure_secure2 changed its FQDN from (null) to sqainfra-docker04.marathon.mesos by 'SRV record'
Jun 14 19:42:57 192.168.64.146 haproxy[19038]:  backend_secure_test-log4j2/test-log4j2_secure_secure3 changed its FQDN from (null) to sqainfra-docker05.marathon.mesos by 'SRV record'
Jun 14 19:43:00 192.168.64.146 haproxy[19038]:  backend_secure_test-app6/test-app6_secure_secure1 changed its IP from  to 192.168.64.149 by DNS cache.
Jun 14 19:43:00 192.168.64.146 haproxy[19038]:  backend_secure_test-log4j2/test-log4j2_secure_secure2 changed its IP from  to 192.168.64.149 by DNS cache.
Jun 14 19:43:00 192.168.64.146 haproxy[19038]:  backend_secure_test-log4j2/test-log4j2_secure_secure3 changed its IP from  to 192.168.64.163 by DNS cache.
Jun 14 19:43:00 192.168.64.146 haproxy[19038]:  backend_secure_test-log4j2/test-log4j2_secure_secure1 changed its IP from  to 192.168.64.207 by localdns/dns1.
Jun 14 19:43:00 192.168.64.146 haproxy[19038]:  Health check for server backend_secure_test-app6/test-app6_secure_secure1 succeeded, reason: Layer7 check passed, code: 200, info: "OK", check duration: 1ms, status: 2/2 UP.
Jun 14 19:43:00 192.168.64.146 haproxy[19038]:  Health check for server backend_secure_test-app6/test-app6_secure_secure2 failed, reason: Socket error, check duration: 0ms, status: 0/3 DOWN.
Jun 14 19:43:00 192.168.64.146 haproxy[19038]:  Server backend_secure_test-app6/test-app6_secure_secure2 is DOWN. 9 active and 0 backup servers left. 0 sessions active, 0 requeued, 0 remaining in queue.
Jun 14 19:43:00 192.168.64.146 haproxy[19038]:  Health check for server backend_secure_test-app6/test-app6_secure_secure3 failed, reason: Socket error, check duration: 0ms, status: 0/3 DOWN.
Jun 14 19:43:00 192.168.64.146 haproxy[19038]:  Server backend_secure_test-app6/test-app6_secure_secure3 is DOWN. 8 active and 0 backup servers left. 0 sessions active, 0 requeued, 0 remaining in queue.
Jun 14 19:43:01 192.168.64.146 haproxy[19038]:  Health check for server backend_secure_test-app6/test-app6_secure_secure4 failed, reason: Socket error, check duration: 0ms, status: 0/3 DOWN.
Jun 14 19:43:01 192.168.64.146 haproxy[19038]:  Server backend_secure_test-app6/test-app6_secure_secure4 is DOWN. 7 active and 0 backup servers left. 0 sessions active, 0 requeued, 0 remaining in queue.
Jun 14 19:43:01 192.168.64.146 haproxy[19038]:  Health check for server backend_secure_test-app6/test-app6_secure_secure5 failed, reason: Socket error, check duration: 0ms, status: 0/3 DOWN.
Jun 14 19:43:01 192.168.64.146 haproxy[19038]:  Server backend_secure_test-app6/test-app6_secure_secure5 is DOWN. 6 active and 0 backup servers left. 0 sessions active, 0 requeued, 0 remaining in queue.
Jun 14 19:43:01 192.168.64.146 haproxy[19038]:  Health check for server backend_secure_test-app6/test-app6_secure_secure6 failed, reason: Socket error, check duration: 0ms, status: 0/3 DOWN.
Jun 14 19:43:01 192.168.64.146 haproxy[19038]:  Server backend_secure_test-app6/test-app6_secure_secure6 is DOWN. 5 active and 0 backup servers left. 0 sessions active, 0 requeued, 0 remaining in queue.
Jun 14 19:43:01 192.168.64.146 haproxy[19038]:  Health check for server backend_secure_test-app6/test-app6_secure_secure7 failed, reason: Socket error, check duration: 0ms, status: 0/3 DOWN.
Jun 14 19:43:01 192.168.64.146 haproxy[19038]:  Server backend_secure_test-app6/test-app6_secure_secure7 is DOWN. 4 active and 0 backup servers left. 0 sessions active, 0 requeued, 0 remaining in queue.
Jun 14 19:43:01 192.168.64.146 haproxy[19038]:  Health check for server backend_secure_test-app6/test-app6_secure_secure8 failed, reason: Socket error, check duration: 0ms, status: 0/3 DOWN.
Jun 14 19:43:01 192.168.64.146 haproxy[19038]:  Server backend_secure_test-app6/test-app6_secure_secure8 is DOWN. 3 active and 0 backup servers left. 0 sessions active, 0 requeued, 0 remaining in queue.
Jun 14 19:43:01 192.168.64.146 haproxy[19038]:  Health check for server backend_secure_test-app6/test-app6_secure_secure9 failed, reason: Socket error, check duration: 0ms, status: 0/3 DOWN.
Jun 14 19:43:01 192.168.64.146 haproxy[19038]:  Server backend_secure_test-app6/test-app6_secure_secure9 is DOWN. 2 active and 0 backup servers left. 0 sessions active, 0 requeued, 0 remaining in queue.
Jun 14 19:43:01 192.168.64.146 haproxy[19038]:  Health check for server backend_secure_test-app6/test-app6_secure_secure10 failed, reason: Socket error, check duration: 0ms, status: 0/3 DOWN.
Jun 14 19:43:01 192.168.64.146 haproxy[19038]:  Server backend_secure_test-app6/test-app6_secure_secure10 is DOWN. 1 active and 0 backup servers left. 0 sessions active, 0 requeued, 0 remaining in queue.

Jun 14 19:43:02 192.168.64.146 haproxy[19038]:  Health check for server backend_secure_test-log4j2/test-log4j2_secure_secure1 succeeded, reason: Layer7 check passed, code: 200, info: "OK", check duration: 1ms, status: 2/2 UP.
Jun 14 19:43:02 192.168.64.146 haproxy[19038]:  Health check for server backend_secure_test-log4j2/test-log4j2_secure_secure2 succeeded, reason: Layer7 check passed, code: 200, info: "OK", check duration: 0ms, status: 2/2 UP.
Jun 14 19:43:02 192.168.64.146 haproxy[19038]:  Health check for server backend_secure_test-log4j2/test-log4j2_secure_secure3 succeeded, reason: Layer7 check passed, code: 200, info: "OK", check duration: 0ms, status: 2/2 UP.
Jun 14 19:43:02 192.168.64.146 haproxy[19038]:  Health check for server backend_secure_test-log4j2/test-log4j2_secure_secure4 failed, reason: Socket error, check duration: 0ms, status: 0/3 DOWN.
Jun 14 19:43:02 192.168.64.146 haproxy[19038]:  Server backend_secure_test-log4j2/test-log4j2_secure_secure4 is DOWN. 9 active and 0 backup servers left. 0 sessions active, 0 requeued, 0 remaining in queue.
Jun 14 19:43:02 192.168.64.146 haproxy[19038]:  Health check for server backend_secure_test-log4j2/test-log4j2_secure_secure5 failed, reason: Socket error, check duration: 0ms, status: 0/3 DOWN.
Jun 14 19:43:02 192.168.64.146 haproxy[19038]:  Server backend_secure_test-log4j2/test-log4j2_secure_secure5 is DOWN. 8 active and 0 backup servers left. 0 sessions active, 0 requeued, 0 remaining in queue.
Jun 14 19:43:03 192.168.64.146 haproxy[19038]:  Health check for server backend_secure_test-log4j2/test-log4j2_secure_secure6 failed, reason: Socket error, check duration: 0ms, status: 0/3 DOWN.
Jun 14 19:43:03 192.168.64.146 haproxy[19038]:  Server backend_secure_test-log4j2/test-log4j2_secure_secure6 is DOWN. 7 active and 0 backup servers left. 0 sessions active, 0 requeued, 0 remaining in queue.
Jun 14 19:43:03 192.168.64.146 haproxy[19038]:  Health check for server backend_secure_test-log4j2/test-log4j2_secure_secure7 failed, reason: Socket error, check duration: 0ms, status: 0/3 DOWN.
Jun 14 19:43:03 192.168.64.146 haproxy[19038]:  Server backend_secure_test-log4j2/test-log4j2_secure_secure7 is DOWN. 6 active and 0 backup servers left. 0 sessions active, 0 requeued, 0 remaining in queue.
Jun 14 19:43:03 192.168.64.146 haproxy[19038]:  Health check for server backend_secure_test-log4j2/test-log4j2_secure_secure8 failed, reason: Socket error, check duration: 0ms, status: 0/3 DOWN.
Jun 14 19:43:03 192.168.64.146 haproxy[19038]:  Server backend_secure_test-log4j2/test-log4j2_secure_secure8 is DOWN. 5 active and 0 backup servers left. 0 sessions active, 0 requeued, 0 remaining in queue.
Jun 14 19:43:03 192.168.64.146 haproxy[19038]:  Health check for server backend_secure_test-log4j2/test-log4j2_secure_secure9 failed, reason: Socket error, check duration: 0ms, status: 0/3 DOWN.
Jun 14 19:43:03 192.168.64.146 haproxy[19038]:  Server backend_secure_test-log4j2/test-log4j2_secure_secure9 is DOWN. 4 active and 0 backup servers left. 0 sessions active, 0 requeued, 0 remaining in queue.
Jun 14 19:43:03 192.168.64.146 haproxy[19038]:  Health check for server backend_secure_test-log4j2/test-log4j2_secure_secure10 failed, reason: Socket error, check duration: 0ms, status: 0/3 DOWN.
Jun 14 19:43:03 192.168.64.146 haproxy[19038]:  Server backend_secure_test-log4j2/test-log4j2_secure_secure10 is DOWN. 3 active and 0 backup servers left. 0 sessions active, 0 requeued, 0 remaining in queue.
Jun 14 19:43:03 192.168.64.146 haproxy[19038]:  Health check for server backend_secure_scarey-log4j2/scarey-log4j2_secure_secure1 succeeded, reason: Layer7 check passed, code: 200, info: "OK", check duration: 1ms, status: 1/3 DOWN.
Jun 14 19:43:13 192.168.64.146 haproxy[19038]:  Health check for server backend_secure_scarey-log4j2/scarey-log4j2_secure_secure1 succeeded, reason: Layer7 check passed, code: 200, info: "OK", check duration: 1ms, status: 2/3 DOWN.

Thanks,
Steve


#2

Hi,

First, your tests started at 19:42:48 and HAProxy was started at 19:42:57 according to your logs, so it is hard to correlate anything.

That said, I guess your configuration is missing an “init-addr none”. Could you please send an anonymized version of your config?
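For reference, it would look something like this (a sketch only — the backend, resolvers and SRV record names here are just placeholders):

```haproxy
# Sketch only: with 'init-addr none', servers start without an address
# (in maintenance mode) instead of failing at startup when DNS is not
# yet resolvable; the resolvers section then fills in the IPs.
backend example_backend
    default-server init-addr none
    server-template srv 10 _example._tcp.service.local resolvers mydns check
```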


#3

Thank you for the response.

I’m not specifying init-addr, so server IPs should hopefully be preserved across reloads. This is with version 1.8.8. My config looks like:

global
	maxconn 5000
	nbproc 1
	log 127.0.0.1 local1			# Traffic logs
	log 127.0.0.1 local2 notice 		# Event logs
	pidfile /etc/pulsar_haproxy.pid
    ssl-default-bind-ciphers ...
    ssl-default-bind-options ...
    ssl-default-server-ciphers ...
    ssl-default-server-options ...
	stats socket /var/run/haproxy.sock mode 600 level admin
	stats timeout 2m

defaults
	log global 							# log all proxy instances
	mode http 							# layer 7 proxy

    log-format ...

	retries 4
	timeout connect 5s						
	timeout client 15s						
	timeout server 180s						
	timeout check 8s						
	timeout http-keep-alive 301s            
	timeout http-request 5s

    option clitcpka
	option redispatch						
	option abortonclose 					
	option forwardfor 						
	balance roundrobin						

	option forwardfor except 127.0.0.1				
	option log-health-checks


resolvers localdns
    accepted_payload_size 8192
    nameserver dns1 localhost:53
    resolve_retries       3
    timeout resolve       3s
    timeout retry         3s
    hold other           30s
    hold refused         30s
    hold nx              30s
    hold timeout         30s
    hold valid           10s
    hold obsolete        30s

frontend http-in
	bind *:80
	bind *:443 ssl crt ....pem

	mode http
	acl acl_scarey-log4j2 path_beg /scarey-log4j2/
	use_backend backend_scarey-log4j2 if acl_scarey-log4j2	

backend backend_scarey-log4j2
	balance roundrobin
	option http-server-close
	option httpchk GET /scarey-log4j2/health HTTP/1.1\r\nHost:\ zzz.com
	server-template scarey-log4j2_secure 10 _scarey-log4j2_http._tcp.x.y resolvers localdns resolve-prefer ipv4 maxconn 64 check rise 3 fall 2 ssl  ca-file /...pem check port 12345 inter 10000    

Here’s the event log filtered for that backend across reloads:

root@static:/var/logs/proxylogs# grep "Jun 15 15:3[1234]:.." haproxy_events.log | grep scarey-log4j2
Jun 15 15:32:36 192.168.24.240 haproxy[2910]:  Proxy backend_scarey-log4j2 started.
Jun 15 15:32:37 192.168.24.240 haproxy[9664]:  Stopping backend backend_scarey-log4j2 in 0 ms.
Jun 15 15:32:37 192.168.24.240 haproxy[9664]:  Proxy backend_scarey-log4j2 stopped (FE: 0 conns, BE: 3886 conns).
Jun 15 15:32:37 192.168.24.240 haproxy[17629]:  Health check for server backend_scarey-log4j2/scarey-log4j2_secure1 failed, reason: Socket error, check duration: 0ms, status: 0/3 DOWN.
...
Jun 15 15:32:37 192.168.24.240 haproxy[17629]:  backend_scarey-log4j2/scarey-log4j2_secure1 changed its FQDN from (null) to docker06.x.y by 'SRV record'
...
Jun 15 15:32:37 192.168.24.240 haproxy[17629]:  backend backend_scarey-log4j2 has no server available!
Jun 15 15:32:37 192.168.24.240 haproxy[17629]:  backend_scarey-log4j2/scarey-log4j2_secure1 changed its IP from  to 192.168.0.2 by localdns/dns1.
Jun 15 15:32:42 192.168.24.240 haproxy[17629]:  Health check for server backend_scarey-log4j2/scarey-log4j2_secure1 succeeded, reason: Layer7 check passed, code: 200, info: "OK", check duration: 0ms, status: 1/3 DOWN.
Jun 15 15:32:52 192.168.24.240 haproxy[17629]:  Health check for server backend_scarey-log4j2/scarey-log4j2_secure1 succeeded, reason: Layer7 check passed, code: 200, info: "OK", check duration: 1ms, status: 2/3 DOWN.
Jun 15 15:33:02 192.168.24.240 haproxy[17629]:  Health check for server backend_scarey-log4j2/scarey-log4j2_secure1 succeeded, reason: Layer7 check passed, code: 200, info: "OK", check duration: 1ms, status: 2/2 UP.
Jun 15 15:33:02 192.168.24.240 haproxy[17629]:  Server backend_scarey-log4j2/scarey-log4j2_secure1 is UP. 1 active and 0 backup servers online. 0 sessions requeued, 0 total in queue.

The original process, PID 9664, was running before the reconfig; the log above shows the startup of the new process and the shutdown of the old one. Are the 503s we see from Jun 15 15:32:37 through Jun 15 15:33:01 due to the backend needing to pass 3 health checks before coming back UP?

Thanks,
Steve


#4

OK, I think I understand what’s happening.
After the reload, since the server has no IP, it is administratively in MAINT. Then DNS gives it an IP, which wakes it up. HAProxy then has to pass the ‘rise’ number of health checks at interval ‘inter’ before the server is considered UP again.
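As a rough back-of-the-envelope check (this ignores check scheduling details, so treat it as an upper bound, not HAProxy’s exact behavior): with ‘rise’ checks needed at interval ‘inter’, the delay before the server is UP again is about rise × inter.

```shell
# Rough upper bound only: assumes one check every 'inter' seconds and
# 'rise' consecutive successes required before the server is marked UP.
worst_case_recovery_s() {
    rise=$1; inter_s=$2
    echo $(( rise * inter_s ))
}

worst_case_recovery_s 3 10   # rise 3, inter 10s (as in the config above) -> 30
```

That lines up with the roughly 25 seconds of 503s in the log above.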

In this case, there are two options:

  • you can play with the ‘fastinter’ and ‘downinter’ settings to trigger faster health checks while a server is DOWN or in transition

(more info here: https://www.haproxy.com/fr/documentation/aloha/7-0/traffic-management/lb-layer7/health-checks/#health-check-parameters)

  • you can dump the servers’ state to the server-state file before reloading, then trigger the reload. The new process will then learn each server’s IP address and FQDN from the server-state file.
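For the first option, the settings would look something like this (values and names are illustrative only — tune them to your environment):

```haproxy
# Illustrative: check every 10s when the server is UP, but every 1s
# while it is DOWN or transitioning, so it passes its 'rise' checks
# faster after a reload.
server-template app 10 _app._tcp.example.local resolvers mydns \
    check inter 10s fastinter 1s downinter 1s rise 3 fall 2
```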

#5

Hi @Baptiste,

I’m having much the same issue. I’ve tried the suggestions from your last response, but I’m still seeing my service (and other services) being impacted and showing up as unavailable during reloads. I can describe what I’m observing based on the logs.

I have a watch command that continuously curls a service endpoint, and often, on reloads, it returns an HTTP 503 Service Unavailable.

$ curl user-server
<html><body><h1>503 Service Unavailable</h1>
No server is available to handle this request.
</body></html>

From the logs, I see the following:

Jun 24 21:51:51 ip-1-2-3-206 haproxy[25775]: Proxy user-server started.
Jun 24 21:51:51 ip-1-2-3-206 haproxy[25775]: Proxy user-server started.
Jun 24 21:51:51 ip-1-2-3-206 haproxy[25779]: Stopping backend user-server in 0 ms.
Jun 24 21:51:51 ip-1-2-3-206 haproxy[25779]: Stopping backend user-server in 0 ms.

Jun 24 21:51:51 ip-1-2-3-206 haproxy[27147]: user-server/user-server1 changed its FQDN from (null) to test-ecs-i-xxx.node.eu-west-1.consul by 'SRV record'

Jun 24 21:51:52 ip-1-2-3-206 haproxy[27147]: Server user-server/user-server1 is DOWN, reason: Socket error, check duration: 0ms. 9 active and 0 backup servers left. 0 sessions active, 0 requeued, 0 remaining in queue.
Jun 24 21:51:52 ip-1-2-3-206 haproxy[27147]: Server user-server/user-server1 is DOWN, reason: Socket error, check duration: 0ms. 9 active and 0 backup servers left. 0 sessions active, 0 requeued, 0 remaining in queue.
Jun 24 21:51:52 ip-1-2-3-206 haproxy[27147]: Server user-server/user-server2 is DOWN, reason: Socket error, check duration: 0ms. 8 active and 0 backup servers left. 0 sessions active, 0 requeued, 0 remaining in queue.
Jun 24 21:51:52 ip-1-2-3-206 haproxy[27147]: Server user-server/user-server2 is DOWN, reason: Socket error, check duration: 0ms. 8 active and 0 backup servers left. 0 sessions active, 0 requeued, 0 remaining in queue.
Jun 24 21:51:52 ip-1-2-3-206 haproxy[27147]: Server user-server/user-server3 is DOWN, reason: Socket error, check duration: 0ms. 7 active and 0 backup servers left. 0 sessions active, 0 requeued, 0 remaining in queue.
Jun 24 21:51:52 ip-1-2-3-206 haproxy[27147]: Server user-server/user-server3 is DOWN, reason: Socket error, check duration: 0ms. 7 active and 0 backup servers left. 0 sessions active, 0 requeued, 0 remaining in queue.
Jun 24 21:51:52 ip-1-2-3-206 haproxy[27147]: Server user-server/user-server4 is DOWN, reason: Socket error, check duration: 0ms. 6 active and 0 backup servers left. 0 sessions active, 0 requeued, 0 remaining in queue.
Jun 24 21:51:52 ip-1-2-3-206 haproxy[27147]: Server user-server/user-server4 is DOWN, reason: Socket error, check duration: 0ms. 6 active and 0 backup servers left. 0 sessions active, 0 requeued, 0 remaining in queue.
Jun 24 21:51:52 ip-1-2-3-206 haproxy[27147]: Server user-server/user-server5 is DOWN, reason: Socket error, check duration: 0ms. 5 active and 0 backup servers left. 0 sessions active, 0 requeued, 0 remaining in queue.
Jun 24 21:51:52 ip-1-2-3-206 haproxy[27147]: Server user-server/user-server5 is DOWN, reason: Socket error, check duration: 0ms. 5 active and 0 backup servers left. 0 sessions active, 0 requeued, 0 remaining in queue.
Jun 24 21:51:52 ip-1-2-3-206 haproxy[27147]: Server user-server/user-server6 is DOWN, reason: Socket error, check duration: 0ms. 4 active and 0 backup servers left. 0 sessions active, 0 requeued, 0 remaining in queue.
Jun 24 21:51:52 ip-1-2-3-206 haproxy[27147]: Server user-server/user-server6 is DOWN, reason: Socket error, check duration: 0ms. 4 active and 0 backup servers left. 0 sessions active, 0 requeued, 0 remaining in queue.
Jun 24 21:51:52 ip-1-2-3-206 haproxy[27147]: Server user-server/user-server7 is DOWN, reason: Socket error, check duration: 0ms. 3 active and 0 backup servers left. 0 sessions active, 0 requeued, 0 remaining in queue.
Jun 24 21:51:52 ip-1-2-3-206 haproxy[27147]: Server user-server/user-server7 is DOWN, reason: Socket error, check duration: 0ms. 3 active and 0 backup servers left. 0 sessions active, 0 requeued, 0 remaining in queue.
Jun 24 21:51:52 ip-1-2-3-206 haproxy[27147]: Server user-server/user-server8 is DOWN, reason: Socket error, check duration: 0ms. 2 active and 0 backup servers left. 0 sessions active, 0 requeued, 0 remaining in queue.
Jun 24 21:51:52 ip-1-2-3-206 haproxy[27147]: Server user-server/user-server8 is DOWN, reason: Socket error, check duration: 0ms. 2 active and 0 backup servers left. 0 sessions active, 0 requeued, 0 remaining in queue.
Jun 24 21:51:52 ip-1-2-3-206 haproxy[27147]: Server user-server/user-server9 is DOWN, reason: Socket error, check duration: 0ms. 1 active and 0 backup servers left. 0 sessions active, 0 requeued, 0 remaining in queue.
Jun 24 21:51:52 ip-1-2-3-206 haproxy[27147]: Server user-server/user-server9 is DOWN, reason: Socket error, check duration: 0ms. 1 active and 0 backup servers left. 0 sessions active, 0 requeued, 0 remaining in queue.
Jun 24 21:51:52 ip-1-2-3-206 haproxy[27147]: Server user-server/user-server10 is DOWN, reason: Socket error, check duration: 0ms. 0 active and 0 backup servers left. 0 sessions active, 0 requeued, 0 remaining in queue.
Jun 24 21:51:52 ip-1-2-3-206 haproxy[27147]: Server user-server/user-server10 is DOWN, reason: Socket error, check duration: 0ms. 0 active and 0 backup servers left. 0 sessions active, 0 requeued, 0 remaining in queue.

Jun 24 21:51:52 ip-1-2-3-206 haproxy[27147]: backend user-server has no server available!
Jun 24 21:51:52 ip-1-2-3-206 haproxy[27147]: backend user-server has no server available!

Jun 24 21:51:52 ip-1-2-3-206 haproxy.requests[27147]: {"request_time":"24/Jun/2018:21:51:52.821","host":"ip-1-2-3-206","protocol":"http","http_status":503,"user_agent":"curl/7.47.0","unique_id":"","headers":"{curl/7.47.0||user-server}","endpoint":"/","backend":"user-server","backend_name":"user-server","http_method":"GET","upstream_response_time":-1,"upstream_connect_time":-1,"bytes_read":213,"upstream_addr":"-","source_addr":"-","retries":"0","bytes_uploaded":75,"session_duration":0,"termination_state":"SC","http_query_params":"","accept_time":0,"idle_time":0,"client_time":0,"wait_time":-1,"download_time":-1,"active_time":0}

Jun 24 21:52:01 ip-1-2-3-206 haproxy[27147]: user-server/user-server1 changed its IP from  to x.x.x.177 by DNS cache.
Jun 24 21:52:01 ip-1-2-3-206 haproxy[27147]: user-server/user-server1 changed its IP from  to x.x.x.177 by DNS cache.
Jun 24 21:52:01 ip-1-2-3-206 haproxy[27147]: user-server/user-server3 changed its IP from  to x.x.x.206 by DNS cache.
Jun 24 21:52:01 ip-1-2-3-206 haproxy[27147]: user-server/user-server3 changed its IP from  to x.x.x.206 by DNS cache.
Jun 24 21:52:01 ip-1-2-3-206 haproxy[27147]: user-server/user-server2 changed its IP from  to x.x.x.174 by DNS cache.
Jun 24 21:52:01 ip-1-2-3-206 haproxy[27147]: user-server/user-server2 changed its IP from  to x.x.x.174 by DNS cache.
Jun 24 21:52:03 ip-1-2-3-206 haproxy[27147]: Server user-server/user-server2 is UP, reason: Layer4 check passed, check duration: 0ms. 1 active and 0 backup servers online. 0 sessions requeued, 0 total in queue.
Jun 24 21:52:03 ip-1-2-3-206 haproxy[27147]: Server user-server/user-server2 is UP, reason: Layer4 check passed, check duration: 0ms. 1 active and 0 backup servers online. 0 sessions requeued, 0 total in queue.
Jun 24 21:52:03 ip-1-2-3-206 haproxy[27147]: Server user-server/user-server1 is UP, reason: Layer4 check passed, check duration: 0ms. 2 active and 0 backup servers online. 0 sessions requeued, 0 total in queue.
Jun 24 21:52:03 ip-1-2-3-206 haproxy[27147]: Server user-server/user-server1 is UP, reason: Layer4 check passed, check duration: 0ms. 2 active and 0 backup servers online. 0 sessions requeued, 0 total in queue.
Jun 24 21:52:03 ip-1-2-3-206 haproxy[27147]: Server user-server/user-server3 is UP, reason: Layer4 check passed, check duration: 0ms. 3 active and 0 backup servers online. 0 sessions requeued, 0 total in queue.
Jun 24 21:52:03 ip-1-2-3-206 haproxy[27147]: Server user-server/user-server3 is UP, reason: Layer4 check passed, check duration: 0ms. 3 active and 0 backup servers online. 0 sessions requeued, 0 total in queue.

Jun 24 21:52:04 ip-1-2-3-206 haproxy.requests[27147]: {"request_time":"24/Jun/2018:21:52:04.892","host":"ip-1-2-3-206","protocol":"http","http_status":401,"user_agent":"curl/7.47.0","unique_id":"","headers":"{curl/7.47.0||user-server}","endpoint":"/","backend":"user-server","backend_name":"user-server","http_method":"GET","upstream_response_time":2,"upstream_connect_time":0,"bytes_read":144,"upstream_addr":"x.x.x.174","source_addr":"x.x.x.206","retries":"0","bytes_uploaded":75,"session_duration":2,"termination_state":"--","http_query_params":"","accept_time":0,"idle_time":0,"client_time":0,"wait_time":0,"download_time":0,"active_time":2}

These happen the moment HAProxy is reloaded. It happens on and off, not every time.

From the logs, there are two things I observe which, from my point of view, shouldn’t happen.

1. The server IP being changed.

  • I’m using Consul for service discovery, and I think this happens because the set of nodes returned by Consul’s DNS is randomized on each query. Could you confirm whether this is expected behaviour?

2. The server being marked as DOWN even though the service is alive, running on the host with that IP, and listening on that port.

  • I don’t understand why this happens.

Here is my config:

global
log /dev/log len 65535 local0 info alert
log /dev/log len 65535 local1 notice alert
user haproxy
group haproxy
nbproc 1
nbthread 1
stats socket /var/run/haproxy.sock mode 660 level admin
server-state-file /var/lib/haproxy/server-state
stats timeout 2m
master-worker

defaults
log global
mode http
option httplog
timeout connect 5s
timeout client 30s
timeout server 30s
timeout http-request 30s
timeout http-keep-alive 60s
timeout queue 120s
timeout check 10s
retries 10
option redispatch
option forwardfor
maxconn 10000
load-server-state-from-file global
default-server init-addr none fastinter 1s rise 2 fall 2 on-error fastinter
no option http-server-close
option tcp-smart-connect
option tcp-smart-accept
option splice-auto
errorfile 400 /etc/haproxy/errors/400.http
errorfile 403 /etc/haproxy/errors/403.http
errorfile 408 /etc/haproxy/errors/408.http
errorfile 500 /etc/haproxy/errors/500.http
errorfile 502 /etc/haproxy/errors/502.http
errorfile 503 /etc/haproxy/errors/503.http
errorfile 504 /etc/haproxy/errors/504.http

frontend haproxy-ping
bind :7070
mode http
monitor-uri /ping
http-request set-log-level silent
errorfile 200 /etc/haproxy/ping.http

frontend stats
bind *:1936
mode http
option forceclose
stats enable
stats uri /
stats hide-version
stats show-legends
stats show-desc
stats show-node
stats realm Haproxy\ Statistics

resolvers consul
nameserver consul 127.0.0.1:8600
accepted_payload_size 8192
resolve_retries 3
timeout retry 1s
hold valid 10s

frontend http-in
    bind *:80
    log-tag haproxy.requests
    capture request header User-Agent len 30
    capture request header X-Request-ID len 36
    capture request header Host len 32
    log-format "{\"request_time\":\"%t\",\"host\":\"%H\",\"protocol\":\"http\",\"http_status\":%ST,\"user_agent\":%{+Q}[capture.req.hdr(0)],\"unique_id\":%{+Q}[capture.req.hdr(1)],\"headers\":\"%hr\",\"endpoint\":\"%HP\",\"backend\":\"%b\",\"backend_name\":%{+Q}[capture.req.hdr(2)],\"http_method\":\"%HM\",\"upstream_response_time\":%Tr,\"upstream_connect_time\":%Tc,\"bytes_read\":%B,\"upstream_addr\":\"%si\",\"source_addr\":\"%bi\",\"retries\":\"%rc\",\"bytes_uploaded\":%U,\"session_duration\":%Tt,\"termination_state\":\"%ts\",\"http_query_params\":\"%HQ\",\"accept_time\":%Th,\"idle_time\":%Ti,\"client_time\":%TR,\"wait_time\":%Tw,\"download_time\":%Td,\"active_time\":%Ta}"

    use_backend user-server if { hdr(Host) -i user-server user-server.internal.example.com  }


backend user-server
    mode http
    server-template user-server 10 _user-server._tcp.service.consul resolvers consul resolve-prefer ipv4 check

When I need to reload, I do it with the following command:

socat /var/run/haproxy.sock - <<< "show servers state" > /var/lib/haproxy/server-state && /bin/systemctl reload haproxy

And my haproxy is built with the following options:

$ haproxy -vv
HA-Proxy version 1.8.9-2b5ef62 2018/06/11
Copyright 2000-2018 Willy Tarreau <willy@haproxy.org>

Build options :
  TARGET  = linux2628
  CPU     = generic
  CC      = gcc
  CFLAGS  = -g -O2 -fPIE -fstack-protector-strong -Wformat -Werror=format-security -Wdate-time -D_FORTIFY_SOURCE=2
  OPTIONS = USE_GETADDRINFO=1 USE_ZLIB=1 USE_REGPARM=1 USE_THREAD=1 USE_OPENSSL=1 USE_SYSTEMD=1 USE_PCRE=1 USE_PCRE_JIT=1 USE_TFO=1 USE_NS=1

Default settings :
  maxconn = 2000, bufsize = 16384, maxrewrite = 1024, maxpollevents = 200

Built with OpenSSL version : OpenSSL 1.0.2g  1 Mar 2016
Running on OpenSSL version : OpenSSL 1.0.2g  1 Mar 2016
OpenSSL library supports TLS extensions : yes
OpenSSL library supports SNI : yes
OpenSSL library supports : TLSv1.0 TLSv1.1 TLSv1.2
Built with transparent proxy support using: IP_TRANSPARENT IPV6_TRANSPARENT IP_FREEBIND
Encrypted password support via crypt(3): yes
Built with multi-threading support.
Built with PCRE version : 8.38 2015-11-23
Running on PCRE version : 8.38 2015-11-23
PCRE library supports JIT : yes
Built with zlib version : 1.2.8
Running on zlib version : 1.2.8
Compression algorithms supported : identity("identity"), deflate("deflate"), raw-deflate("deflate"), gzip("gzip")
Built with network namespace support.

Available polling systems :
      epoll : pref=300,  test result OK
       poll : pref=200,  test result OK
     select : pref=150,  test result OK
Total: 3 (3 usable), will use epoll.

Available filters :
	[SPOE] spoe
	[COMP] compression
	[TRACE] trace

I’ve tried tweaking this quite a lot, and I’ve been reading up on other posts about this and your responses to them, but I’m still not making progress in minimizing downtime for my service on reloads. Any help and suggestions would be greatly appreciated.

If you need help with testing from my end, please don’t hesitate to ask.

Really looking forward to nailing this once and for all.

Thanks!


#6

Hi all,

Just to let you know that I think I found the cause of the issue but I don’t have a fix yet.

I’ll come back to you this week with more info and hopefully a fix.

The issue seems to be in srv_init_addr(), because srv->hostname is not set (null).


#7

Hi @Baptiste

Thanks for responding so quickly. I just wanted to confirm what exactly it would fix: point 1, point 2 as mentioned in my post above, or both?

I think that if the IP changes for the service’s DNS record (i.e. point 1), it’s probably not such a huge issue compared to the server being marked as DOWN (i.e. point 2). It would be great if point 2 were addressed as well, because that’s what causes the downtime and unavailability of the service (i.e. the 503s).

Just wanted to make sure we’re on the same page. Feel free to respond when ever you can or if you need more info.

Thanks!


#8

Hi guys,

So, I had some time to dig deeper into this issue, and I can clearly say that in the current state, SRV records are not compatible with the server state file.

There is unfortunately no workaround for now.

I’m currently evaluating what’s missing in the server state file so the new process can recover from the old one.

(We obviously miss the SRV record itself, maybe more.)

I’ll keep you updated.

Baptiste


#9

Hi,

This will fix point #2 and point #1 should not happen at all.

The new server states file format will look like this:

# be_id be_name srv_id srv_name srv_addr srv_op_state srv_admin_state srv_uweight srv_iweight srv_time_since_last_change srv_check_status srv_check_result srv_check_health srv_check_state srv_agent_state bk_f_forced_id srv_f_forced_id srv_fqdn srv_port srvrecord
3 be 1 srv1 192.168.0.2 0 0 2 1 64 5 2 0 38 0 0 0 A2.tld 80 _http._tcp.srv.tld
3 be 2 srv2 192.168.0.3 0 0 2 1 64 5 2 0 38 0 0 0 A3.tld 80 _http._tcp.srv.tld
3 be 3 srv3 192.168.0.1 0 0 2 1 64 5 2 0 38 0 0 0 A1.tld 80 _http._tcp.srv.tld
3 be 4 srv6 192.168.0.4 0 0 2 1 64 7 2 0 6 0 0 0 A4.tld 80 _http._tcp.srv.tld
4 be2 1 srv1 192.168.0.2 0 0 2 1 64 5 2 0 38 0 0 0 A2.tld 80 _http._tcp.srv.tld
4 be2 2 srv6 192.168.0.3 0 0 2 1 64 7 2 0 6 0 0 0 A3.tld 80 _http._tcp.srv.tld
4 be2 3 srv7 192.168.0.1 0 32 1 1 64 7 2 0 14 0 0 0 A1.tld 80 -

See the last field: it contains the SRV record, which was used to get the server’s FQDN, which itself is used to get the IP address.
With this format (and a small patch in the server state file loader), everything comes back normally in the new process.
When resolution is enabled but not issued from an SRV record, you see a dash ‘-’ as the last field.

I’ll forward you a patch as soon as possible, so you can give it a try.

Baptiste


#10

That would be great @Baptiste

Let me know when I can try it; I’ll test it out then.


#11

Hi,

I sent a patch on the mailing list.

Trying to attach it, though I’m not sure if it will work well.

Baptiste


#12

Here is the link:
https://www.mail-archive.com/haproxy@formilux.org/msg30589.html

Drop me a mail at bedis9 gmail if you’re not on the ML and can’t apply the patch properly.


#13

Hi @Baptiste

I don’t see any attachment for the patch, and I’m not really sure how to apply it. I’m also not on the mailing list. However, I did send you an email. Could you guide me, either here or over email, on how I can apply and use the patch?


#14

Hi @Baptiste

Just an update: I sent you an email, and I tried running the patch, but I think I must not have done it right, because after applying it, it just didn’t work at all, i.e. the service was constantly producing 503s even though it was up and running.

Let me know if you’ve got my email; I explained how I applied the patch, so maybe you can correct me if I did it wrong.

Awaiting your response on this issue.

Thanks!


#15

I applied the patch to 1.8.8 and ran some quick tests. I can see the new data in the server state file, but it still showed the same behavior after reload: it would wait for the health checks to pass before sending requests, causing an outage. Please let me know what other info you need, or if I patched the wrong version.

1.8.8 no patch

Requests working - show servers state

# be_id be_name srv_id srv_name srv_addr srv_op_state srv_admin_state srv_uweight srv_iweight srv_time_since_last_change srv_check_status srv_check_result srv_check_health srv_check_state srv_agent_state bk_f_forced_id srv_f_forced_id srv_fqdn srv_port
6 backend_testapp 1 testapp1 192.168.64.146 2 0 1 1 43 15 3 4 6 0 0 0 docker01.marathon.mesos 443
6 backend_testapp 2 testapp2  0 0 1 1 65 5 2 0 6 0 0 0 - 0
6 backend_testapp 3 testapp3  0 0 1 1 62 5 2 0 6 0 0 0 - 0
6 backend_testapp 4 testapp4  0 0 1 1 60 5 2 0 6 0 0 0 - 0

After reload things not working - show servers state

# be_id be_name srv_id srv_name srv_addr srv_op_state srv_admin_state srv_uweight srv_iweight srv_time_since_last_change srv_check_status srv_check_result srv_check_health srv_check_state srv_agent_state bk_f_forced_id srv_f_forced_id srv_fqdn srv_port
6 backend_testapp 1 testapp1 192.168.64.146 0 0 1 1 20 15 3 1 6 0 0 0 docker01.marathon.mesos 443
6 backend_testapp 2 testapp2  0 0 1 1 17 5 2 0 6 0 0 0 - 0
6 backend_testapp 3 testapp3  0 0 1 1 14 5 2 0 6 0 0 0 - 0
6 backend_testapp 4 testapp4  0 0 1 1 12 5 2 0 6 0 0 0 - 0


patch -p1 -b < ../haproxy-srv.patch 
patching file include/types/server.h
patching file src/proxy.c
Hunk #2 succeeded at 1456 (offset -3 lines).
patching file src/server.c
Hunk #1 succeeded at 2640 (offset -38 lines).
Hunk #2 succeeded at 2664 (offset -38 lines).
Hunk #3 succeeded at 2797 (offset -38 lines).
Hunk #4 succeeded at 2935 (offset -39 lines).
Hunk #5 succeeded at 3224 (offset -39 lines).


Requests working - show servers state

# be_id be_name srv_id srv_name srv_addr srv_op_state srv_admin_state srv_uweight srv_iweight srv_time_since_last_change srv_check_status srv_check_result srv_check_health srv_check_state srv_agent_state bk_f_forced_id srv_f_forced_id srv_fqdn srv_port srvrecord
6 backend_testapp 1 testapp1 192.168.64.146 2 0 1 1 8 15 3 4 6 0 0 0 docker01.marathon.mesos 443 _testapp._tcp.marathon.mesos
6 backend_testapp 2 testapp2  0 0 1 1 30 5 2 0 6 0 0 0 - 0 _testapp._tcp.marathon.mesos
6 backend_testapp 3 testapp3  0 0 1 1 27 5 2 0 6 0 0 0 - 0 _testapp._tcp.marathon.mesos
6 backend_testapp 4 testapp4  0 0 1 1 25 5 2 0 6 0 0 0 - 0 _testapp._tcp.marathon.mesos


After reload things not working - show servers state

# be_id be_name srv_id srv_name srv_addr srv_op_state srv_admin_state srv_uweight srv_iweight srv_time_since_last_change srv_check_status srv_check_result srv_check_health srv_check_state srv_agent_state bk_f_forced_id srv_f_forced_id srv_fqdn srv_port srvrecord
6 backend_testapp 1 testapp1 192.168.64.146 0 0 1 1 7 15 3 1 6 0 0 0 docker01.marathon.mesos 443 _testapp._tcp.marathon.mesos
6 backend_testapp 2 testapp2  0 0 1 1 5 5 2 0 6 0 0 0 - 0 _testapp._tcp.marathon.mesos
6 backend_testapp 3 testapp3  0 0 1 1 1 5 2 0 6 0 0 0 - 0 _testapp._tcp.marathon.mesos
6 backend_testapp 4 testapp4  2 0 1 1 7 1 0 3 6 0 0 0 - 0 _testapp._tcp.marathon.mesos

Server-Template and Hot Reload?
#16

Here are my findings on my system.

95 default_backend 1 varnish1 10.100.40.107 2 0 1 1 22 15 3 4 6 0 0 0 test 4294934545 tcp.varnish

The port looks weird. I believe the port is 32785, but somehow in the state file it’s a strange unsigned value…

In the code there is this check, which prevents the server from loading successfully, because the error msg ends up non-empty:

port_str = params[14];
if (port_str) {
        port = strl2uic(port_str, strlen(port_str));
        if (port > USHRT_MAX) {
                chunk_appendf(msg, ", invalid srv_port value '%s'", port_str);
                port_str = NULL;
        }
}

#17

@FrancisL What version did you apply the patch to? I patched 1.8.8 and the port looks ok.


#18

[quote=“scarey, post:17, topic:2625”]
What version did you apply the patch to? I patched 1.8.8 and the port looks ok.
[/quote]

I actually used the latest version from GitHub… 1.8.9, I believe.

I actually found out the issue wasn’t the port itself; that port is an unsigned value, which is fine (I suppose).

What I found is that in server.c I had to comment this out, otherwise the process would somehow hang, returning 503s:

			/*
			// prepare DNS resolution for this server  (but aint this has already been done by the server-template function?)
			res = srv_prepare_for_resolution(srv, fqdn);
			if (res == -1) {
				ha_alert("could not allocate memory for DNS REsolution for server ... '%s'\n", srv->id);
				chunk_appendf(msg, ", can't allocate memory for DNS resolution for server '%s'", srv->id);
				HA_SPIN_UNLOCK(SERVER_LOCK, &srv->lock);
				goto out;
			}
			*/

#19

Hi guys,

You’re supposed to apply the patches to the -dev code. We’ll do the backport to 1.8 and see if this backport requires some updates.

Could you confirm the latest status on your side: working or not working (on 1.9, of course)?

Baptiste


#20

Hi @Baptiste

For me it doesn’t work (on 1.9-dev). I’ve also emailed you separately about this issue with logs. I’ll send you a reminder via email:

haproxy -vv
HA-Proxy version 1.9-dev0-e115e0-478 2018/07/12
Copyright 2000-2017 Willy Tarreau <willy@haproxy.org>

Build options :
  TARGET  = linux2628
  CPU     = generic
  CC      = gcc
  CFLAGS  = -g -O2 -fPIE -fstack-protector-strong -Wformat -Werror=format-security -Wdate-time -D_FORTIFY_SOURCE=2
  OPTIONS = USE_GETADDRINFO=1 USE_ZLIB=1 USE_REGPARM=1 USE_THREAD=1 USE_OPENSSL=1 USE_SYSTEMD=1 USE_PCRE=1 USE_PCRE_JIT=1 USE_TFO=1 USE_NS=1

Default settings :
  maxconn = 2000, bufsize = 16384, maxrewrite = 1024, maxpollevents = 200
Built with OpenSSL version : OpenSSL 1.0.2g  1 Mar 2016
Running on OpenSSL version : OpenSSL 1.0.2g  1 Mar 2016
OpenSSL library supports TLS extensions : yes
OpenSSL library supports SNI : yes
OpenSSL library supports : TLSv1.0 TLSv1.1 TLSv1.2
Built with transparent proxy support using: IP_TRANSPARENT IPV6_TRANSPARENT IP_FREEBIND
Encrypted password support via crypt(3): yes
Built with multi-threading support.
Built with PCRE version : 8.38 2015-11-23
Running on PCRE version : 8.38 2015-11-23
PCRE library supports JIT : yes
Built with zlib version : 1.2.8
Running on zlib version : 1.2.8

Compression algorithms supported : identity("identity"), deflate("deflate"), raw-deflate("deflate"), gzip("gzip")
Built with network namespace support.
Available polling systems :
      epoll : pref=300,  test result OK
       poll : pref=200,  test result OK
     select : pref=150,  test result OK

Total: 3 (3 usable), will use epoll.

Available filters :
[SPOE] spoe
[COMP] compression
[TRACE] trace 

I keep getting a 503:

>> curl user-server
503 Service Unavailable
No server is available to handle this request.