Config reload with dynamic service discovery via DNS

Hi @Baptiste,

I’m having much the same issue. I’ve tried the suggestions from your last response, but I’m still seeing my service (and other services) become unavailable during reloads. Here is what I’m observing, based on the logs.

I have a watch command that continuously curls a service endpoint, and quite often during reloads it returns an HTTP 503 Service Unavailable:

$ curl user-server
<html><body><h1>503 Service Unavailable</h1>
No server is available to handle this request.
</body></html>
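
For reference, the check loop is roughly the following (the exact interval and curl flags are approximate, not the literal command I run):

# print only the HTTP status code, once per second
watch -n1 'curl -s -o /dev/null -w "%{http_code}\n" http://user-server/'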

From the logs, I see the following:

Jun 24 21:51:51 ip-1-2-3-206 haproxy[25775]: Proxy user-server started.
Jun 24 21:51:51 ip-1-2-3-206 haproxy[25779]: Stopping backend user-server in 0 ms.

Jun 24 21:51:51 ip-1-2-3-206 haproxy[27147]: user-server/user-server1 changed its FQDN from (null) to test-ecs-i-xxx.node.eu-west-1.consul by 'SRV record'

Jun 24 21:51:52 ip-1-2-3-206 haproxy[27147]: Server user-server/user-server1 is DOWN, reason: Socket error, check duration: 0ms. 9 active and 0 backup servers left. 0 sessions active, 0 requeued, 0 remaining in queue.
Jun 24 21:51:52 ip-1-2-3-206 haproxy[27147]: Server user-server/user-server2 is DOWN, reason: Socket error, check duration: 0ms. 8 active and 0 backup servers left. 0 sessions active, 0 requeued, 0 remaining in queue.
Jun 24 21:51:52 ip-1-2-3-206 haproxy[27147]: Server user-server/user-server3 is DOWN, reason: Socket error, check duration: 0ms. 7 active and 0 backup servers left. 0 sessions active, 0 requeued, 0 remaining in queue.
Jun 24 21:51:52 ip-1-2-3-206 haproxy[27147]: Server user-server/user-server4 is DOWN, reason: Socket error, check duration: 0ms. 6 active and 0 backup servers left. 0 sessions active, 0 requeued, 0 remaining in queue.
Jun 24 21:51:52 ip-1-2-3-206 haproxy[27147]: Server user-server/user-server5 is DOWN, reason: Socket error, check duration: 0ms. 5 active and 0 backup servers left. 0 sessions active, 0 requeued, 0 remaining in queue.
Jun 24 21:51:52 ip-1-2-3-206 haproxy[27147]: Server user-server/user-server6 is DOWN, reason: Socket error, check duration: 0ms. 4 active and 0 backup servers left. 0 sessions active, 0 requeued, 0 remaining in queue.
Jun 24 21:51:52 ip-1-2-3-206 haproxy[27147]: Server user-server/user-server7 is DOWN, reason: Socket error, check duration: 0ms. 3 active and 0 backup servers left. 0 sessions active, 0 requeued, 0 remaining in queue.
Jun 24 21:51:52 ip-1-2-3-206 haproxy[27147]: Server user-server/user-server8 is DOWN, reason: Socket error, check duration: 0ms. 2 active and 0 backup servers left. 0 sessions active, 0 requeued, 0 remaining in queue.
Jun 24 21:51:52 ip-1-2-3-206 haproxy[27147]: Server user-server/user-server9 is DOWN, reason: Socket error, check duration: 0ms. 1 active and 0 backup servers left. 0 sessions active, 0 requeued, 0 remaining in queue.
Jun 24 21:51:52 ip-1-2-3-206 haproxy[27147]: Server user-server/user-server10 is DOWN, reason: Socket error, check duration: 0ms. 0 active and 0 backup servers left. 0 sessions active, 0 requeued, 0 remaining in queue.

Jun 24 21:51:52 ip-1-2-3-206 haproxy[27147]: backend user-server has no server available!

Jun 24 21:51:52 ip-1-2-3-206 haproxy.requests[27147]: {"request_time":"24/Jun/2018:21:51:52.821","host":"ip-1-2-3-206","protocol":"http","http_status":503,"user_agent":"curl/7.47.0","unique_id":"","headers":"{curl/7.47.0||user-server}","endpoint":"/","backend":"user-server","backend_name":"user-server","http_method":"GET","upstream_response_time":-1,"upstream_connect_time":-1,"bytes_read":213,"upstream_addr":"-","source_addr":"-","retries":"0","bytes_uploaded":75,"session_duration":0,"termination_state":"SC","http_query_params":"","accept_time":0,"idle_time":0,"client_time":0,"wait_time":-1,"download_time":-1,"active_time":0}

Jun 24 21:52:01 ip-1-2-3-206 haproxy[27147]: user-server/user-server1 changed its IP from  to x.x.x.177 by DNS cache.
Jun 24 21:52:01 ip-1-2-3-206 haproxy[27147]: user-server/user-server3 changed its IP from  to x.x.x.206 by DNS cache.
Jun 24 21:52:01 ip-1-2-3-206 haproxy[27147]: user-server/user-server2 changed its IP from  to x.x.x.174 by DNS cache.
Jun 24 21:52:03 ip-1-2-3-206 haproxy[27147]: Server user-server/user-server2 is UP, reason: Layer4 check passed, check duration: 0ms. 1 active and 0 backup servers online. 0 sessions requeued, 0 total in queue.
Jun 24 21:52:03 ip-1-2-3-206 haproxy[27147]: Server user-server/user-server1 is UP, reason: Layer4 check passed, check duration: 0ms. 2 active and 0 backup servers online. 0 sessions requeued, 0 total in queue.
Jun 24 21:52:03 ip-1-2-3-206 haproxy[27147]: Server user-server/user-server3 is UP, reason: Layer4 check passed, check duration: 0ms. 3 active and 0 backup servers online. 0 sessions requeued, 0 total in queue.

Jun 24 21:52:04 ip-1-2-3-206 haproxy.requests[27147]: {"request_time":"24/Jun/2018:21:52:04.892","host":"ip-1-2-3-206","protocol":"http","http_status":401,"user_agent":"curl/7.47.0","unique_id":"","headers":"{curl/7.47.0||user-server}","endpoint":"/","backend":"user-server","backend_name":"user-server","http_method":"GET","upstream_response_time":2,"upstream_connect_time":0,"bytes_read":144,"upstream_addr":"x.x.x.174","source_addr":"x.x.x.206","retries":"0","bytes_uploaded":75,"session_duration":2,"termination_state":"--","http_query_params":"","accept_time":0,"idle_time":0,"client_time":0,"wait_time":0,"download_time":0,"active_time":2}

All of this happens the moment HAProxy is reloaded, and it occurs intermittently rather than on every reload.

There are two things I observe that, from my point of view, shouldn’t happen:

1. The server IPs being changed.
- I’m using Consul for service discovery, and I think this happens because the set of nodes returned by Consul’s DNS is randomized on each query (see the dig check just after this list). Could you confirm whether this is expected behaviour?

2. The servers being marked as DOWN even though the service is alive and running on the host with that IP and listening on that port.
- I don’t understand why this happens.
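
As a quick check on point 1, Consul’s DNS interface can be queried directly and consecutive answers compared; the 127.0.0.1:8600 address and the SRV name below are the ones from the resolvers and backend sections of my config:

# run twice in a row and compare the order of the returned SRV records
dig @127.0.0.1 -p 8600 _user-server._tcp.service.consul SRV +short
dig @127.0.0.1 -p 8600 _user-server._tcp.service.consul SRV +short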

Here is my config:

global
log /dev/log len 65535 local0 info alert
log /dev/log len 65535 local1 notice alert
user haproxy
group haproxy
nbproc 1
nbthread 1
stats socket /var/run/haproxy.sock mode 660 level admin
server-state-file /var/lib/haproxy/server-state
stats timeout 2m
master-worker

defaults
log global
mode http
option httplog
timeout connect 5s
timeout client 30s
timeout server 30s
timeout http-request 30s
timeout http-keep-alive 60s
timeout queue 120s
timeout check 10s
retries 10
option redispatch
option forwardfor
maxconn 10000
load-server-state-from-file global
default-server init-addr none fastinter 1s rise 2 fall 2 on-error fastinter
no option http-server-close
option tcp-smart-connect
option tcp-smart-accept
option splice-auto
errorfile 400 /etc/haproxy/errors/400.http
errorfile 403 /etc/haproxy/errors/403.http
errorfile 408 /etc/haproxy/errors/408.http
errorfile 500 /etc/haproxy/errors/500.http
errorfile 502 /etc/haproxy/errors/502.http
errorfile 503 /etc/haproxy/errors/503.http
errorfile 504 /etc/haproxy/errors/504.http

frontend haproxy-ping
bind :7070
mode http
monitor-uri /ping
http-request set-log-level silent
errorfile 200 /etc/haproxy/ping.http

frontend stats
bind *:1936
mode http
option forceclose
stats enable
stats uri /
stats hide-version
stats show-legends
stats show-desc
stats show-node
stats realm Haproxy\ Statistics

resolvers consul
nameserver consul 127.0.0.1:8600
accepted_payload_size 8192
resolve_retries 3
timeout retry 1s
hold valid 10s

frontend http-in
    bind *:80
    log-tag haproxy.requests
    capture request header User-Agent len 30
    capture request header X-Request-ID len 36
    capture request header Host len 32
    log-format "{\"request_time\":\"%t\",\"host\":\"%H\",\"protocol\":\"http\",\"http_status\":%ST,\"user_agent\":%{+Q}[capture.req.hdr(0)],\"unique_id\":%{+Q}[capture.req.hdr(1)],\"headers\":\"%hr\",\"endpoint\":\"%HP\",\"backend\":\"%b\",\"backend_name\":%{+Q}[capture.req.hdr(2)],\"http_method\":\"%HM\",\"upstream_response_time\":%Tr,\"upstream_connect_time\":%Tc,\"bytes_read\":%B,\"upstream_addr\":\"%si\",\"source_addr\":\"%bi\",\"retries\":\"%rc\",\"bytes_uploaded\":%U,\"session_duration\":%Tt,\"termination_state\":\"%ts\",\"http_query_params\":\"%HQ\",\"accept_time\":%Th,\"idle_time\":%Ti,\"client_time\":%TR,\"wait_time\":%Tw,\"download_time\":%Td,\"active_time\":%Ta}"

    use_backend user-server if { hdr(Host) -i user-server user-server.internal.example.com  }


backend user-server
    mode http
    server-template user-server 10 _user-server._tcp.service.consul resolvers consul resolve-prefer ipv4 check

When I need to reload, I do it with the following command:

socat /var/run/haproxy.sock - <<< "show servers state" > /var/lib/haproxy/server-state && /bin/systemctl reload haproxy

And my haproxy is built with the following options:

$ haproxy -vv
HA-Proxy version 1.8.9-2b5ef62 2018/06/11
Copyright 2000-2018 Willy Tarreau <willy@haproxy.org>

Build options :
  TARGET  = linux2628
  CPU     = generic
  CC      = gcc
  CFLAGS  = -g -O2 -fPIE -fstack-protector-strong -Wformat -Werror=format-security -Wdate-time -D_FORTIFY_SOURCE=2
  OPTIONS = USE_GETADDRINFO=1 USE_ZLIB=1 USE_REGPARM=1 USE_THREAD=1 USE_OPENSSL=1 USE_SYSTEMD=1 USE_PCRE=1 USE_PCRE_JIT=1 USE_TFO=1 USE_NS=1

Default settings :
  maxconn = 2000, bufsize = 16384, maxrewrite = 1024, maxpollevents = 200

Built with OpenSSL version : OpenSSL 1.0.2g  1 Mar 2016
Running on OpenSSL version : OpenSSL 1.0.2g  1 Mar 2016
OpenSSL library supports TLS extensions : yes
OpenSSL library supports SNI : yes
OpenSSL library supports : TLSv1.0 TLSv1.1 TLSv1.2
Built with transparent proxy support using: IP_TRANSPARENT IPV6_TRANSPARENT IP_FREEBIND
Encrypted password support via crypt(3): yes
Built with multi-threading support.
Built with PCRE version : 8.38 2015-11-23
Running on PCRE version : 8.38 2015-11-23
PCRE library supports JIT : yes
Built with zlib version : 1.2.8
Running on zlib version : 1.2.8
Compression algorithms supported : identity("identity"), deflate("deflate"), raw-deflate("deflate"), gzip("gzip")
Built with network namespace support.

Available polling systems :
      epoll : pref=300,  test result OK
       poll : pref=200,  test result OK
     select : pref=150,  test result OK
Total: 3 (3 usable), will use epoll.

Available filters :
	[SPOE] spoe
	[COMP] compression
	[TRACE] trace

I’ve tried tweaking this quite a lot, and I’ve been reading up on other posts and your responses to them, but I’m still not making progress in minimizing my service’s downtime during reloads. Any help or suggestions would be greatly appreciated.

If you need help with testing from my end, please don’t hesitate to ask.

Really looking forward to nailing this once and for all.

Thanks!

Hi all,

Just to let you know that I think I found the cause of the issue but I don’t have a fix yet.

I’ll come back to you this week with more info and hopefully a fix.

The issue seems to be in srv_init_addr(), because srv->hostname is not set (null).

Hi @Baptiste

Thanks for responding so quickly. I just wanted to confirm what exactly this would fix: point 1, point 2 (as described in my post above), or both?

I think the IP changing for the service’s DNS record (point 1) is probably not such a big issue compared to the servers being marked as DOWN (point 2). It would be great if point 2 were addressed as well, because that is what causes the downtime and unavailability of the service (i.e. the 503s).

Just wanted to make sure we’re on the same page. Feel free to respond whenever you can, or if you need more info.

Thanks!

Hi guys,

So, I had some time to dig deeper into this issue, and I can clearly say that in the current state, SRV records are not compatible with the server state file.

There is unfortunately no workaround for now.

I’m currently evaluating what’s missing from the server state file so that the new process can recover from the old one.

(We are obviously missing the SRV record itself, maybe more.)

I’ll keep you updated.

Baptiste

Hi,

This will fix point #2, and point #1 should not happen at all.

The new server state file format will look like this:

1
# be_id be_name srv_id srv_name srv_addr srv_op_state srv_admin_state srv_uweight srv_iweight srv_time_since_last_change srv_check_status srv_check_result srv_check_health srv_check_state srv_agent_state bk_f_forced_id srv_f_forced_id srv_fqdn srv_port srvrecord
3 be 1 srv1 192.168.0.2 0 0 2 1 64 5 2 0 38 0 0 0 A2.tld 80 _http._tcp.srv.tld
3 be 2 srv2 192.168.0.3 0 0 2 1 64 5 2 0 38 0 0 0 A3.tld 80 _http._tcp.srv.tld
3 be 3 srv3 192.168.0.1 0 0 2 1 64 5 2 0 38 0 0 0 A1.tld 80 _http._tcp.srv.tld
3 be 4 srv6 192.168.0.4 0 0 2 1 64 7 2 0 6 0 0 0 A4.tld 80 _http._tcp.srv.tld
4 be2 1 srv1 192.168.0.2 0 0 2 1 64 5 2 0 38 0 0 0 A2.tld 80 _http._tcp.srv.tld
4 be2 2 srv6 192.168.0.3 0 0 2 1 64 7 2 0 6 0 0 0 A3.tld 80 _http._tcp.srv.tld
4 be2 3 srv7 192.168.0.1 0 32 1 1 64 7 2 0 14 0 0 0 A1.tld 80 -

See the last field: it contains the SRV record that was used to get the server’s FQDN, which in turn is used to get the IP address.
With this format (and a small patch to the server state file loader), everything comes back up normally in the new process.
When resolution is enabled but not driven by an SRV record, you see a dash ‘-’ as the last field.
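
As a quick sanity check, you can dump the running state over the stats socket (path taken from your config above) and print each server together with the last field; with the patch, that column should show the SRV record, or ‘-’ for servers not driven by an SRV record:

# server name + last column of "show servers state" (skip the version and header lines)
socat /var/run/haproxy.sock - <<< "show servers state" | awk 'NR > 2 && NF { print $4, $NF }'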

I’ll forward you a patch as soon as possible, so you can give it a try.

Baptiste

That would be great @Baptiste

Let me know when I can try it; I’ll test it out then.

Hi,

I sent a patch on the mailing list.

Trying to attach it, though I’m not sure if it will work well.

Baptiste

Here is the link:
https://www.mail-archive.com/haproxy@formilux.org/msg30589.html

Drop me a mail at bedis9 (gmail) if you’re not on the ML and can’t apply the patch properly.

Hi @Baptiste

I don’t see any attachment for the patch, and I’m not really sure how to apply it. I’m also not on the mailing list; however, I did send you an email. Could you guide me, either here or over email, on how to apply and use the patch?

Hi @Baptiste

Just a quick update. I sent you an email, and I tried the patch. I think I must not have applied it correctly, because afterwards it didn’t work at all, i.e. the service was constantly returning 503s even though it was up and running.

Let me know if you’ve got my email; I explained there how I applied the patch, so maybe you can correct me if I did something wrong.

Awaiting your response on this issue.

Thanks!

I applied the patch to 1.8.8 and ran some quick tests. I can see the new data in the server state file, but the behavior after a reload is the same: HAProxy waits for the health checks to pass before sending requests, which causes an outage. Please let me know what other info you need, or whether I patched the wrong version.

1.8.8 no patch

Requests working - show servers state

# be_id be_name srv_id srv_name srv_addr srv_op_state srv_admin_state srv_uweight srv_iweight srv_time_since_last_change srv_check_status srv_check_result srv_check_health srv_check_state srv_agent_state bk_f_forced_id srv_f_forced_id srv_fqdn srv_port
6 backend_testapp 1 testapp1 192.168.64.146 2 0 1 1 43 15 3 4 6 0 0 0 docker01.marathon.mesos 443
6 backend_testapp 2 testapp2  0 0 1 1 65 5 2 0 6 0 0 0 - 0
6 backend_testapp 3 testapp3  0 0 1 1 62 5 2 0 6 0 0 0 - 0
6 backend_testapp 4 testapp4  0 0 1 1 60 5 2 0 6 0 0 0 - 0

After reload things not working - show servers state

# be_id be_name srv_id srv_name srv_addr srv_op_state srv_admin_state srv_uweight srv_iweight srv_time_since_last_change srv_check_status srv_check_result srv_check_health srv_check_state srv_agent_state bk_f_forced_id srv_f_forced_id srv_fqdn srv_port
6 backend_testapp 1 testapp1 192.168.64.146 0 0 1 1 20 15 3 1 6 0 0 0 docker01.marathon.mesos 443
6 backend_testapp 2 testapp2  0 0 1 1 17 5 2 0 6 0 0 0 - 0
6 backend_testapp 3 testapp3  0 0 1 1 14 5 2 0 6 0 0 0 - 0
6 backend_testapp 4 testapp4  0 0 1 1 12 5 2 0 6 0 0 0 - 0


patch -p1 -b < ../haproxy-srv.patch 
patching file include/types/server.h
patching file src/proxy.c
Hunk #2 succeeded at 1456 (offset -3 lines).
patching file src/server.c
Hunk #1 succeeded at 2640 (offset -38 lines).
Hunk #2 succeeded at 2664 (offset -38 lines).
Hunk #3 succeeded at 2797 (offset -38 lines).
Hunk #4 succeeded at 2935 (offset -39 lines).
Hunk #5 succeeded at 3224 (offset -39 lines).


Requests working - show servers state

# be_id be_name srv_id srv_name srv_addr srv_op_state srv_admin_state srv_uweight srv_iweight srv_time_since_last_change srv_check_status srv_check_result srv_check_health srv_check_state srv_agent_state bk_f_forced_id srv_f_forced_id srv_fqdn srv_port srvrecord
6 backend_testapp 1 testapp1 192.168.64.146 2 0 1 1 8 15 3 4 6 0 0 0 docker01.marathon.mesos 443 _testapp._tcp.marathon.mesos
6 backend_testapp 2 testapp2  0 0 1 1 30 5 2 0 6 0 0 0 - 0 _testapp._tcp.marathon.mesos
6 backend_testapp 3 testapp3  0 0 1 1 27 5 2 0 6 0 0 0 - 0 _testapp._tcp.marathon.mesos
6 backend_testapp 4 testapp4  0 0 1 1 25 5 2 0 6 0 0 0 - 0 _testapp._tcp.marathon.mesos


After reload things not working - show servers state

# be_id be_name srv_id srv_name srv_addr srv_op_state srv_admin_state srv_uweight srv_iweight srv_time_since_last_change srv_check_status srv_check_result srv_check_health srv_check_state srv_agent_state bk_f_forced_id srv_f_forced_id srv_fqdn srv_port srvrecord
6 backend_testapp 1 testapp1 192.168.64.146 0 0 1 1 7 15 3 1 6 0 0 0 docker01.marathon.mesos 443 _testapp._tcp.marathon.mesos
6 backend_testapp 2 testapp2  0 0 1 1 5 5 2 0 6 0 0 0 - 0 _testapp._tcp.marathon.mesos
6 backend_testapp 3 testapp3  0 0 1 1 1 5 2 0 6 0 0 0 - 0 _testapp._tcp.marathon.mesos
6 backend_testapp 4 testapp4  2 0 1 1 7 1 0 3 6 0 0 0 - 0 _testapp._tcp.marathon.mesos

Here is my finding on my system.

95 default_backend 1 varnish1 10.100.40.107 2 0 1 1 22 15 3 4 6 0 0 0 test 4294934545 tcp.varnish

The port looks weird. I believe the port is 32785, but somehow in the state file it shows up as a strange unsigned value…

In the code there is this check, which prevents the server from being loaded successfully because the error msg ends up non-empty:

port_str = params[14];
if (port_str) {
        port = strl2uic(port_str, strlen(port_str));
        if (port > USHRT_MAX) {
                chunk_appendf(msg, ", invalid srv_port value '%s'", port_str);
                port_str = NULL;
        }
}
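
For what it’s worth, 4294934545 is 0xFFFF8011, i.e. exactly 32785 (0x8011) with the upper 16 bits set, which is what you would get if the 16-bit port were sign-extended somewhere on its way into the state file. A quick shell check of the arithmetic:

printf '0x%X\n' 4294934545        # prints 0xFFFF8011
echo $(( 4294934545 & 0xFFFF ))   # prints 32785, the expected port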

@FrancisL What version did you apply the patch to? I patched 1.8.8 and the port looks ok.

[quote="scarey, post:17, topic:2625"]
What version did you apply the patch to? I patched 1.8.8 and the port looks ok.
[/quote]

I actually used the latest version from GitHub… 1.8.9, I believe…

I actually found out that the issue wasn’t the port itself; that port is an unsigned value, which is fine… (I suppose).

What I found is that in server.c I had to comment this out, otherwise the process would somehow hang and return 503s:

			/*
			// prepare DNS resolution for this server (but hasn't this already been done by the server-template function?)
			res = srv_prepare_for_resolution(srv, fqdn);
			if (res == -1) {
				ha_alert("could not allocate memory for DNS REsolution for server ... '%s'\n", srv->id);
				chunk_appendf(msg, ", can't allocate memory for DNS resolution for server '%s'", srv->id);
				HA_SPIN_UNLOCK(SERVER_LOCK, &srv->lock);
				goto out;
			}
			*/

Hi guys,

You’re supposed to apply the patches to the -dev code. We’ll do the backport to 1.8 and see whether the backport requires some updates.

Could you confirm what the latest status is on your side: working or not working (on 1.9, of course)?

Baptiste

Hi @Baptiste

For me it doesn’t work (on 1.9-dev). I’ve also emailed you separately about this issue, with logs. I’ll send you a reminder via email:

haproxy -vv
HA-Proxy version 1.9-dev0-e115e0-478 2018/07/12
Copyright 2000-2017 Willy Tarreau <willy@haproxy.org>

Build options :
  TARGET  = linux2628
  CPU     = generic
  CC      = gcc
  CFLAGS  = -g -O2 -fPIE -fstack-protector-strong -Wformat -Werror=format-security -Wdate-time -D_FORTIFY_SOURCE=2
  OPTIONS = USE_GETADDRINFO=1 USE_ZLIB=1 USE_REGPARM=1 USE_THREAD=1 USE_OPENSSL=1 USE_SYSTEMD=1 USE_PCRE=1 USE_PCRE_JIT=1 USE_TFO=1 USE_NS=1

Default settings :
  maxconn = 2000, bufsize = 16384, maxrewrite = 1024, maxpollevents = 200
Built with OpenSSL version : OpenSSL 1.0.2g  1 Mar 2016
Running on OpenSSL version : OpenSSL 1.0.2g  1 Mar 2016
OpenSSL library supports TLS extensions : yes
OpenSSL library supports SNI : yes
OpenSSL library supports : TLSv1.0 TLSv1.1 TLSv1.2
Built with transparent proxy support using: IP_TRANSPARENT IPV6_TRANSPARENT IP_FREEBIND
Encrypted password support via crypt(3): yes
Built with multi-threading support.
Built with PCRE version : 8.38 2015-11-23
Running on PCRE version : 8.38 2015-11-23
PCRE library supports JIT : yes
Built with zlib version : 1.2.8
Running on zlib version : 1.2.8

Compression algorithms supported : identity("identity"), deflate("deflate"), raw-deflate("deflate"), gzip("gzip")
Built with network namespace support.
Available polling systems :
      epoll : pref=300,  test result OK
       poll : pref=200,  test result OK
     select : pref=150,  test result OK

Total: 3 (3 usable), will use epoll.

Available filters :
[SPOE] spoe
[COMP] compression
[TRACE] trace 

I keep getting a 503:

>> curl user-server
503 Service Unavailable
No server is available to handle this request.

For me it doesn’t work either: on 1.9 we get 503s as soon as the master notifies the workers to respawn… If I remove the srv_prepare_for_resolution section, then it seems to be fine…

Hi Francis,

What do you mean by removing srv_prepare_for_resolution?

Can you tell me exactly what you did, and share the state file, logs, and the output of haproxy in debug mode?

I mean, the code works on my laptop with consul as a DNS server.

Looking forward to fixing it!

Baptiste

Note: I am using Consul 1.2.0-dev (I noticed that Consul 1.2 seems to have IPv6 errors…)

This is what I actually removed from the server.c file:

/*
// prepare DNS resolution for this server (but hasn't this already been done by the server-template function?)
res = srv_prepare_for_resolution(srv, fqdn);
if (res == -1) {
	ha_alert("could not allocate memory for DNS REsolution for server ... '%s'\n", srv->id);
	chunk_appendf(msg, ", can't allocate memory for DNS resolution for server '%s'", srv->id);
	HA_SPIN_UNLOCK(SERVER_LOCK, &srv->lock);
	goto out;
}
*/

I also had to make some modifications in src/proxy.c:

https://github.com/ACenterA/haproxy/commit/1ec245208976366960ff62d25000985801b93e46#diff-70645453d998e55219270ded2f5b1b25
https://github.com/ACenterA/haproxy/commit/d99f3ee0644ad827f5fe9d10067223e62839bd2f

I know the state file changes might have some other impacts, but that is the only way I could get everything “working”.

In short, I launch "ab -c 10 -n 100 https://myhostname" and force a reload by sending a SIGUSR2 to the haproxy master, which restarts the workers; without these fixes, that then gives 503s on my end…
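
Roughly, the reproduction looks like this (the hostname and pid-file path are placeholders for my setup):

ab -c 10 -n 100 https://myhostname/ &     # keep requests in flight
kill -USR2 "$(cat /run/haproxy.pid)"      # force a reload: the master respawns the workers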

With the fixes I implemented, the service is stable across reloads. I don’t think they are the right fixes, though.

I’m using consul 1.1.0, but I don’t think the problem is related to it.

I don’t really understand the changes you made in proxy.c. Could you show me the final version of the file?
Also, could you show me the output of the state file?

There is an easier way to test if this all works.

Start HAProxy, wait a bit (1 minute), save the server state, then stop HAProxy.

Then start HAProxy in debug mode (comment out the master-worker statement in your config file).
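
Summed up as commands (the stats socket and state-file paths are the ones from the configuration earlier in this thread; adapt to your setup):

# start haproxy as usual, wait about a minute, then save the state:
socat /var/run/haproxy.sock - <<< "show servers state" > /var/lib/haproxy/server-state
# stop haproxy, comment out 'master-worker', and start it again in the foreground as shown below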

Here is an example with my configuration:

./haproxy -d -db -f ./srv-records_server-state.cfg
SNOTE: setting global.maxconn to 2000.
Available polling systems :
      epoll : pref=300,  test result OK
       poll : pref=200,  test result OK
     select : pref=150,  test result FAILED
Total: 3 (2 usable), will use epoll.

Available filters :
	[SPOE] spoe
	[COMP] compression
	[TRACE] trace

Using epoll() as the polling mechanism.

[WARNING] 217/164140 (22976) : Server www/srv5 is DOWN, changed from server-state after a reload. 9 active and 0 backup servers left. 0 sessions active, 0 requeued, 0 remaining in queue.
[WARNING] 217/164140 (22976) : Server www/srv6 is DOWN, changed from server-state after a reload. 8 active and 0 backup servers left. 0 sessions active, 0 requeued, 0 remaining in queue.
[WARNING] 217/164140 (22976) : Server www/srv7 is DOWN, changed from server-state after a reload. 7 active and 0 backup servers left. 0 sessions active, 0 requeued, 0 remaining in queue.
[WARNING] 217/164140 (22976) : Server www/srv8 is DOWN, changed from server-state after a reload. 6 active and 0 backup servers left. 0 sessions active, 0 requeued, 0 remaining in queue.
[WARNING] 217/164140 (22976) : Server www/srv9 is DOWN, changed from server-state after a reload. 5 active and 0 backup servers left. 0 sessions active, 0 requeued, 0 remaining in queue.
[WARNING] 217/164140 (22976) : Server www/srv10 is DOWN, changed from server-state after a reload. 4 active and 0 backup servers left. 0 sessions active, 0 requeued, 0 remaining in queue.

==> servers 1 to 4 do not show up above, because their state was fully loaded from the state file below:

1
# be_id be_name srv_id srv_name srv_addr srv_op_state srv_admin_state srv_uweight srv_iweight srv_time_since_last_change srv_check_status srv_check_result srv_check_health srv_check_state srv_agent_state bk_f_forced_id srv_f_forced_id srv_fqdn srv_port srvrecord
3 www 1 srv1 192.168.0.1 0 0 2 1 64 7 0 0 7 0 0 0 A1.tld 80 _http._tcp.be1.tld
3 www 2 srv2 192.168.0.4 0 0 2 1 63 7 2 0 6 0 0 0 A4.tld 80 _http._tcp.be1.tld
3 www 3 srv3 192.168.0.2 0 0 2 1 63 7 2 0 6 0 0 0 A2.tld 80 _http._tcp.be1.tld
3 www 4 srv4 192.168.0.3 0 0 2 1 63 7 2 0 6 0 0 0 A3.tld 80 _http._tcp.be1.tld
3 www 5 srv5 - 0 0 1 1 63 5 2 0 6 0 0 0 - 0 _http._tcp.be1.tld
3 www 6 srv6 - 0 0 1 1 63 5 2 0 6 0 0 0 - 0 _http._tcp.be1.tld
3 www 7 srv7 - 0 0 1 1 62 5 2 0 6 0 0 0 - 0 _http._tcp.be1.tld
3 www 8 srv8 - 0 0 1 1 62 5 2 0 6 0 0 0 - 0 _http._tcp.be1.tld
3 www 9 srv9 - 0 0 1 1 62 5 2 0 6 0 0 0 - 0 _http._tcp.be1.tld
3 www 10 srv10 - 0 0 1 1 62 5 2 0 6 0 0 0 - 0 _http._tcp.be1.tld

I can try to send you a patch with a lot of verbose messages, but it would be easier if I could access one of your boxes where this code is installed.