HAProxy 1.8.3 dynamic DNS resolvers problem

The problem:

When I reload HAProxy, my backend server goes DOWN each time because of a DNS timeout. The log output is:

[WARNING] 014/190450 (22) : Reexecuting Master process
[WARNING] 014/191701 (22) : parsing [/usr/local/etc/haproxy/haproxy.cfg:51]: 'log-format' overrides previous 'option httplog' in 'defaults' section.
[WARNING] 014/191701 (22) : Setting tune.ssl.default-dh-param to 1024 by default, if your workload permits it you should set it to at least 2048. Please set a value >= 1024 to make this warning disappear.
[WARNING] 014/191701 (22) : [haproxy.main()] Cannot raise FD limit to 2097186, limit is 1048576.
Jan 15 19:17:01 localhost haproxy[22]: Proxy name_resolver_http started.
Jan 15 19:17:01 localhost haproxy[22]: Proxy nginx_nginx-80-servers started.
Jan 15 19:17:01 localhost haproxy[47]: Stopping proxy stats in 0 ms.
Jan 15 19:17:01 localhost haproxy[47]: Stopping frontend name_resolver_http in 0 ms.
Jan 15 19:17:01 localhost haproxy[47]: Stopping backend nginx_nginx-80-servers in 0 ms.
Jan 15 19:17:01 localhost haproxy[47]: Proxy stats stopped (FE: 1676 conns, BE: 10 conns).
Jan 15 19:17:01 localhost haproxy[47]: Proxy name_resolver_http stopped (FE: 1707 conns, BE: 0 conns).
Jan 15 19:17:01 localhost haproxy[47]: Proxy nginx_nginx-80-servers stopped (FE: 0 conns, BE: 3 conns).
[WARNING] 014/191701 (22) : [haproxy.main()] FD limit (1048576) too low for maxconn=1048576/maxsock=2097186. Please raise 'ulimit-n' to 2097186 or more to avoid any trouble.
Jan 15 19:17:01 localhost haproxy[58]: Health check for server nginx_nginx-80-servers/nginx_nginx_9151bcdcd9d17452534689968a4ca067b3da3164a171de7731369a5862c9d646_80 succeeded, reason: Layer4 check passed, check duration: 0ms, status: 3/3 UP.
[WARNING] 014/191701 (22) : Former worker 47 exited with code 0
Jan 15 19:17:12 localhost haproxy[58]: Server nginx_nginx-80-servers/nginx_nginx_9151bcdcd9d17452534689968a4ca067b3da3164a171de7731369a5862c9d646_80 is going DOWN for maintenance (DNS timeout status). 0 active and 0 backup servers left. 0 sessions active, 0 requeued, 0 remaining in queue.
Jan 15 19:17:12 localhost haproxy[58]: backend nginx_nginx-80-servers has no server available!

My haproxy.cfg:

global
    maxconn 1048576
    stats socket /var/run/haproxy-admin.sock mode 660 level admin expose-fd listeners
    stats timeout 30s
    pidfile /var/run/haproxy.pid
    log 127.0.0.1 local0
    max-spread-checks 60s
    master-worker no-exit-on-failure
    nbthread 2

resolvers mydns
    nameserver dns1 127.0.0.11:53
    resolve_retries 3
    timeout retry   1s
    hold other      10s
    hold refused    10s
    hold nx         10s
    hold timeout    10s
    hold valid      10s

defaults
    mode http
    maxconn 1048576

    balance roundrobin
    timeout connect 5000ms
    timeout client 65000ms
    timeout server 65000ms
    timeout tunnel 3600s
    timeout check 5s

    option httplog
    option dontlognull
    option http-server-close
    option abortonclose
    option log-health-checks
    log global
    log-format %ci:%cp\ [%t]\ %Tr\ %s\ %ST\ %B\ %hr\ %hs\ %H\ %{+Q}r

    # If sending a request to one server fails, try to send it to another, 3 times
    # before aborting the request
    retries 3
    #http-reuse safe
    option forwardfor
    # Do not enforce session affinity (i.e., an HTTP session can be served by
    # any Mongrel, not just the one that started the session
    option redispatch
    no option checkcache
    option accept-invalid-http-response
    option accept-invalid-http-request
    default-server init-addr last,libc,none

frontend name_resolver_http
    bind *:80
    errorfile 503 /usr/local/etc/haproxy/errors/503.http
    capture request header Host len 80
    monitor-uri /haproxy-monitor
    acl is_websocket hdr(Upgrade) -i WebSocket

    acl is_appone.example.com hdr_reg(host) -i ^appone.example.com(:[0-9]+)?$
    acl is_appone.example.com_port hdr(host) -i appone.example.com:80
    use_backend nginx_nginx-80-servers if is_appone.example.com or is_appone.example.com_port

backend nginx_nginx-80-servers
    server nginx_nginx_9151bcdcd9d17452534689968a4ca067b3da3164a171de7731369a5862c9d646_80 nginx_nginx.1.uh0hgakyv1pe5pigo2xrligm2:80 cookie 9151bcdcd9d17452534689968a4ca067b3da3164a171de7731369a5862c9d646 weight 100 check resolvers mydns
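
For reference, the reload that triggers the DOWN event is the normal master-worker reload; in my setup it is roughly the following (the exact mechanism may differ, e.g. when running inside Docker):

# send SIGUSR2 to the master process so it re-executes itself and reloads the config
kill -USR2 $(cat /var/run/haproxy.pid)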

At the same time, I captured the packets with tcpdump:

19:17:01.297238 Out 02:42:ac:11:00:04 ethertype IPv4 (0x0800), length 101: (tos 0x0, ttl 64, id 21822, offset 0, flags [DF], proto UDP (17), length 85)
    172.17.0.4.56594 > 100.100.2.138.53: [bad udp cksum 0x1356 -> 0x967e!] 44772+ A? nginx_nginx.1.uh0hgakyv1pe5pigo2xrligm2. (57)
19:17:01.307256  In 02:42:39:3e:c5:dd ethertype IPv4 (0x0800), length 176: (tos 0x0, ttl 63, id 23130, offset 0, flags [none], proto UDP (17), length 160)
    100.100.2.138.53 > 172.17.0.4.56594: [udp sum ok] 44772 NXDomain q: A? nginx_nginx.1.uh0hgakyv1pe5pigo2xrligm2. 0/1/0 ns: . [3h] SOA a.root-servers.net. nstld.verisign-grs.com. 2018011500 1800 900 604800 86400 (132)
19:17:01.307382  In 00:00:00:00:00:00 ethertype IPv4 (0x0800), length 176: (tos 0x0, ttl 64, id 9331, offset 0, flags [DF], proto UDP (17), length 160)
    127.0.0.11.53 > 127.0.0.1.50611: [bad udp cksum 0xfea9 -> 0xc9b2!] 44772 NXDomain q: A? nginx_nginx.1.uh0hgakyv1pe5pigo2xrligm2. 0/1/0 ns: . [3h] SOA a.root-servers.net. nstld.verisign-grs.com. 2018011500 1800 900 604800 86400 (132)
19:17:01.417324  In 00:00:00:00:00:00 ethertype IPv4 (0x0800), length 156: (tos 0x0, ttl 64, id 9348, offset 0, flags [DF], proto UDP (17), length 140)
    127.0.0.11.53 > 127.0.0.1.55220: [bad udp cksum 0xfe95 -> 0x49ce!] 51922 q: A? nginx_nginx.1.uh0hgakyv1pe5pigo2xrligm2. 1/0/0 nginx_nginx.1.uh0hgakyv1pe5pigo2xrligm2. [10m] A 10.254.0.5 (112)
19:17:01.417499  In 00:00:00:00:00:00 ethertype IPv4 (0x0800), length 156: (tos 0x0, ttl 64, id 9349, offset 0, flags [DF], proto UDP (17), length 140)
    127.0.0.11.53 > 127.0.0.1.53913: [bad udp cksum 0xfe95 -> 0x7275!] 42822 q: A? nginx_nginx.1.uh0hgakyv1pe5pigo2xrligm2. 1/0/0 nginx_nginx.1.uh0hgakyv1pe5pigo2xrligm2. [10m] A 10.254.0.5 (112)
19:17:01.447001  In 00:00:00:00:00:00 ethertype IPv4 (0x0800), length 101: (tos 0x0, ttl 64, id 9353, offset 0, flags [DF], proto UDP (17), length 85)
    127.0.0.11.53 > 127.0.0.1.60407: [bad udp cksum 0xfe5e -> 0xf1d5!] 48670 q: AAAA? nginx_nginx.1.uh0hgakyv1pe5pigo2xrligm2. 0/0/0 (57)
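
For anyone who wants to reproduce the capture, a tcpdump invocation roughly along these lines should work (the interface and options here are an approximation, not the exact command I used):

tcpdump -nnvve -i any udp port 53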

We can see a successful DNS response at 19:17:01, but the HAProxy process marked my nginx backend server DOWN at 19:17:12.

What happened? This problem never occurred with HAProxy 1.8.2.

Can you try without nbthread please?

Yes, it works fine when I set nbthread to one.
But we hope to enable the multi-threading feature, so please include a fix in the next release.
Looking forward to it, and thanks a lot.

Absolutely, it’s just to confirm whether this is related to multithreading or not.

Can you confirm that you had multithreading enabled when using 1.8.2, i.e. that multithreading actually worked together with DNS resolution in 1.8.2?

Yes… I’m sure…
The HAProxy config is the same as with 1.8.2; the only difference is the HAProxy version, nothing else.
When I run HAProxy 1.8.2 with multithreading enabled, it works fine, but it does not with 1.8.3.

@lukastribus, I’ve found one more problem. Some processes stop discovering and adding new backend servers. There are no errors or warnings in the log file.

grep nbp /etc/haproxy/haproxy.cfg
nbproc 6

for i in $(seq 1 6); do echo "show servers state" | socat unix-connect:/var/run/haproxy-$i.sock stdio|grep ec2.int|wc -l; done
41
41
41
41
29
41

haproxy -v
HA-Proxy version 1.8.3-205f675 2017/12/30
Copyright 2000-2017 Willy Tarreau willy@haproxy.org

The issue is present in 1.8.2 also

Here is the same server after 24 hours:

for i in $(seq 1 6); do echo "show servers state" | socat unix-connect:/var/run/haproxy-$i.sock stdio|grep ec2.int|wc -l; done
37
37
37
37
29
37

A little bit later:

for i in $(seq 1 6); do echo "show servers state" | socat unix-connect:/var/run/haproxy-$i.sock stdio|grep ec2.int|wc -l; done
36
36
36
36
29
36

Here is an example from another server:

for i in $(seq 1 6); do echo "show servers state" | socat unix-connect:/var/run/haproxy-$i.sock stdio|grep ec2.int|wc -l; done
35
35
35
35
47
47

This bug depends on the number of servers in the backend. It works more or less fine with fewer than 25-30 servers.

It looks like the discovery process just hangs. Is there any way to get more debug information on this, @lukastribus?
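
In the meantime, to narrow it down, I can list which servers are missing from the lagging process instead of just counting them; a rough sketch (srv_name is the 4th field of "show servers state" output, and I am assuming the server names contain ec2.int, as in the grep above):

for i in 4 5; do echo "show servers state" | socat unix-connect:/var/run/haproxy-$i.sock stdio | awk '!/^#/ && $4 ~ /ec2\.int/ {print $4}' | sort > /tmp/state-$i.txt; done
diff /tmp/state-4.txt /tmp/state-5.txt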

@ASA please don’t hijack other threads. If you have a new issue kindly open a new thread.

Also provide the configuration and the output of haproxy -vv (not -v).

@Baptiste you may want to check this out

Sorry. I’ve fixed that

@quanzhao: please try the latest 1.8-git (commit 945f4cf08 or later), or use tomorrow’s 1.8-snapshot (it should be available at about 09:00 UTC tomorrow):

http://www.haproxy.org/download/1.8/src/snapshot/haproxy-ss-20180124.tar.gz

There are multiple bugfixes related to threading, startup, polling and master/worker mode; your issue is likely a symptom of one of those already-fixed bugs.
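
If you prefer building from source instead of waiting for the snapshot, something along these lines should work for 1.8 on Linux (the build target and make flags here are a sketch and may need adjusting for your platform):

git clone http://git.haproxy.org/git/haproxy-1.8.git
cd haproxy-1.8
make TARGET=linux2628 USE_THREAD=1 USE_OPENSSL=1
sudo make install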

Thanks a lot, I will try it later…

Still experiencing this error with the Docker haproxy:alpine image.
Has the image already been updated?

Looks like it was updated a few hours ago to HAProxy 1.8.4, which should contain all the fixes.

I tried with FROM haproxy:1.8.4-alpine.
Unfortunately, the issue persists.
I am using haproxy:1.7-alpine for now.

@MonsieurWave I have no clue what issue you are talking about. If you have the same issue as the OP, then you just need to disable the (experimental) nbthread feature, and it will work fine with 1.8.4. If it’s a different issue, then please open a new thread.
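
To be explicit, disabling nbthread just means removing (or commenting out) the directive in the global section; without it, HAProxy 1.8 runs with a single thread by default. Sketched against the OP’s config:

global
    # nbthread 2    # removed/commented out; 1.8 then runs single-threaded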

@lukastribus: Yes, I am talking about the same issue:
Jan 15 19:17:12 localhost haproxy[58]: Server nginx_nginx-80-servers/nginx_nginx_9151bcdcd9d17452534689968a4ca067b3da3164a171de7731369a5862c9d646_80 is going DOWN for maintenance (DNS timeout status). 0 active and 0 backup servers left. 0 sessions active, 0 requeued, 0 remaining in queue.
Jan 15 19:17:12 localhost haproxy[58]: backend nginx_nginx-80-servers has no server available!

Setting the config to nbthread 1 did not help.