Aiming to solve the issue where HAProxy would only resolve DNS at startup instead of at runtime, I created a new Google Cloud VM running HAProxy 2.2.9.
I set up the resolvers block:
resolvers pc-dns
    nameserver pcdns "Google Internal DNS Resolver IP"
    resolve_retries 30
    timeout retry 1s
    hold valid 5s
Then added this at the end of each backend server line:
....... check port 80 inter 1s fall 1 rise 3 resolvers pc-dns
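For context, a minimal sketch of how those pieces fit together (the backend name, server name, and address are placeholders, not the real config; 169.254.169.254 is the standard GCP internal resolver):

resolvers pc-dns
    # an address:port pair is required for nameserver entries
    nameserver pcdns 169.254.169.254:53
    resolve_retries 30
    timeout retry 1s
    hold valid 5s

backend app_backend
    # "resolvers pc-dns" ties runtime DNS re-resolution to the section above
    server app01 app01.example.internal:80 check port 80 inter 1s fall 1 rise 3 resolvers pc-dns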
I have tried both:
resolvers pc-dns
    nameserver dns1 "Google Internal DNS Resolver IP":53
    hold valid 5s
and
resolvers pc-dns
    parse-resolv-conf
    hold valid 10s
But I ended up with the same result: after 40s or so, the backend servers are set to MAINT, even though the servers are up and running.
Hey @lukastribus, sorry for tagging you, but I am hoping you could point me in the right direction.
This is the resolvers section. Google Cloud VMs have an internal nameserver in their resolv.conf, which is why I am not specifying it explicitly here:
resolvers pcdns
    parse-resolv-conf
    resolve_retries 3
    timeout resolve 1s
    timeout retry 1s
    hold other 30s
    hold refused 30s
    hold nx 30s
    hold timeout 30s
    hold valid 10s
    hold obsolete 30s
The backend looks like:
backend....
    mode http
    option httpchk
    http-check send meth GET uri /session/status
    http-check expect string success
    server ............ weight 10 check port 80 inter 1s fall 1 rise 3 resolvers pcdns
After a systemctl reload haproxy, the stats page shows everything up for a few seconds before everything is set to MAINT with "resolution" under the check column.
I am trying to solve this with HAProxy. Google LB sucks big time, and I am kind of the one blocking that move. HAProxy is way better for our env.
Servers will go into MAINT on DNS resolution failures, so yes, this just means you need basic DNS troubleshooting.
To disable initial libc resolution, and therefore get a clearer failure picture in this situation (it won't fail after some time, but immediately), use the init-addr last server option.
The option below gives me the error "no method found to resolve address".
I saw another post saying you can use none instead of last. With that, the error message is gone, but the problem persists.
defaults
    default-server init-addr last
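For reference, init-addr accepts a comma-separated fallback list, so a common pattern (a sketch, not necessarily what this particular setup needs) is:

defaults
    # try the last known DNS answer, then a libc lookup at startup,
    # and finally start with no address at all instead of failing hard
    default-server init-addr last,libc,none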
I have also changed it from/to:
# from
resolvers pcdns
    parse-resolv-conf

# to
resolvers pcdns
    # Added the line below
    nameserver dns1 169.254.169.254:53
@baptiste64 sorry for the noob question, but first, I don't entirely know how to do that DNS capture (Google should help me out).
Also, does that mean HAProxy is receiving on port 53? If that is true, then the problem is the firewall.
The HAProxy VM has only HTTP/HTTPS and port 8161 open to the network.
Today is my deadline; my manager is pushing Google LB, but I am being stubborn and need to find a solution by COB lol
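A capture along these lines will show the DNS exchange (a sketch; the interface and filter may need adjusting for your VM):

# -nn: no name/port resolution, -vvv: print UDP checksums and DNS details
sudo tcpdump -nn -vvv -i any udp port 53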
HAPROXY_VM_HOSTNAME.internal.49972 > 169.254.169.254: [bad udp cksum 0x6879 -> 0x1f18!] ...BACKEND_SERVER... to be continued
169.254.169.254 > HAPROXY_VM_HOSTNAME.internal.54778: [udp sum ok] ...BACKEND_SERVER... ...to be continued...
I don't think those "bad udp cksum" are supposed to be there, especially because it is the VM calling the internal DNS resolver.
I tried that tcpdump on my desktop and I got some bad udp cksums too.
In my case, though, I use Pi-Hole and firewall rules to filter DNS requests, so I cannot use it as an example.
But please share what you think, as I could be 100% wrong.
Ok, so from the outputs you can see that the DNS server responds with NXDomain, meaning that the DNS server did not find any records for the hostname you are querying. (The bad udp cksum on outgoing packets is harmless, by the way: with checksum offloading, tcpdump captures the packet before the NIC fills in the checksum.)
I think the problem here could be the missing search path. It is in resolv.conf, but HAProxy only parses the nameserver entries out of it, not the search path, and its runtime resolvers query DNS directly, bypassing /etc/hosts entirely.
Can you try changing:
session01 to session01.MY_CLOUD_PROJECT_ID_HERE.internal google.internal
and session01-dev to session01-dev.MY_CLOUD_PROJECT_ID_HERE.internal google.internal
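A sketch of what an adjusted server line could look like (the backend name and port are placeholders, and the FQDN placeholder from above is abbreviated):

backend session_backend
    # fully-qualified name, since the resolvers section never sees the
    # search path from resolv.conf
    server session01 session01.MY_CLOUD_PROJECT_ID_HERE.internal:80 weight 10 check port 80 inter 1s fall 1 rise 3 resolvers pcdns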
I made the changes and voilà, it worked.
tcpdump is returning A/AAAA records and no more NXDomain.
HAProxy is no longer setting backends to MAINT.
It doesn't make sense, though: /etc/hosts has both session01 and session01.MY_CLOUD_PROJECT_ID_HERE.internal google.internal laid out on all the VMs. I might need to contact the terrible Google Cloud support again.
I just deleted one of the backends, built a new VM, released the microservice to it (via Jenkins), and voilà: HAProxy started talking to it once the VM passed the health check, without reloading HAProxy.