Understanding DNS resolver behaviour

Hi

We have a high-traffic HAProxy pod running in a k8s environment. This HAProxy pod acts as a proxy for many backend services. All backend services are headless services, so resolving a service's DNS name returns the real IP addresses of its pods.

We had an incident and the behaviour is a bit confusing. I am looking for an expert explanation of this behaviour.

Chain of events

  1. The VM team did a vMotion as part of their patching activity, so the VM moved from one hypervisor to another. They did this on March 8th, around 12:14 am PST - 12:17 am PST. A network disconnect of a few seconds during vMotion is normal.

  2. The HAProxy pod kept working fine until 1:08 am PST.

  3. The HAProxy pod then marked all backends as unavailable, so we get NOSRV in the HAProxy logs and also see DNS timeout messages for all services:

Mar 8 01:18:08 haproxy[103454]: Server /* is going DOWN for maintenance (DNS timeout status). 8 active and 0 backup servers left. 0 sessions active, 0 requeued, 0 remaining in queue

  4. HAProxy started recovering from 3:03 am PST onwards. We see DNS "(none) to IP" messages plus 200 status codes:
    Mar 8 03:03:19 haproxy[108195]: _backend/-svc1 changed its IP from (none) to 10.36.154.144 by dns/10.38.0.10.

HAProxy has the following DNS configuration:
resolvers dns
accepted_payload_size 8192
parse-resolv-conf
hold valid 10s
hold timeout 3600s
hold refused 3600s
hold obsolete 600s
hold other 3600s

Questions:

  1. Is the above behaviour expected with the above configuration?
  2. If the k8s cluster DNS servers are not reachable due to vMotion, or resolution is failing, then it should try 3 resolutions (I understood 3 is the default resolve_retries value if not specified), one every 10 seconds (the hold valid value). What is its behaviour after that?
    • Does it mark backends down for 3600 seconds (the hold timeout value)? If yes, why was the proxy working from 12:14 am PST and then failing from 1:03 am onwards?
    • Or is the hold timeout of 3600 seconds used as a valid cache, so the proxy kept working from 12:14 to 1:03 (~1 hour) and only later marked all backends down?
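
To make the timing behind the second reading concrete, here is a small sketch of the arithmetic. It assumes the log timestamps are in the same time zone as the vMotion window (PST) and that the single DOWN log line quoted above is representative of when backends were marked down. The variable names are only for illustration.

```python
from datetime import timedelta

# Times are offsets from midnight on March 8, copied from the notes above.
# Assumption: haproxy log timestamps and the vMotion window share a time zone.
HOLD_TIMEOUT = timedelta(seconds=3600)  # "hold timeout 3600s" from the config

vmotion_start = timedelta(hours=0, minutes=14)
vmotion_end = timedelta(hours=0, minutes=17)
first_down_log = timedelta(hours=1, minutes=18, seconds=8)   # "going DOWN" line
recovered_log = timedelta(hours=3, minutes=3, seconds=19)    # "(none) to IP" line

# When would a 3600 s hold period expire if it started during the vMotion?
expiry_earliest = vmotion_start + HOLD_TIMEOUT
expiry_latest = vmotion_end + HOLD_TIMEOUT

# How long after the vMotion window did the DOWN message appear,
# and how long did the outage last according to the two log lines?
gap_after_window = first_down_log - vmotion_end
outage = recovered_log - first_down_log

print("hold timeout would expire between", expiry_earliest, "and", expiry_latest)
print("DOWN logged", gap_after_window, "after the vMotion window ended")
print("time from DOWN to recovery:", outage)
```

If those assumptions hold, 3600 seconds after the vMotion window falls between about 1:14 and 1:17 am, which is close to the 01:18:08 DOWN message. That would fit the second reading, but I cannot confirm it from the logs alone.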

Any better explanation would be helpful, as these values are confusing.

~ Srinivas Kotaru