Need help troubleshooting roundrobin with Kubernetes


#1

We have Kubernetes clusters running in Google Cloud that use HAProxy as a reverse proxy, balancing to headless services. These services establish DNS SRV records for HAProxy to use for service discovery. We have externalTrafficPolicy set to Local on the internet-facing service, so that the original source IP gets passed through to HAProxy. Here’s an example of the balancing config for one of our backends, along with my resolver config:

backend tao-backend
   balance            roundrobin
   stick-table type ip size 1000k expire 14400m
   stick on src
   server-template tao 1 srv-tao.default.svc.cluster.local:6543 resolvers dns check

resolvers dns
   nameserver dns1 10.47.240.10:53
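
As an aside, a fuller resolvers section (the directive names below are standard HAProxy resolver options, though the values are only illustrative) might look like:

resolvers dns
   nameserver dns1 10.47.240.10:53
   resolve_retries       3
   timeout resolve       1s
   timeout retry         1s
   hold valid            10s
   accepted_payload_size 8192

accepted_payload_size can matter for SRV-based discovery in particular, since a response listing many pods can exceed the default 512-byte UDP payload and get truncated.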

Yesterday, I found documentation that indicated I needed to use the SRV records that begin with an underscore, but I also found a release note that says HAProxy will now also work if multiple A records are returned. I can tell from the HAProxy node that both the SRV and A record methods are returning IPs for each of the relevant pods, which indicates that Kubernetes considers both to be healthy. However, HAProxy in prod is sending traffic to only a single IP.

Here are the DNS records:

root@haproxy-867ddf67c5-7s7l7:~# dig +noall +answer SRV srv-tao.default.svc.cluster.local
srv-tao.default.svc.cluster.local. 30 IN SRV 10 50 0 3239353830366139.srv-tao.default.svc.cluster.local.
srv-tao.default.svc.cluster.local. 30 IN SRV 10 50 0 3336303363316365.srv-tao.default.svc.cluster.local.

root@haproxy-867ddf67c5-7s7l7:~# dig +noall +answer A srv-tao.default.svc.cluster.local
srv-tao.default.svc.cluster.local. 8 IN A 10.44.1.2
srv-tao.default.svc.cluster.local. 8 IN A 10.44.4.86

I know I could also use the SRV records of the form _service._proto.name., but it’s working fine in my dev environment without using that.

I’ve also enabled the stats module to try to get some idea of why a single pod keeps getting all the traffic, but with the following configuration it isn’t telling me anything useful:

backend stats
   stats enable
   stats auth  admin:********
   stats admin if TRUE
   stats realm   HAProxy\ Statistics
   stats refresh 5s
   stats show-desc
   stats show-legends

From the HAProxy logs, I can see that the requests reaching HAProxy carry the original source IP rather than the internet-facing service’s IP, so I’ve ruled out stickiness caused by every request sharing the same src IP.

Has anyone seen an issue like this before, or can anyone point me in the direction of how to determine exactly why HAProxy would use only one IP when DNS is returning 2?

Thanks!


#2

Hi,

Your server-template line allocates only a single server slot, so your backend ends up with one server and HAProxy has nothing to load-balance across.
Update the line to something like:

server-template tao 10 srv-tao.default.svc.cluster.local:6543 resolvers dns check
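
For context (paraphrasing the HAProxy documentation, so treat the exact expansion as a sketch): server-template with a count of 10 is shorthand for declaring ten server slots named tao1 through tao10, which the resolver then fills from the DNS response; slots without a matching record are put in maintenance mode. It behaves roughly like writing:

   server tao1  srv-tao.default.svc.cluster.local:6543 resolvers dns check
   ...
   server tao10 srv-tao.default.svc.cluster.local:6543 resolvers dns check

With only one slot, there is only ever one active server, no matter how many IPs DNS returns.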

#3

Thanks Baptiste, I had wondered if that might be the case, but I was confused, since one environment was load balancing and the other wasn’t. Then, after another restart yesterday, they switched - prod worked, but dev didn’t. Later, after yet another restart, they both worked. I had thought it might be the DNS deadlock bug that was fixed in 1.8.10, but after thinking about the server count, I think I understand what we were observing. We have 2 HAProxy pods behind a Kubernetes load balancer that uses ClientIP affinity, so whenever we didn’t observe load balancing, it was just that both HAProxy servers happened to resolve the same single IP into their one server slot.
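
For anyone hitting the same thing: the ClientIP affinity mentioned above is the Service’s sessionAffinity field. A minimal sketch of such a Service (names and ports here are hypothetical, not our actual manifest):

apiVersion: v1
kind: Service
metadata:
  name: haproxy                  # hypothetical name
spec:
  type: LoadBalancer
  externalTrafficPolicy: Local   # preserve the client source IP
  sessionAffinity: ClientIP      # pin each client IP to one HAProxy pod
  selector:
    app: haproxy
  ports:
  - port: 443
    targetPort: 443

Because each client IP is pinned to one HAProxy pod, a single-slot backend on that pod made all of that client’s traffic land on whichever pod IP the slot had resolved to.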