Stick table backend inconsistency with DNS resolvers during rolling upgrades

Hello community,

We’re running HAProxy in Kubernetes as a sticky load balancer in front of a deployment of five pods (plain HAProxy, not the ingress controller variant). Here’s our config:

global
	daemon
	maxconn 10000
	stats socket /usr/local/etc/haproxy/admin.sock mode 600 level admin
	log /dev/log local0
	
defaults
	mode http
	timeout connect 5000ms
	timeout client 30000ms
	timeout server 30000ms

resolvers kubernetes
	nameserver skydns kube-dns.kube-system:53
	resolve_retries 10
	timeout retry 2s
	hold valid 5s

frontend http-in
	bind *:80
	log /dev/log local0
	option httplog
	default_backend servers

backend servers
	balance roundrobin
	stick-table type string size 100m
	option httpchk GET /health
	http-check expect status 200
	option tcp-check
	stick on path,word(3,/)
	server-template pod 5 pod.namespace.svc.cluster.local:8080 check resolvers kubernetes inter 500

As you can see, we’re leveraging server-template and the Kubernetes DNS resolvers to create the backend servers dynamically.
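
In case it helps, this is how the dynamically created servers can be inspected at runtime through the admin socket (assuming socat is available in the HAProxy container; the socket path is the one from the global section above):

# Dump the runtime state of the servers generated from the template;
# the output includes srv_id, srv_name and srv_addr, so you can see
# which IP each podN slot currently points to
echo "show servers state servers" | socat stdio /usr/local/etc/haproxy/admin.sock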

We’ve been pretty happy with the stick-table approach so far, but we run into issues when doing a rolling upgrade of the backend pods. Kubernetes starts terminating old pods while spinning up a couple of new ones, and waits until the old ones have completely wound down.

During this rolling upgrade we’ve observed an inconsistency in the stick table: HAProxy happily logs all requests as going to what it thinks is pod3, but in the request logs of the backends we see them ending up on three different backend pods.

Here’s the HAProxy request log for that period:

[29/Apr/2019:10:39:46.149] http-in servers/pod3 0/0/0/12812/12813 200 1031 - - ---- 303/303/235/3/0 0/0 "GET /v1/document/sticky_1556013348827_0fmbgvj50nmc/stepssince/983?_no_ie_cache=1556527174123 HTT
[29/Apr/2019:10:39:47.194] http-in servers/pod3 0/0/0/30/30 200 338 - - ---- 300/300/234/3/0 0/0 "GET /v1/document/sticky_1556013348827_0fmbgvj50nmc/stepssince/1006?_no_ie_cache=1556527187165 HTTP/1.1"
[29/Apr/2019:10:39:47.197] http-in servers/pod3 0/0/0/31/31 200 309 - - ---- 300/300/233/2/0 0/0 "POST /v1/document/sticky_1556013348827_0fmbgvj50nmc/steps HTTP/1.1"
[29/Apr/2019:10:39:47.277] http-in servers/pod3 0/0/0/34/34 200 338 - - ---- 300/300/233/2/0 0/0 "GET /v1/document/sticky_1556013348827_0fmbgvj50nmc/stepssince/1006?_no_ie_cache=1556527187248 HTTP/1.1"

HAProxy still thinks it is sending everything to pod3.
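
For reference, the mapping from sticky key to server id can also be checked directly from the stick table through the admin socket (again assuming socat is available in the container):

# Dump the stick table of the "servers" backend; each entry shows the
# stored key together with the server_id it is stuck to
echo "show table servers" | socat stdio /usr/local/etc/haproxy/admin.sock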

Here’s what we receive on the backend side, where requests with the same sticky_id path fragment hit different pods at nearly the same time (request log -> pod identifier):

2019-04-29T10:39:46.962Z 'REQUEST-OK [GET] [/v1/document/sticky_1556013348827_0fmbgvj50nmc/stepssince/983]' -> pod-799496f69-dbp8s
2019-04-29T10:39:47.223Z 'REQUEST-OK [GET] [/v1/document/sticky_1556013348827_0fmbgvj50nmc/stepssince/1006]' -> pod-58fdfcc477-zjcrq
2019-04-29T10:39:47.227Z 'REQUEST-OK [POST] [/v1/document/sticky_1556013348827_0fmbgvj50nmc/steps]' -> pod-799496f69-dbp8s
2019-04-29T10:39:47.306Z 'REQUEST-START [GET] [/v1/document/sticky_1556013348827_0fmbgvj50nmc/stepssince/1006]'  -> pod-58fdfcc477-zjcrq

Here are the proxy logs for that period, showing how the DNS resolver switches the server IPs:

April 29th 2019, 10:39:19.000	[WARNING] 118/083919 (1) : Server servers/pod1 is going DOWN for maintenance (No IP for server ). 4 active and 0 backup servers left. 11 sessions active, 0 requeued, 0 remaining in queue.
April 29th 2019, 10:39:34.000	[WARNING] 118/083934 (1) : Server servers/pod1 ('pod.namespace.svc.cluster.local') is UP/READY (resolves again).
April 29th 2019, 10:39:34.000	[WARNING] 118/083934 (1) : Server servers/pod1 administratively READY thanks to valid DNS answer.
April 29th 2019, 10:39:34.000	[WARNING] 118/083934 (1) : Server servers/pod5 is going DOWN for maintenance (No IP for server ). 4 active and 0 backup servers left. 3 sessions active, 0 requeued, 0 remaining in queue.
April 29th 2019, 10:39:34.000	[WARNING] 118/083934 (1) : servers/pod1 changed its IP from 172.20.5.223 to 172.20.4.207 by kubernetes/skydns.
April 29th 2019, 10:39:34.000	[WARNING] 118/083934 (1) : servers/pod2 changed its IP from 172.20.4.248 to 172.20.4.245 by DNS cache.
April 29th 2019, 10:39:39.000	[WARNING] 118/083939 (1) : servers/pod3 changed its IP from 172.20.5.50 to 172.20.5.232 by DNS cache.
April 29th 2019, 10:39:49.000	[WARNING] 118/083949 (1) : Server servers/pod5 ('pod.namespace.svc.cluster.local') is UP/READY (resolves again).
April 29th 2019, 10:39:49.000	[WARNING] 118/083949 (1) : Server servers/pod5 administratively READY thanks to valid DNS answer.
April 29th 2019, 10:39:49.000	[WARNING] 118/083949 (1) : servers/pod4 changed its IP from 172.20.5.176 to 172.20.4.111 by DNS cache.
April 29th 2019, 10:39:49.000	[WARNING] 118/083949 (1) : servers/pod5 changed its IP from 172.20.5.98 to 172.20.5.164 by DNS cache.
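
For completeness, the resolver state itself can also be dumped from the admin socket (same socat assumption as above):

# Show per-nameserver resolution statistics for the "kubernetes"
# resolvers section (valid / outdated / error counters, etc.)
echo "show resolvers kubernetes" | socat stdio /usr/local/etc/haproxy/admin.sock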

HAProxy runs as a single pod, and it did not restart or do anything else that would have wiped the stick table.

We suspect this might be due to connections being pooled and held open while the server underneath is actually changing its IP via the DNS resolver. Does HAProxy support draining for such a scenario when using server-template? What else could cause this behaviour?
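
One workaround we were considering (untested, just a sketch based on the server keyword docs) is to close existing sessions as soon as a pod’s record disappears from the DNS answer and the slot goes into maintenance, by adding on-marked-down shutdown-sessions to the server-template line:

# in "backend servers": kill existing sessions when a slot goes DOWN
# for maintenance ("No IP for server"), instead of keeping them pinned
# to whatever IP the slot gets assigned next
server-template pod 5 pod.namespace.svc.cluster.local:8080 check resolvers kubernetes inter 500 on-marked-down shutdown-sessions

We’re not sure this covers the case where a slot simply changes its IP without ever being marked down, though, as in the pod2/pod3/pod4 lines above.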

I understand that this is probably also fairly specific to Kubernetes, but any helpful pointers on what’s going on here are appreciated.

Thanks a ton,
Thomas