Stick-table fails when balancing MQTT traffic. Any suggestions or debugging hints?

I’m running the latest haproxytech/haproxy-ubuntu Docker image:

HAProxy version 2.5.1-86b093a 2022/01/11 - https://haproxy.org/
Running on: Linux 5.4.0-91-generic #102-Ubuntu SMP Fri Nov 5 16:31:28 UTC 2021 x86_64

My goal is to split MQTT connections across 3 backend servers while letting the clients stick to one of them, using the MQTT username as an ID. So far I have managed to understand the concepts (except maybe sharing a server template between two backends, but that's not the issue now) and to come up with the following config:

    global
        log stdout format raw daemon debug
        stats socket :9999 level admin expose-fd listeners

    defaults
        log global
        mode tcp
        option tcplog
        maxconn 1024000
        timeout connect 30000
        timeout client 600s
        timeout server 600s

    frontend mqtt_tcp
        mode tcp
        option tcplog
        bind *:1883
        bind *:8883 ssl crt /etc/haproxy-ssl/my.domain.com.pem
        default_backend mqtt_tcp_back

    frontend mqtt_ws
        mode tcp
        option tcplog
        bind *:8083 ssl crt /etc/haproxy-ssl/my.domain.com.pem
        default_backend mqtt_ws_back

    resolvers cluster_dns
        nameserver dns1 10.96.0.10:53
        accepted_payload_size 8192 # allow larger DNS payloads

    backend mqtt_usernames
        stick-table type string len 32 size 100k expire 30m

    backend mqtt_tcp_back
        mode tcp
        stick on req.payload(0,0),mqtt_field_value(connect,username) table mqtt_usernames

        server-template mqtt_tcp 10 mqtt-daemon-pods.default.svc.cluster.local:1883 check resolvers cluster_dns init-addr none

    backend mqtt_ws_back
        mode tcp
        stick on req.payload(0,0),mqtt_field_value(connect,username) table mqtt_usernames

        server-template mqtt_tcp 10 mqtt-daemon-pods.default.svc.cluster.local:9001 check resolvers cluster_dns init-addr none

Everything is more or less fine. I can see the table being populated with the correct keys:

# table: mqtt_usernames, type: string, size:102400, used:1
0x56047667ac50: key=gateway use=0 exp=1518633 server_id=1 server_key=mqtt_tcp1

But not all of the connections stick to mqtt_tcp1. When I run

mosquitto_sub -h 192.168.0.80 -u gateway -P qwerty -t test
for run in {1..50}; do mosquitto_pub -h 192.168.0.80 -u gateway -P qwerty -t test -m "hello $run"; done

I see 3-5 messages missing, like:

...
hello 42
hello 43
hello 45
hello 47
hello 48
hello 49
hello 50

and the log

10.244.0.1:47865 [04/Feb/2022:10:50:04.908] mqtt_tcp mqtt_tcp_back/mqtt_tcp1 1/0/0 4 -- 2/2/1/1/0 0/0
10.244.0.1:64431 [04/Feb/2022:10:50:04.910] mqtt_tcp mqtt_tcp_back/mqtt_tcp1 1/0/0 4 -- 2/2/1/1/0 0/0
10.244.0.1:7948 [04/Feb/2022:10:50:04.914] mqtt_tcp mqtt_tcp_back/mqtt_tcp2 1/0/0 4 -- 2/2/1/0/0 0/0
10.244.0.1:27261 [04/Feb/2022:10:50:04.916] mqtt_tcp mqtt_tcp_back/mqtt_tcp1 1/0/0 4 -- 2/2/1/1/0 0/0
10.244.0.1:41658 [04/Feb/2022:10:50:04.919] mqtt_tcp mqtt_tcp_back/mqtt_tcp3 1/0/0 4 -- 2/2/1/0/0 0/0
10.244.0.1:38141 [04/Feb/2022:10:50:04.921] mqtt_tcp mqtt_tcp_back/mqtt_tcp1 1/0/0 4 -- 2/2/1/1/0 0/0
10.244.0.1:62642 [04/Feb/2022:10:50:04.924] mqtt_tcp mqtt_tcp_back/mqtt_tcp1 1/0/0 4 -- 2/2/1/1/0 0/0
10.244.0.1:23143 [04/Feb/2022:10:50:04.926] mqtt_tcp mqtt_tcp_back/mqtt_tcp1 1/0/0 4 -- 2/2/1/1/0 0/0
10.244.0.1:3238 [04/Feb/2022:10:50:04.929] mqtt_tcp mqtt_tcp_back/mqtt_tcp1 1/0/0 4 -- 2/2/1/1/0 0/0

Am I missing something simple? Can some logging be added to see why the respective servers are chosen? I already tried removing the WS backend and the TLS termination, leaving only port 1883, but it did not help. The closest problem I've seen so far is this, but it didn't attract much enthusiasm.
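One thing I might try (a completely untested sketch on my side, and `sess.mqtt_user` is just a name I made up) is stashing whatever the stick expression actually sees into a variable and appending it to the log line:

    frontend mqtt_tcp
        mode tcp
        bind *:1883
        # untested debug aid: capture the value the stick expression would see at inspection time
        tcp-request content set-var(sess.mqtt_user) req.payload(0,0),mqtt_field_value(connect,username)
        # standard TCP log line plus the captured value (replaces "option tcplog")
        log-format "%ci:%cp [%t] %ft %b/%s %Tw/%Tc/%Tt %B %ts %ac/%fc/%bc/%sc/%rc %sq/%bq mqtt_user=%[var(sess.mqtt_user)]"
        default_backend mqtt_tcp_back

If the username showed up empty for exactly the connections that land on the wrong server, that would point at the payload not being available at the moment the rule runs.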

There is a hint in the docs that sticky rules don't work properly with nbproc > 1. I tried setting it explicitly, but got an error:

[ALERT]    (1) : config : parsing [/usr/local/etc/haproxy/haproxy.cfg:2] : nbproc is not supported any more since HAProxy 2.5. Threads will automatically be used on multi-processor machines if available.

so it should be fine in this regard. Nevertheless, there are two haproxy processes running in my Docker container:

USER         PID %CPU %MEM    VSZ   RSS TTY      STAT START   TIME COMMAND
root           1  0.0  0.1 192268 80556 ?        Ss   Feb04   0:00 haproxy -sf 8 -W -db -f /usr/local/etc/haproxy/haproxy.cfg
root           8  0.1  0.1 975100 72100 ?        Sl   Feb04   5:58 haproxy -W -db -f /usr/local/etc/haproxy/haproxy.cfg

I don't understand the reason for this so far. Is that OK, or should I dig deeper?

OK, it's not about the processes. At least, all the log entries are produced by PID=8.

Judging by my debug prints, it sometimes fails at src/sample.c:1125

if (!conv_expr->conv->process(conv_expr->arg_p, p, conv_expr->conv->private))

when processing ‘mqtt_field_value’.

Now I’m trying to figure out how to debug that one.

…and sample_conv_mqtt_field_value, which is the ‘process’ callback in this case, works fine given its input. The problem comes from smp_fetch_payload:

	if (smp->strm) {
		/* meaningless for HTX buffers */
		if (IS_HTX_STRM(smp->strm))
			return 0;
		chn = ((smp->opt & SMP_OPT_DIR) == SMP_OPT_DIR_RES) ? &smp->strm->res : &smp->strm->req;
		head = ci_head(chn);
		data = ci_data(chn);
	}

If chn->flags == 0x908002, I get my 54 bytes in data and everything is fine. If it’s 0x908000, I get 0 bytes, and the rest of the sticking functionality fails.

#define CF_READ_PARTIAL   0x00000002  /* some data were read from producer or a read exception occurred */
#define CF_AUTO_CLOSE     0x00008000  /* producer can forward shutdown to other side */
#define CF_STREAMER       0x00010000  /* the producer is identified as streaming data */
#define CF_ANA_TIMEOUT    0x00080000  /* the analyser timeout has expired */

The difference is in CF_READ_PARTIAL. I have no idea what's happening, but from the shape of it, I guess the stick expression is evaluated at the wrong moment. Eventually the data is read, but the connection ends up on a random server.

This is getting a bit too deep for me, and it's pretty quiet here. Is there any point in filing a bug report? And what's the proper way to do it?

Solved. I added

tcp-request inspect-delay 10s
acl content_present req_len gt 0
tcp-request content accept if content_present

to my frontend config. Maybe not the best solution, but it works for me so far. I have yet to test it under bigger loads.
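For completeness, this is roughly how those lines sit in the frontend now (the 10s delay is an arbitrary value I picked):

    frontend mqtt_tcp
        mode tcp
        option tcplog
        bind *:1883
        bind *:8883 ssl crt /etc/haproxy-ssl/my.domain.com.pem
        # hold the connection in the content-inspection phase until some request data has arrived,
        # so the stick expression in the backend no longer runs against an empty buffer
        tcp-request inspect-delay 10s
        acl content_present req_len gt 0
        tcp-request content accept if content_present
        default_backend mqtt_tcp_back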

The fact that you see two processes is expected: one is the master process (handling reloads, etc.) and the other is the actual worker. That is correct, and has nothing to do with the nbproc > 1 setting, which has been fully removed in this release.
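In case it helps, the -W flag in your ps output is what enables master-worker mode; the same thing can be expressed in the config file (just for illustration, you don't need to add it):

    global
        # equivalent of starting haproxy with -W: run a master process plus a worker
        master-worker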

Matching on the payload always requires waiting for the buffer to be filled. I'm sorry you had to dig through the source code to find that out. For MQTT, I suggest you take a look at this:

http://cbonte.github.io/haproxy-dconv/2.4/configuration.html#7.3.1-mqtt_is_valid

tcp-request inspect-delay 10s
tcp-request content reject unless { req.payload(0,0),mqtt_is_valid }

I’m not sure req_len gt 0 is a good idea, because it will fail when the first buffer doesn’t contain the full MQTT message.

Actually, in the example just before that one:

http://cbonte.github.io/haproxy-dconv/2.4/configuration.html#7.3.1-mqtt_field_value

req.len ge 4

is used to wait for the MQTT message.
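Adapted to the frontend from this thread, the variant I would try looks roughly like this (only a sketch, reusing your 10s delay):

    frontend mqtt_tcp
        mode tcp
        option tcplog
        bind *:1883
        # hold the connection until a complete, valid MQTT packet has been buffered
        # (or the inspect-delay expires), then let the backend stick rule run
        tcp-request inspect-delay 10s
        tcp-request content reject unless { req.payload(0,0),mqtt_is_valid }
        default_backend mqtt_tcp_back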

Thank you. I missed it somehow. On the bright side, I now understand the process a lot better.