Hey guys, I’ve run into an interesting problem as I’ve started scaling up my smallish video streaming CDN. It seems that haproxy is causing my TCP stack to hit a bottleneck, resulting in incredibly poor single-stream TCP performance across the entire Linux stack, not just haproxy. Because I don’t wish to spam iperf3 tests or logs, I’m going to upload these to pastebin and link them. Let’s start with what my sysctls look like:
Some highlights here: max open files set way up, using bbr congestion control, somaxconn cranked up, larger tcp send/receive buffers, and increased tw bucket size. This sysctl is huge, but basically this is the result of me desperately turning knobs and dials. I experience this same behavior under stock sysctls. I have additionally tried increasing txqueuelen and the ring buffers on the ethernet card, tried turning off some of the hardware offloading that was happening, pretty much anything I could think of.
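For anyone who wants to compare against their own box, here is a rough sketch of how the settings mentioned above could be dumped from /proc/sys. The sysctl names are the standard Linux ones I'm assuming correspond to those highlights, and the helper names (`sysctl_path`, `read_sysctl`) are just for illustration.

```python
from pathlib import Path

# Standard Linux sysctl names, assumed to match the highlights above.
SYSCTLS = [
    "fs.file-max",                      # max open files
    "net.ipv4.tcp_congestion_control",  # e.g. bbr
    "net.core.somaxconn",               # listen backlog cap
    "net.ipv4.tcp_rmem",                # min/default/max receive buffer
    "net.ipv4.tcp_wmem",                # min/default/max send buffer
    "net.ipv4.tcp_max_tw_buckets",      # TIME_WAIT bucket limit
]


def sysctl_path(name, root="/proc/sys"):
    """Map a dotted sysctl name to its file under /proc/sys."""
    return f"{root}/{name.replace('.', '/')}"


def read_sysctl(name, root="/proc/sys"):
    """Return the current value as a string, or None if it can't be read."""
    try:
        return Path(sysctl_path(name, root)).read_text().strip()
    except OSError:
        return None


if __name__ == "__main__":
    for name in SYSCTLS:
        print(f"{name} = {read_sysctl(name)}")
```

Running it before and after applying a sysctl file makes it easy to confirm which values actually took effect.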
Here is what an iperf3 test SHOULD look like on this machine. This one is from Chicago to LA:
The retries are normal for bbr, not worried about those
Here is what an iperf3 test looks like while haproxy is running and pushing ~600mbit of traffic over the same 10gig connection. Yes, I have rerun this test many, many times and get similar results every time:
Now here’s what’s interesting: look what happens when you increase the number of parallel streams. I’ve only included the summary page, but this is also between Chicago and LA. The bandwidth is there, my individual tcp streams are just not speeding up and I don’t know why:
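As a sanity check on the buffer sizes, here is a minimal sketch of the bandwidth-delay math for a single stream. The 50 ms round-trip time is a placeholder I'm assuming for Chicago to LA, not a measured value, and the function names are made up for this example.

```python
def bdp_bytes(bandwidth_bps, rtt_seconds):
    """Bytes that must be in flight to keep a link of this speed full."""
    return bandwidth_bps * rtt_seconds / 8


def max_stream_bps(window_bytes, rtt_seconds):
    """Upper bound on one stream's throughput for a given window size."""
    return window_bytes * 8 / rtt_seconds


if __name__ == "__main__":
    rtt = 0.050   # ASSUMED round-trip time in seconds, replace with a real ping
    link = 10e9   # the 10gig connection, in bits per second

    needed = bdp_bytes(link, rtt)
    print(f"window needed to fill the link: {needed / 1e6:.1f} MB")

    # What a single stream could reach if its window were stuck at 4 MiB
    # (a hypothetical figure, only to show the shape of the limit).
    stuck = max_stream_bps(4 * 1024 * 1024, rtt)
    print(f"ceiling with a 4 MiB window: {stuck / 1e6:.0f} Mbit/s")
```

If the per-stream rate tracks a window ceiling like this, many parallel streams would still add up to full bandwidth, which is at least consistent with what the parallel test shows. That is only one possible explanation, not a diagnosis.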
Here is my haproxy -vv output:
And here is part of my haproxy config:
Please note that I have about 1000 such entries. To speed up the initial load and subsequent reloads, I’m caching DNS with a local bind9 server so that it doesn’t take forever to start and resolve. Additionally, I’m only running about 2000 connections and 1600 sockets. Since it’s a video streaming CDN, there aren’t a ton of connections, mostly just raw throughput for high bitrate videos.
Some additional steps I have tried: completely disabling healthchecks, dramatically increasing the conntrack size, and disabling conntrack completely. I have also considered flying out to Chicago and hitting the server with a hammer.
Any help would be greatly appreciated. I’m a little bit at my wits’ end here and have no idea what my next steps should be.