HAProxy community

Architectural limitation for nbproc?

Hi all,

I’m migrating HAProxy to nodes with 2x20 core Xeons, for a total of 80 vcpus seen by the OS. I want to use cpu-map to pin processes to vcpus for performance reasons, and have two questions:

  1. What are the specific reasons that MAX_PROCS is limited to 64? Can someone elaborate on whether this is an architectural limitation?

  2. Can you suggest configuration options for performance that take into account the NUMA grouping (0-19,40-59 to node 0 and 20-39,60-79 to node 1) and could allow us to use all 80 vcpus?

I included some /proc/cpuinfo output below for clarity.

These systems currently receive hundreds of millions of tcp requests a day, with an average number of active connections somewhere between 60k and 80k.

Thanks in advance!

processor	: 79
vendor_id	: GenuineIntel
cpu family	: 6
model		: 85
model name	: Intel(R) Xeon(R) Gold 6230 CPU @ 2.10GHz
stepping	: 7
microcode	: 0x500002c
cpu MHz		: 1834.253
cache size	: 28160 KB

CCing @willy

The limitation to 64 threads or processes is indeed an architectural limitation. It stems from the fact that a 64-bit CPU can only perform atomic operations on 64-bit masks, which limits the number of threads or processes that may be eligible for a task, file descriptor, etc. On 32-bit processors, the limit is accordingly 32 threads or processes.
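
To make the mask argument concrete, here is a minimal Python sketch of one bit per thread in a single 64-bit word. It is only an illustration of the idea, not HAProxy's actual code; the names `WORD_BITS`, `thread_bit` and `eligible_mask` are made up for this example. Python integers are unbounded, so the 64-bit width is enforced by hand.

```python
from typing import Iterable

# Assumed width of the machine word that holds the mask (64-bit CPU).
WORD_BITS = 64


def thread_bit(tid: int) -> int:
    """Return the single-bit mask that stands for thread `tid`."""
    if not 0 <= tid < WORD_BITS:
        # A 65th thread has no bit left in a 64-bit word.
        raise ValueError(f"thread {tid} does not fit in a {WORD_BITS}-bit mask")
    return 1 << tid


def eligible_mask(tids: Iterable[int]) -> int:
    """Combine the bits of all threads eligible for a task or fd."""
    mask = 0
    for tid in tids:
        mask |= thread_bit(tid)
    return mask
```

With this layout, thread IDs 0 through 63 each get a bit, and asking for thread 64 raises an error because the word is full.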

What ought to be improved is the ability for haproxy to set the CPU affinity to CPU numbers higher than 63, because in your case you could be limited in your ability to define an optimal mask for cpu-map. This one is mostly OS-dependent though, and improving it should not be that big of a deal. In the short term you can still use taskset and not have to worry about cpu-map.
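
As a rough sketch of the taskset workaround, the Python below expands the NUMA CPU lists quoted in the question and builds a `taskset -c` command line for one node. The helper names and the config path `/etc/haproxy/haproxy.cfg` are assumptions for illustration; nothing here is run, the command is only assembled as a list.

```python
from typing import List

# CPU lists per NUMA node, as given in the original question.
NODE0 = "0-19,40-59"
NODE1 = "20-39,60-79"


def parse_cpu_list(spec: str) -> List[int]:
    """Expand a list such as '0-19,40-59' into sorted CPU numbers."""
    cpus = set()
    for part in spec.split(","):
        part = part.strip()
        if not part:
            continue
        if "-" in part:
            lo, hi = part.split("-", 1)
            cpus.update(range(int(lo), int(hi) + 1))
        else:
            cpus.add(int(part))
    return sorted(cpus)


def taskset_command(cpu_spec: str, command: List[str]) -> List[str]:
    """Build (not execute) a taskset invocation pinning `command`."""
    return ["taskset", "-c", cpu_spec, *command]


# Hypothetical config path; adjust to the local installation.
HAPROXY = ["haproxy", "-f", "/etc/haproxy/haproxy.cfg"]

if __name__ == "__main__":
    print(" ".join(taskset_command(NODE0, HAPROXY)))
```

In this particular layout every CPU of node 0 is numbered below 64, while node 1 includes CPUs 64 to 79, which are the ones that fall outside a 64-bit mask.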

A side note, however, on your machine. This is by far the worst kind of machine to run a workload like haproxy on. Low-frequency cores mean high latency on many operations. And worst of all are the two sockets. You must never ever run a network-like workload on such a machine, because communication between the two physical CPUs can only go through the QPI, which is extremely slow and limited compared to an internal L3 cache. You can very easily end up with a machine that’s 2-5 times SLOWER than one using a single socket by doing so. I strongly encourage you to use only the CPUs from the socket the network card is attached to (so that network traffic doesn’t have to fly over the QPI bus either). And if possible, use numactl to let the operating system know that you’d rather use the DRAM physically attached to the same socket instead of reaching it through the other one.
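
One possible way to follow this advice on Linux is sketched below in Python: read the NUMA node of the NIC from sysfs, then build a `numactl` command that binds both CPUs and memory to that node. The function names, the interface name `eth0` and the config path are placeholders chosen for this example; the sysfs file can report -1 when the platform does not expose the node, which is treated here as unknown.

```python
from pathlib import Path
from typing import List, Optional


def nic_numa_node(iface: str, sysfs_root: str = "/sys") -> Optional[int]:
    """Return the NUMA node the NIC is attached to, or None if unknown."""
    path = Path(sysfs_root) / "class" / "net" / iface / "device" / "numa_node"
    try:
        node = int(path.read_text().strip())
    except (OSError, ValueError):
        return None  # no such interface, or unreadable content
    return node if node >= 0 else None  # -1 means "not reported"


def numactl_command(node: int, command: List[str]) -> List[str]:
    """Build (not execute) a numactl call bound to one node's CPUs and RAM."""
    return ["numactl", f"--cpunodebind={node}", f"--membind={node}", *command]


if __name__ == "__main__":
    node = nic_numa_node("eth0")  # hypothetical interface name
    if node is None:
        print("NUMA node of the NIC is not reported on this system")
    else:
        cmd = numactl_command(node, ["haproxy", "-f", "/etc/haproxy/haproxy.cfg"])
        print(" ".join(cmd))
```

Binding memory as well as CPUs is the part that keeps allocations on the DRAM attached to the same socket.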

Multi-socket machines are nice for storage systems, hosting many VMs or huge Java applications. They are terribly slow for low-latency workloads. You could save many dollars and watts by using fewer cores at a higher frequency, and end up with more performance. Keep this in mind for the next time you have an opportunity to recycle that machine for another use case!

Here’s a nice presentation from Netflix on the NUMA optimizations of their FreeBSD-based stack. While achieving 200Gbit/s is hardly relevant for the use case here, it does explain the issues around different NUMA nodes quite well:

@willy thanks so much for the perfectly concise and clear explanation! I spent a good amount of time looking at the code this weekend, and gleaned a fair amount of what you just summarized from it, but now it’s totally clear.

Regarding the hardware…it’s the configuration of our standard general-purpose compute nodes (we have > 100 of them per rack) and the goal is to use homogeneous hardware as much as is feasible. Currently, we set aside two of them per rack to use for haproxy to load balance traffic to all of the other nodes. Unfortunately, to make things worse, they have a single 25Gbit NIC…so doing something like running two virtualized systems (one on each physical cpu), or two independently configured sets of haproxy processes, is going to run into the same problem.

@lukastribus thanks for the link, I’ll check it out!

Then just identify which socket your single NIC is connected to and make
sure you bind haproxy and the network interrupts to the CPUs from that
socket exclusively. It just means the other CPU cores will be usable
for whatever else you may have to run on this machine. But quite frankly,
if you use this large hardware as the standard server, you’re already
expecting to leave many cores unused so that’s not a problem. And don’t
forget numactl to try to optimize memory allocation to stick to the node
you’re running on.
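
For the interrupt side, a hedged Python sketch follows. It assumes the NIC's interrupt lines in `/proc/interrupts` carry the interface name (common, but driver-dependent), and it only lists which `/proc/irq/<n>/smp_affinity_list` files would be written with which CPU list. Actually writing them needs root, and a daemon such as irqbalance, if running, may change the values again. The function names are invented for this example.

```python
from typing import List, Tuple


def nic_irqs(interrupts_text: str, iface: str) -> List[int]:
    """Pick IRQ numbers whose /proc/interrupts line mentions `iface`."""
    irqs = []
    for line in interrupts_text.splitlines():
        head, sep, rest = line.partition(":")
        if not sep:
            continue  # header line with CPU names has no colon
        head = head.strip()
        # Numeric labels are real IRQs; NMI, LOC, etc. are skipped.
        if head.isdigit() and iface in rest:
            irqs.append(int(head))
    return irqs


def affinity_writes(irqs: List[int], cpu_spec: str) -> List[Tuple[str, str]]:
    """Return (path, value) pairs to write; nothing is written here."""
    return [(f"/proc/irq/{irq}/smp_affinity_list", cpu_spec) for irq in irqs]


if __name__ == "__main__":
    with open("/proc/interrupts") as f:
        text = f.read()
    # "eth0" and the CPU list are placeholders for the NIC and its socket.
    for path, value in affinity_writes(nic_irqs(text, "eth0"), "0-19,40-59"):
        print(f"would write {value!r} to {path}")
```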