HAProxy community

Architectural limitation for nbproc?

Hi all,

I’m migrating HAProxy to nodes with 2x20-core Xeons, which with hyper-threading gives 80 vCPUs as seen by the OS. I want to use cpu-map to pin processes to vCPUs for performance reasons, and I have two questions:

  1. What are the specific reasons that MAX_PROCS is limited to 64? Can someone elaborate on whether this is an architectural limitation?

  2. Can you suggest configuration options for performance that take into account the NUMA grouping (0-19,40-59 on node 0 and 20-39,60-79 on node 1) and would allow us to use all 80 vCPUs?
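For reference, this is the kind of pinning I have in mind (just a sketch with nbproc 4 to keep it short; the real config would cover more processes):

```haproxy
global
    nbproc 4
    # pin each process to one physical core on NUMA node 0 (vcpus 0-19)
    cpu-map 1 0
    cpu-map 2 1
    cpu-map 3 2
    cpu-map 4 3
```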

I included some /proc/cpuinfo output below for clarity.

These systems currently receive hundreds of millions of TCP requests a day, with an average number of active connections somewhere between 60k and 80k.

Thanks in advance!

processor	: 79
vendor_id	: GenuineIntel
cpu family	: 6
model		: 85
model name	: Intel(R) Xeon(R) Gold 6230 CPU @ 2.10GHz
stepping	: 7
microcode	: 0x500002c
cpu MHz		: 1834.253
cache size	: 28160 KB

CCing @willy

The limitation to 64 threads or processes is indeed architectural: it stems from the fact that a 64-bit CPU can only perform atomic operations on 64-bit masks, which limits the number of threads or processes that may be marked eligible for a task, a file descriptor, etc. On 32-bit processors, the limit is likewise 32 threads or processes.
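A quick way to see the constraint (just an illustration of the word size, not haproxy code): with one bit per thread, a 64-bit word holds exactly bits 0 through 63, so 64 is the most a single atomic mask can track.

```shell
# one bit per thread: bit 63 is the highest bit a 64-bit word can hold
printf 'thread  0 mask: %016x\n' $(( 1 << 0  ))
printf 'thread 63 mask: %016x\n' $(( 1 << 63 ))
# there is no bit 64 in the word -- an 80-thread mask would need two
# words, and two words cannot be updated by one atomic instruction
```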

What ought to be improved is the ability for haproxy to set the CPU affinity to CPU numbers higher than 63, because in your case you could be limited in your ability to define an optimal mask for cpu-map. This is mostly OS-dependent though, and improving it should not be that big of a deal. In the short term you can still use taskset and not have to worry about cpu-map.
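As a sketch of that short-term workaround (the config path is a placeholder; adjust the CPU list to your own topology):

```shell
# restrict haproxy (and everything it forks) to the socket-0 CPUs only:
# 0-19 are the physical cores of socket 0, 40-59 their hyper-threads
taskset -c 0-19,40-59 haproxy -f /etc/haproxy/haproxy.cfg
```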

A side note however on your machine: this is just about the worst possible hardware to run a workload like haproxy. Low-frequency cores will mean high latency on many operations, and the worst of all are the two sockets. You must never ever run a network-like workload on such a machine, because all communication between the two physical CPUs has to go through the QPI link, which is extremely slow and limited compared to an internal L3 cache. You can very easily end up with a machine that’s 2-5 times SLOWER than one using a single socket by doing so.

I strongly encourage you to use only the CPUs of the socket the network card is attached to (so that network traffic doesn’t have to fly over the QPI bus either). And if possible, use numactl to let the operating system know that you’d rather use the DRAM physically attached to that socket instead of reaching memory through the other one.
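Something along these lines (again a sketch with a placeholder config path, assuming the NIC sits on socket 0):

```shell
# bind both CPU placement and memory allocation to NUMA node 0,
# the node the network card is attached to
numactl --cpunodebind=0 --membind=0 haproxy -f /etc/haproxy/haproxy.cfg
```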

Multi-socket machines are nice for storage systems, hosting many VMs, or huge Java applications. They are terribly slow for low-latency workloads. You could save many dollars and watts by using fewer cores at a higher frequency, and end up with more performance. Keep this in mind for the next time you have an opportunity to recycle that machine for another use case!

Here’s a nice presentation from Netflix on the NUMA optimizations of their stack, based on FreeBSD. While achieving 200 Gbit/s is hardly relevant to the use case here, it explains the issues around crossing NUMA nodes quite well: