The limitation to 64 threads or processes is indeed an architectural one: it stems from the fact that a 64-bit CPU can only perform atomic operations on 64-bit masks, which limits the number of threads or processes that may be eligible for a given task, file descriptor, etc. On 32-bit processors, this limit is indeed 32 threads or processes.
What could be improved is haproxy's ability to set CPU affinity for CPU numbers higher than 63, because in your case you could be limited in your ability to define an optimal mask for cpu-map. This part is mostly OS-dependent, though, and improving it should not be that big of a deal. In the short term you can still use taskset and not have to worry about cpu-map.
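As a sketch of that workaround, something like the following should work today. The CPU range and config path below are purely illustrative; adjust them to your own topology:

```shell
# Hypothetical example: start haproxy bound to CPUs 64-95, a range that
# cpu-map cannot currently express, letting the OS place the threads.
taskset -c 64-95 haproxy -f /etc/haproxy/haproxy.cfg

# Or re-pin an already running instance by PID:
taskset -cp 64-95 "$(pidof haproxy)"
```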
A side note, however, on your machine: it is just about the worst possible one to run a workload like haproxy on. Low-frequency cores will mean high latency on many operations. And worst of all are the two sockets. You should never run a network-heavy workload on such a machine, because all communication between the two physical CPUs has to go through the QPI link, which is extremely slow and limited compared to an internal L3 cache. You can very easily end up with a machine that's 2-5 times SLOWER than one using a single socket. I strongly encourage you to use only the CPUs from the socket the network card is attached to (so that network traffic doesn't have to fly over the QPI bus either). And if possible, use numactl to tell the operating system that you'd rather use the DRAM physically attached to the same socket instead of reaching it through the other one.
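For illustration, assuming the NIC is eth0 and turns out to sit on node 0 (the interface name and node number are hypothetical; check your own machine), the approach above could look like this:

```shell
# Find which NUMA node the NIC is attached to (-1 means no NUMA info)
cat /sys/class/net/eth0/device/numa_node

# Show which CPUs and memory belong to each node
numactl --hardware

# Run haproxy only on node 0's CPUs, allocating only from node 0's DRAM
numactl --cpunodebind=0 --membind=0 haproxy -f /etc/haproxy/haproxy.cfg
```

With --membind, allocations fail rather than spill to the remote node, which is usually what you want here: it guarantees no memory access ever crosses the QPI link.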
Multi-socket machines are nice for storage systems, hosting many VMs, or huge Java applications. They are terribly slow for low-latency workloads. You could save many dollars and watts by using fewer cores at a higher frequency, and end up with more performance. Keep this in mind for the next time you have an opportunity to recycle that machine for another use case!