Due to a historical accident (related to the absence of the resolvers server option on older versions of haproxy), we currently reload the config, which restarts the haproxy worker process, approximately every minute. I’m looking at making better use of resolvers so that we only reload when the configuration actually changes, but I’ve run into a roadblock.
If haproxy runs for more than a few minutes without being restarted, its CPU usage climbs steadily until it is using 100% of the available CPU. Unfortunately, I haven’t been able to reproduce this in non-production environments, and for obvious reasons I want to avoid this situation in production.
I have encountered an issue like this in the past, and the perf results showed that most of the CPU time was spent on operations on an LRU cache. I think it is the pattern LRU cache.
We do have regex ACLs that match URLs containing a random ID, and I’m wondering whether that causes the cache to fill up, so that we spend a lot of CPU adding and removing cache entries.
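To illustrate the access pattern I suspect, here is a small Python simulation. It is not haproxy code: the `simulate` helper, the URL shapes, and the capacity of 1000 are all made up for the example, and a sequential counter stands in for the random ID. It only shows that a fixed-size LRU cache stops being useful once nearly every key is unique.

```python
from collections import OrderedDict


def simulate(urls, capacity):
    """Feed URLs through a fixed-size LRU cache; return (hits, misses, evictions)."""
    cache = OrderedDict()  # ordered from least to most recently used
    hits = misses = evictions = 0
    for url in urls:
        if url in cache:
            cache.move_to_end(url)  # mark as most recently used
            hits += 1
            continue
        misses += 1
        cache[url] = True
        if len(cache) > capacity:
            cache.popitem(last=False)  # drop the least recently used entry
            evictions += 1
    return hits, misses, evictions


# Hypothetical traffic: 100 distinct URLs requested over and over.
stable_urls = [f"/api/items/{i % 100}" for i in range(50_000)]
# Hypothetical traffic: every URL carries a unique ID (stand-in for a random ID).
unique_urls = [f"/api/items/{i:08x}" for i in range(50_000)]

print("stable:", simulate(stable_urls, 1000))
print("unique:", simulate(unique_urls, 1000))
```

With the stable URLs the cache warms up after 100 misses and never evicts anything. With the unique IDs every lookup is a miss, and once the cache is full every miss also evicts an entry, which is the churn I am worried about.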
Any suggestions on how to reduce CPU usage when the process runs for a long time? Having to restart the process every minute seems like a really hacky solution.
And is this something I should file a bug report for?
I had another thought on this. Assuming my hypothesis that this is related to the LRU cache is correct (I haven’t had a chance to fully verify that yet), I think what happens is that the cache fills up, and as new requests come in with cache misses, entries in the LRU cache are rotated out, and those rotations force the binary search tree behind the cache to be rebalanced quite frequently. That makes me wonder whether a hash table (implemented using an array instead of a tree) would be more efficient for the LRU cache, since it doesn’t require rebalancing. And since the LRU cache has a fixed size, the backing array wouldn’t need to be resized either. I’m not familiar enough with the code to know how feasible such a switch would be, or whether there are major downsides. I also don’t have any hard data showing that such a fix would actually solve the problem.
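To make the idea concrete, here is a rough Python sketch of the kind of structure I have in mind: a chained hash table and a recency list that both live in arrays allocated once, so a miss on a full cache just reuses the least recently used slot. This is not how haproxy implements its cache; `ArrayLRU` and all of its fields are hypothetical names, Python's built-in `hash` stands in for whatever hash function would really be used, and the sketch says nothing about whether this would be faster in practice.

```python
class ArrayLRU:
    """Fixed-capacity LRU set backed only by preallocated arrays (no tree)."""

    NIL = -1

    def __init__(self, capacity):
        self.capacity = capacity
        self.nbuckets = 2 * capacity
        self.buckets = [self.NIL] * self.nbuckets  # first slot of each hash chain
        self.keys = [None] * capacity
        self.chain_next = [self.NIL] * capacity    # next slot in the same bucket
        self.lru_prev = [self.NIL] * capacity      # neighbour towards the MRU end
        self.lru_next = [self.NIL] * capacity      # neighbour towards the LRU end
        self.head = self.NIL                       # most recently used slot
        self.tail = self.NIL                       # least recently used slot
        self.used = 0

    def _bucket(self, key):
        return hash(key) % self.nbuckets

    def _find(self, key):
        slot = self.buckets[self._bucket(key)]
        while slot != self.NIL and self.keys[slot] != key:
            slot = self.chain_next[slot]
        return slot

    def _unlink_lru(self, slot):
        prev, nxt = self.lru_prev[slot], self.lru_next[slot]
        if prev == self.NIL:
            self.head = nxt
        else:
            self.lru_next[prev] = nxt
        if nxt == self.NIL:
            self.tail = prev
        else:
            self.lru_prev[nxt] = prev

    def _push_front(self, slot):
        self.lru_prev[slot] = self.NIL
        self.lru_next[slot] = self.head
        if self.head != self.NIL:
            self.lru_prev[self.head] = slot
        self.head = slot
        if self.tail == self.NIL:
            self.tail = slot

    def _unchain(self, slot):
        bucket = self._bucket(self.keys[slot])
        cur = self.buckets[bucket]
        if cur == slot:
            self.buckets[bucket] = self.chain_next[slot]
            return
        while self.chain_next[cur] != slot:
            cur = self.chain_next[cur]
        self.chain_next[cur] = self.chain_next[slot]

    def access(self, key):
        """Return True on a hit; on a miss, insert the key and return False."""
        slot = self._find(key)
        if slot != self.NIL:
            self._unlink_lru(slot)
            self._push_front(slot)
            return True
        if self.used < self.capacity:
            slot = self.used          # still have a never-used slot
            self.used += 1
        else:
            slot = self.tail          # full: reuse the least recently used slot
            self._unlink_lru(slot)
            self._unchain(slot)
        self.keys[slot] = key
        bucket = self._bucket(key)
        self.chain_next[slot] = self.buckets[bucket]
        self.buckets[bucket] = slot
        self._push_front(slot)
        return False
```

For example, with `cache = ArrayLRU(2)`, accessing `"a"`, `"b"`, `"a"`, `"c"` misses, misses, hits, and then evicts `"b"` (the least recently used key) to make room for `"c"`. Every operation touches only a handful of array entries plus a short bucket chain, and nothing is ever resized or rebalanced, though chain lengths still depend on how well the hash spreads the keys.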