Since installing kernel 2.6.32-696.18.7.el6 (Spectre/Meltdown mitigation) on our CentOS 6 HAProxy box, the CPU usage has gone from around 20% to 100%.
The box is dedicated to running HAProxy (along with keepalived) and is handling a relatively low 300-400 conns/s at around 40 Mbps, load balancing a handful of websites with SSL termination at HAProxy. The CPU usage starts off fairly normal but within a couple of hours is just stuck at 100%.
Although not the latest and greatest, the box is a reasonably specced Dell PE1950 III, with 2 x Xeon X5460 quad-core processors at 3.16 GHz, 16 GB of RAM and Intel Gigabit NICs.
This had been running flawlessly for a year or so prior to installing 2.6.32-696.18.7.
Any other experiences / advice would be greatly appreciated; for now I’ll roll back the kernel.
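For reference, the frontends look roughly like this (names and certificate paths are anonymised placeholders):

```
# simplified sketch of one of the SSL-terminating frontends
frontend https-in
    bind :443 ssl crt /etc/haproxy/certs/example.pem
    default_backend web-servers
```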
Are you absolutely sure that the increased CPU load is not simply down to the initial SSL handshakes that need to be negotiated? When you downgraded the kernel, did the CPU not spike to 100% initially?
I’m not really sure how I would tell. Pre-upgrade there were occasional spikes in CPU, but generally it was fairly low (below 20%). Now, the HAProxy process is just stuck at 100%.
The majority of the CPU load is coming from user space.
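For what it’s worth, this is roughly how I’m looking at the user/system split per process (assuming the sysstat package is installed for pidstat, and a single haproxy process):

```
# sample haproxy's user vs system CPU every 5 seconds
pidstat -u -p $(pidof haproxy) 5
```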
No, it’s more like it starts low and then ramps up to 100% after a couple of hours.
I’ll try upgrading to 1.7.10 on our QA box first and then see about doing the same in production. I was sticking to 1.5 as that’s what was in the repository for CentOS.
I can attach to the process, but obviously the output will be huge and may contain sensitive data. Is there something I can do with it?
This could simply be a consequence of the patch you installed. Consider yourself lucky your system still boots! Hadn’t you read about the Meltdown/Spectre patches before installing them?
I did, and I did install it on our QA and Dev environments first (which are nearly identical) to make sure the system was still operational. This does seem to be load-related and, as you say, more than likely a consequence of the upgraded kernel.
Meltdown/Spectre patches don’t make CPU load increase from 20% to 100%. To 30% maybe, but not to 100%. Either the Red Hat 2.6.32 Meltdown backport is buggy, we are triggering a haproxy bug, or the load simply changed.
I don’t think so; high epoll usage is expected. Try the “noepoll” configuration directive to use a different code path, but epoll is the most efficient one for Linux.
If you see lower CPU usage with noepoll, that would be indicative of a haproxy bug.
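In case it helps, the directive just goes in the global section, something like:

```
global
    # force haproxy to fall back to poll() instead of epoll;
    # for testing only - epoll is normally the fastest poller on Linux
    noepoll
```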
I don’t know how long your strace ran, but you can clearly see the 100% CPU could not be coming from system time. You did, however, make 360K x 2 switches between user and kernel space, and it’s those switches that are penalized by the patches. YMMV, as they say.
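For anyone following along, a per-syscall summary like that is typically produced with something along these lines (the PID is a placeholder):

```
# attach to the running haproxy process, let it run for a while,
# then Ctrl-C to print the syscall count/time summary
strace -c -p <haproxy_pid>
```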
Doesn’t seem like an epoll issue within haproxy then (noepoll performance is worse). It’s possible that the Meltdown mitigation causes some issue in the kernel with polling overall, but really I don’t have an idea at this point.
If you could make that jump to 1.7.10, that would be really helpful to rule out bugs in that old 1.5.18 release.
If 1.7.10 has the same issue, then we will have to take a deeper look at that strace; not a summary but the actual output of “strace -tt -p”. If you are concerned about posting it publicly (which is understandable), you can send it directly to @willy (I’m probably not able to spot the issue anyway).
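Something like this should capture a usable trace without flooding the terminal (file name and duration are just examples):

```
# timestamped trace of every syscall, written to a file;
# stop with Ctrl-C after 30-60 seconds of the 100% CPU condition
strace -tt -p <haproxy_pid> -o /tmp/haproxy.strace
```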
I’m seeing a huge number of poll/epoll calls compared to recv/send, so this makes me think about a stuck event somewhere. We’ve had two recent fixes in 1.7 for half-open connections which could trigger such an issue; I don’t know if older versions are affected as well or not. One of them affects splicing and the other one client-fin. We backported the fix to 1.6, however.
Regardless of this, it seems strange that a kernel update would reveal an existing bug, and that reverting it makes the bug disappear. So it might well be that a kernel change has some effect (maybe just some timing change triggering the issue).
As Lukas suggested, trying 1.7.10 would help a lot. If the problem disappears, we can try to see what fix could be missing from 1.5.
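As a rough illustration, the poll/epoll versus recv/send ratio I mentioned can be pulled out of such a trace with something like this (file name assumed from the earlier example):

```
# count how often each syscall appears in the captured trace;
# a stuck event typically shows epoll_wait dwarfing recvfrom/sendto
awk '{print $2}' /tmp/haproxy.strace | cut -d'(' -f1 | sort | uniq -c | sort -rn | head
```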
Thanks. Is there anything I should be aware of before jumping from 1.5.18 to 1.7.10?
I’ve got it installed on our dev and QA environments now and, with the exception of a few minor config changes, all seems well. Just keen to get production up and running smoothly as well.
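For what it’s worth, before switching the binary over in each environment I’m just validating the existing config against the new version first (path shown is our usual default):

```
# check the current configuration parses cleanly under the new haproxy build
haproxy -c -f /etc/haproxy/haproxy.cfg
```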