HAProxy 1.5.18 CentOS 6 CPU Issue


#1

Since installing kernel 2.6.32-696.18.7.el6 (Spectre/Meltdown mitigation) on our CentOS 6 HAProxy box, CPU usage has gone from around 20% to 100%.

The box is dedicated to running HAProxy (along with keepalived) and is supporting a relatively low 300-400 conns/s at around 40 Mbps, load-balancing a handful of websites with SSL termination at HAProxy. CPU usage starts off fairly normal but within a couple of hours is just stuck at 100%.

Although not the latest and greatest, the box is a reasonably specced Dell PE1950 III, with 2 x Xeon X5460 quad-core processors at 3.16 GHz, 16 GB of RAM and Intel Gb NICs.

This had been running flawlessly for a year or so prior to installing 2.6.32-696.18.7.

Any other experiences / advice would be greatly appreciated; for now I’ll roll back the kernel.


#2

Are you absolutely sure that the increased CPU load is not just the initial SSL handshakes that need to be negotiated? When you downgraded the kernel, did the CPU not spike to 100% initially?

Does the CPU load come from user space or kernel space?
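For reference, one way to see the split is to compare the process’s user and kernel CPU counters directly; a minimal sketch, assuming a Linux /proc filesystem and a bash-like shell (the haproxy PID lookup is a placeholder):

```shell
# utime (user) and stime (kernel) are fields 14 and 15 of /proc/<pid>/stat,
# both in clock ticks; whichever grows faster is where the load comes from.
pid=$$   # substitute the haproxy PID here, e.g. pid=$(pidof haproxy)
set -- $(awk '{print $14, $15}' "/proc/$pid/stat")
echo "user ticks: $1  kernel ticks: $2"
```

Sampling the two values a few seconds apart shows which side is accumulating time; `top` reports the same split system-wide as %us vs %sy.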


#3

I’m not really sure how I would tell? Pre-upgrade there were occasional spikes in CPU, but generally it was fairly low (below 20%). Now the HAProxy process is just stuck at 100%.

The majority of the CPU load is coming from user space.

Thanks
James


#4

If it stays at 100% and does not decrease to normal values after some time (tens of minutes), then it’s not the initial SSL negotiation.

Are you able to upgrade haproxy to 1.7.10, to make sure this new situation isn’t just triggering a dormant bug in haproxy?

Can you provide the output of “strace -tt” on haproxy while in the 100% CPU load situation?


#5

No, it’s more like it starts low and then ramps up to 100% after a couple of hours.

I’ll try 1.7.10 on our QA box first and then see about upgrading production. I was sticking to 1.5 as that’s what is in the repository for CentOS.

I can attach to the process, but obviously the output will be huge and may contain sensitive data. Is there something I can do with it?


#6

I tried strace -cp:

strace -cp 4124
Process 4124 attached
^CProcess 4124 detached

% time     seconds  usecs/call     calls    errors syscall
------ ----------- ----------- --------- --------- ----------------
 41.27    0.006073           0    253351           epoll_wait
 14.82    0.002181           0     11840       206 write
 11.29    0.001662           0      9411      1944 sendto
  8.31    0.001223           0     29402      9043 read
  5.74    0.000845           0      3505      3452 connect
  5.72    0.000842           0     11992      3710 recvfrom
  3.59    0.000529           0      5780           close
  2.33    0.000343           0     10299           epoll_ctl
  1.86    0.000273           0      3424           socket
  1.30    0.000192           0      6682           setsockopt
  1.28    0.000189           0      4385      2020 accept4
  0.99    0.000145           0      2299         6 shutdown
  0.60    0.000089           0      1849           getsockname
  0.58    0.000085           0      2426           getsockopt
  0.31    0.000045           0      3424           fcntl
------ ----------- ----------- --------- --------- ----------------
100.00    0.014716                  360069     20381 total

Does this help?


#7

This could simply be a consequence of the patch you installed. Consider yourself lucky your system still boots! Didn’t you read about the Meltdown/Spectre patches before installing them?


#8

I did, and I did install it on our QA and Dev environments first (which are nearly identical) to make sure the system was still operational. This does seem to be load related and, as you say, more than likely a consequence of the upgraded kernel.

Thanks


#9

Meltdown/Spectre patches don’t make CPU load increase from 20% to 100%. To 30% maybe, but not to 100%. Either the Red Hat 2.6.32 Meltdown backport is buggy, we are triggering a haproxy bug, or the load simply changed.


#10

Thanks Lukas, I can confirm the load (from an external point of view) hasn’t changed.

I have seen other issues relating to HAProxy mentioning high epoll_wait usage, so I wonder if this is relevant?


#11

I don’t think so; high epoll usage is expected. Try the “noepoll” configuration directive to use a different code path, but note that epoll is the most efficient one for Linux.

If you see lower CPU usage with noepoll, that would be indicative of a haproxy bug.
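For anyone trying this, the directive lives in the global section; a minimal sketch of the relevant fragment (the rest of the configuration is omitted):

```
global
    noepoll
```

haproxy then falls back to the next available poller (poll on Linux); `haproxy -vv` lists the polling systems the build supports.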


#12

Thanks, I will try this and report back.


#13

I don’t know how long your strace ran, but you can clearly see that the 100% CPU could not have come from system time. You did, however, make 360K x 2 switches between user and kernel space, and it’s those switches that are penalized by the patches. YMMV, as they say.


#14

I ran the strace for about 10 seconds.
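At roughly 10 seconds, the counts from the first summary imply a very high transition rate; a back-of-the-envelope sketch (the syscall count comes from the earlier strace -c output, and the 10-second window is approximate):

```shell
# ~360069 syscalls were captured in a ~10 s window; each syscall costs one
# user->kernel and one kernel->user transition, and it is these transitions
# that the Meltdown/Spectre mitigations make more expensive.
calls=360069
seconds=10
echo "approx $((calls / seconds)) syscalls/sec ($((2 * calls / seconds)) mode switches/sec)"
# prints: approx 36006 syscalls/sec (72013 mode switches/sec)
```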


#15

Re-ran it for a minute:

% time     seconds  usecs/call     calls    errors syscall
------ ----------- ----------- --------- --------- ----------------
 36.56    0.028022           0   1062162           epoll_wait
 16.75    0.012841           0     65988       931 write
 11.10    0.008504           0     50267      9032 sendto
  7.34    0.005624           0    159514     48444 read
  6.27    0.004806           0     31518           close
  5.89    0.004511           0     66079     20134 recvfrom
  4.31    0.003301           0     19345     19050 connect
  2.96    0.002271           0     52184           epoll_ctl
  2.29    0.001755           0     18928           socket
  2.01    0.001544           0     22775     10241 accept4
  1.66    0.001271           0     12503        25 shutdown
  1.17    0.000894           0     36565           setsockopt
  0.68    0.000519           0     18928           fcntl
  0.67    0.000511           0     13011           getsockopt
  0.35    0.000270           0     10026           getsockname
------ ----------- ----------- --------- --------- ----------------
100.00    0.076644                 1639793    107857 total

#16

I turned the noepoll option on. CPU usage seems worse than usual from the start, but I’ll see what it’s like in a couple of hours.

% time     seconds  usecs/call     calls    errors syscall
------ ----------- ----------- --------- --------- ----------------
 99.39    3.293753          89     36864           poll
  0.13    0.004175           0     20488      3083 sendto
  0.12    0.004077           0     19606       446 write
  0.09    0.003060           0     64347     18843 read
  0.07    0.002219           0     13458           close
  0.05    0.001588           0      8183      8025 connect
  0.04    0.001286           0     20659      8632 recvfrom
  0.03    0.000929           0     21920           setsockopt
  0.02    0.000796           0      7960           socket
  0.02    0.000736           0     10053      4486 accept4
  0.02    0.000646           0      5509        11 shutdown
  0.01    0.000244           0      7960           fcntl
  0.01    0.000205           0      5692           getsockopt
  0.00    0.000113           0      4228           getsockname
  0.00    0.000000           0        37           brk
  0.00    0.000000           0         1           restart_syscall
------ ----------- ----------- --------- --------- ----------------
100.00    3.313827                  246965     43526 total

#17

After a couple of hours: while this has reduced the CPU load a little (now around 98%), it hasn’t had much real impact.

I’ve switched the load over to the failover unit, which is sitting at around 20% CPU, like the primary was pre-update.

Below is the strace for 1 minute on the failover unit (identical spec and config):

% time     seconds  usecs/call     calls    errors syscall
------ ----------- ----------- --------- --------- ----------------
 84.82    0.163654           2     70863           epoll_wait
  3.35    0.006461           0     37519       761 write
  3.23    0.006225           0     31749      7790 sendto
  1.72    0.003314           0     17878           close
  1.69    0.003256           0     84927     25444 read
  1.68    0.003242           0     40658     12424 recvfrom
  0.95    0.001833           0     11529     11238 connect
  0.66    0.001271           0     35572           epoll_ctl
  0.56    0.001090           0     13065      6127 accept4
  0.41    0.000783           0     11032           socket
  0.34    0.000647           0     21296           setsockopt
  0.20    0.000387           0      6586         3 shutdown
  0.18    0.000347           0     11032           fcntl
  0.14    0.000261           0      8442           getsockopt
  0.09    0.000174           0      5277           getsockname
  0.00    0.000000           0        21           brk
------ ----------- ----------- --------- --------- ----------------
100.00    0.192945                  407446     63787 total

#18

Doesn’t seem like an epoll issue within haproxy then (noepoll performance is worse). It’s possible that the Meltdown mitigation causes some issue in the kernel with polling overall, but I really don’t have an idea at this point.

If you could make that jump to 1.7.10 that would be really helpful to exclude bugs in that old 1.5.18 release.

If 1.7.10 has the same issue, then we will have to take a deeper look at that strace: not a summary, but the actual output of “strace -tt -p”. If you are concerned about posting it publicly (which is understandable), you can send it directly to @willy (I’m probably not able to spot the issue anyway).


#19

I’m seeing a huge number of poll/epoll calls compared to recv/send, which makes me think of a stuck event somewhere. We’ve had two recent fixes in 1.7 for half-open connections which could trigger such an issue; I don’t know whether older versions are affected as well or not. One of them affects splicing and the other one client-fin. We did backport the fixes to 1.6, however.

Regardless of this, it seems strange that a kernel update would reveal an existing bug and that reverting it would make the bug disappear. So it might well be that the kernel change has some effect (maybe just a timing effect triggering the issue).

As Lukas suggested, trying 1.7.10 would help a lot. If the problem disappears, we can try to see what fix could be missing from 1.5.


#20

Thanks, is there anything I should be aware of before jumping from 1.5.18 to 1.7.10?

I’ve got it installed on our dev and QA environments now and, with the exception of a few minor config changes, all seems well. Just keen to get production up and running smoothly as well.