HAProxy 1.5.18 CentOS 6 CPU Issue

I don’t think you should see any major surprises. Minor config changes that get rejected where they weren’t before, yes.

But silent failures in the background that you wouldn’t notice in QA/testing? I don’t think so.

I’ve had 1.7.10 running in production for nearly 2 hours now and the CPU usage looks much better. It’s still higher than it was before (to be expected), but I’m now sitting at around 30-50%, which is fine.

It looks like it must have been a bug in 1.5.18 that was causing the CPU issue.

Thanks again for your assistance.

Great.

This could also be something that was fixed in 1.5.19. If you want to make sure, you could give 1.5.19 a try as well - but 1.7.10 is the release I’d recommend for this deployment at this point.

Strangely, after being left overnight, this has gone back to 100% CPU on 1.7.10.

James, I just checked the trace you sent. I’m seeing a lot of epoll_wait()=0 in it, indicating a bug in haproxy. There is probably a timeout that fails to be handled, or something like that, which we need to investigate. I’m also seeing a lot of small recvfrom() and write() calls, typically 4 bytes. The latter explains why you observe a higher CPU load than before the kernel patches when it’s not at 100%.

Regarding the first point (the epoll calls), I took a look at the trace and found a few ECONNRESET returned for read() calls made by the SSL stack. It could be that we occasionally hit a case where an SSL error is not properly reported through the whole stack and we then continue to wait for missing data, for example. The problem is that all the epoll_wait() calls have an empty timeout, indicating something is stuck somewhere.
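To illustrate what such a trace corresponds to at the code level, here is a minimal sketch in plain C (a generic epoll loop, not haproxy’s actual poller) of the busy-wait pattern described above: epoll_wait() is called with a zero timeout, finds nothing ready, returns 0 immediately, and the loop spins at full speed instead of sleeping until the next event or timer.

 /* Minimal sketch, not haproxy code: a poller that ends up with a zero
  * timeout and therefore never blocks, which is what the repeated
  * epoll_wait()=0 entries in the trace correspond to. */
 #include <sys/epoll.h>

 int main(void)
 {
         struct epoll_event events[64];
         int epfd = epoll_create1(0);

         if (epfd < 0)
                 return 1;

         for (;;) {
                 /* timeout == 0: return immediately even if nothing is ready,
                  * so with no pending events this returns 0 over and over and
                  * the process sits at 100% CPU doing nothing useful. */
                 int n = epoll_wait(epfd, events, 64, 0);
                 if (n <= 0)
                         continue;
                 /* ... handle the n ready events here ... */
         }
 }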

If you could also post the output of a "show info" and a "show sess all" executed on your CLI while the problem is in action, that would be awesome. Be careful, the latter will be huge, so send it directly to a file, like this:

echo "show sess all" | socat - /var/run/haproxy.sock > show-sess-all.log

Also note that it will reveal some information (source addresses, etc).

However, on a positive note, despite eating your CPU, the problem should not cause any slowdown of the service. The reason it’s at 100% CPU is that it polls at full speed, never waiting for a new event to come. The biggest problem, in fact, is that it makes it harder to observe the real activity and the available CPU margin.

It’s really interesting that this started to happen after a kernel change. I can hardly see why, but maybe the way a specific event or error is reported has triggered a different code path in haproxy and revealed a bug.

Hi Willy, thank you for looking at this. Please see the show info output below; I’ll PM the show sess all again.

Thanks again for your help

James

Name: HAProxy
Version: 1.7.10-a7dcc3b
Release_date: 2018/01/02
Nbproc: 1
Process_num: 1
Pid: 15315
Uptime: 0d 23h15m23s
Uptime_sec: 83723
Memmax_MB: 0
PoolAlloc_MB: 46
PoolUsed_MB: 12
PoolFailed: 0
Ulimit-n: 16365
Maxsock: 16365
Maxconn: 8000
Hard_maxconn: 8000
CurrConns: 3401
CumConns: 9782784
CumReq: 23369840
MaxSslConns: 0
CurrSslConns: 3356
CumSslConns: 8375074
Maxpipes: 0
PipesUsed: 0
PipesFree: 0
ConnRate: 156
ConnRateLimit: 0
MaxConnRate: 644
SessRate: 156
SessRateLimit: 0
MaxSessRate: 644
SslRate: 143
SslRateLimit: 0
MaxSslRate: 631
SslFrontendKeyRate: 39
SslFrontendMaxKeyRate: 195
SslFrontendSessionReuse_pct: 73
SslBackendKeyRate: 0
SslBackendMaxKeyRate: 0
SslCacheLookups: 6411951
SslCacheMisses: 88307
CompressBpsIn: 0
CompressBpsOut: 0
CompressBpsRateLim: 0
ZlibMemUsage: 0
MaxZlibMemUsage: 0
Tasks: 3750
Run_queue: 1
Idle_pct: 39
node:

Excellent, thank you! I found the problematic one; now we have to find why/how it entered this state. It’s the first one in your show sess, and I’m reproducing it anonymized below :-)

0x3aa61c0: [10/Jan/2018:23:28:03.338249] id=11356572 proto=tcpv4 source=XX.XX.XX.XXX:20639
flags=0x1ce, conn_retries=2, srv_conn=0x2548ea0, pend_pos=(nil)
frontend=xxxx-xxxxxxxx-xxx (id=8 mode=tcp), listener=? (id=1) addr=XXX.XXX.XXX.XX:25
backend=xxxx-xxxxxxxx-xxx (id=51 mode=tcp) addr=XXX.XXX.X.X:33522
server=XXXXXXXX (id=3) addr=XXX.XXX.X.XX:2507
task=0x34bc840 (state=0x04 nice=0 calls=-310869292 exp=, running age=13h40m)
si[0]=0x3aa63f8 (state=CLO flags=0x00 endp0=CONN:0x44e0010 exp=, et=0x000)
si[1]=0x3aa6420 (state=EST flags=0x10 endp1=CONN:0x6d98380 exp=, et=0x000)
co0=0x44e0010 ctrl=tcpv4 xprt=RAW data=STRM target=LISTENER:0x232ccf0
flags=0x802eb300 fd=946 fd.state=22 fd.cache=0 updt=0
co1=0x6d98380 ctrl=tcpv4 xprt=RAW data=STRM target=SERVER:0x2548ea0
flags=0x0020b300 fd=1086 fd.state=22 fd.cache=0 updt=0
req=0x3aa61d0 (f=0x80a020 an=0x0 pipe=0 tofwd=-1 total=0)
an_exp= rex= wex=?
buf=0x6ee290 data=0x6ee2a4 o=0 p=0 req.next=0 i=0 size=0
res=0x3aa6210 (f=0x8000a020 an=0x0 pipe=0 tofwd=0 total=0)
an_exp= rex=? wex=
buf=0x6ee290 data=0x6ee2a4 o=0 p=0 rsp.next=0 i=0 size=0

exp= is the reason why the poller loops. It wakes up because the expiration time was reached, but for some reason it is not processed, and it goes back to sleep again. Its state is unusual: closed with the client, yet still established with the server.

I’ll have to inspect all the flags now. In the meantime, you can kill it this way on the CLI:

shutdown session 0x3aa61c0

Since I don’t know why the timeout condition is ignored, there’s a small possibility that the shutdown will be ignored as well.

Thanks, I’ve issued the shutdown, but unfortunately it has been ignored. The session is still there.

Interestingly (maybe), that’s not an HTTP session; it’s a connection to a mail server.

Thanks again

OK. I notice that among the ~3400 connections in this dump, only 2 faced a redispatch, indicating a failure to first establish a connection to the server (this could possibly result from the kernel upgrade). The fact that the faulty connection was one of them, and that the client side was closed, makes me think there might be a race if the client closes during the redispatch, or something like that.

Actually, I’ve just taken a look at another box in the cluster. This is the one I was (and still am) running 1.5.18 on when the problem occurred originally. I switched over to the 1.7.10 box by changing the VRRP priorities, so HAProxy hasn’t actually been restarted on it yet.

Anyway, its CPU is still at 100% and it has 11 sessions still open, all of them TCP connections to mail servers.

James, I suspect there’s a bug in the way we deal with client aborts upon redispatch. At least I’m not seeing anything preventing us from retrying a connection that was just aborted. Could you please try to apply the following patch? (Sorry for the format, it’s not very convenient in a web form.)

 diff --git a/src/stream.c b/src/stream.c
 index 2298111..0b643b6 100644
 --- a/src/stream.c
 +++ b/src/stream.c
 @@ -746,7 +746,7 @@ static void sess_establish(struct stream *s)
 static int check_req_may_abort(struct channel *req, struct stream *s)
 {
        return ((req->flags & (CF_READ_ERROR)) ||
 -               ((req->flags & CF_SHUTW_NOW) &&  /* empty and client aborted */
 +               ((req->flags & (CF_SHUTW_NOW|CF_SHUTW)) &&  /* empty and client aborted */
                 (channel_is_empty(req) || s->be->options & PR_O_ABRT_CLOSE)));
 }

So basically change this line:
((req->flags & CF_SHUTW_NOW) && /* empty and client aborted */

To this?:
((req->flags & (CF_SHUTW_NOW|CF_SHUTW)) && /* empty and client aborted */

I’ve done this on my dev and QA servers. Will roll this out to production tonight and report back.

Thanks again

Yes, that’s exactly the change. Until now it only aborts when there is a pending connection close. But since SMTP is a banner protocol, the buffer is empty and the SHUTW_NOW immediately turns into SHUTW. So if this happens during a redispatch, I don’t see a situation where we properly take care of it. That might explain why it was not reported earlier, as banner protocols are less often used.
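For clarity, this is how the complete check_req_may_abort() function should read once the patch above is applied (reconstructed from the diff, with a comment added here to summarize the change):

 /* check_req_may_abort() after the patch: the pending connection attempt may
  * now be aborted not only when a close is pending (CF_SHUTW_NOW) but also
  * when the write side has already been shut down (CF_SHUTW), which is what
  * happens with banner protocols such as SMTP where the request buffer stays
  * empty. */
 static int check_req_may_abort(struct channel *req, struct stream *s)
 {
        return ((req->flags & (CF_READ_ERROR)) ||
                ((req->flags & (CF_SHUTW_NOW|CF_SHUTW)) &&  /* empty and client aborted */
                 (channel_is_empty(req) || s->be->options & PR_O_ABRT_CLOSE)));
 }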

Great, thanks. That makes sense

I’ve rolled this out to our production servers now and all seems OK.

Will report back once it’s been left to soak for a few hours.

Thanks again

This looks to be resolved now.

Thank you again for your help on this.

Excellent, I’ll merge it now.

Thanks!

Willy, why do you think James ran into this problem only now? He was probably using SMTP before.

I was indeed using SMTP before (and have been for a long time).

If I understand correctly, James started seeing this problem because connection establishment to the backend servers began to fail, which it previously did not. This could have been triggered by the kernel upgrade.