HAProxy 1.5.18 CentOS 6 CPU Issue

I don’t think you should see any major surprises. Minor config changes that get rejected where they weren’t before, yes.

But silent failures in the background that you wouldn’t notice in QA/testing? I don’t think so.

I’ve had 1.7.10 running in production for nearly 2 hours now and the CPU usage looks much better. It’s still higher than it was before (to be expected), but I’m now sitting at around 30-50%, which is fine.

It looks like it must have been a bug in 1.5.18 that was causing the CPU issue.

Thanks again for your assistance.

Great.

This could also be something that was fixed in 1.5.19. If you want to make sure, you could give 1.5.19 a try as well - but 1.7.10 is the release I’d recommend for this deployment at this point.

Strangely, after being left overnight, this has gone back to 100% CPU on 1.7.10.

James, I just checked the trace you sent. I’m seeing a lot of epoll_wait()=0 in it, indicating a bug in haproxy. There is probably a timeout that fails to be handled, or something like that, which we need to investigate. I’m also seeing a lot of small recvfrom() and write() calls, typically 4 bytes. The latter explains why you observe a higher CPU load than before the kernel patches when it’s not at 100%.

Regarding the first point (the epoll calls), I took a look at the trace and found a few ECONNRESET returned for read() calls made by the SSL stack. It could be that we occasionally hit a case where an SSL error is not properly reported through the whole stack and we then continue to wait for missing data, for example. The problem is that all the epoll_wait() calls have an empty timeout, indicating something is stuck somewhere.
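To illustrate what such a trace corresponds to at the code level, here is a minimal sketch in plain C (a generic epoll loop, not haproxy’s actual poller) of the busy-wait pattern described above: epoll_wait() is called with a zero timeout, finds nothing ready, returns 0 immediately, and the loop spins at full speed instead of sleeping until the next event or timer.

 /* Minimal sketch, not haproxy code: a poller that ends up with a zero
  * timeout and therefore never blocks, which is what the repeated
  * epoll_wait()=0 entries in the trace correspond to. */
 #include <sys/epoll.h>

 int main(void)
 {
         struct epoll_event events[64];
         int epfd = epoll_create1(0);

         if (epfd < 0)
                 return 1;

         for (;;) {
                 /* timeout == 0: return immediately even if nothing is ready,
                  * so with no pending events this returns 0 over and over and
                  * the process sits at 100% CPU doing nothing useful. */
                 int n = epoll_wait(epfd, events, 64, 0);
                 if (n <= 0)
                         continue;
                 /* ... handle the n ready events here ... */
         }
 }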

If you could also post the output of a "show info" and a "show sess all" executed on your CLI while the problem is in action, that would be awesome. Be careful, the latter will be huge, so send it directly to a file, like this:

echo "show sess all" | socat - /var/run/haproxy.sock > show-sess-all.log

Also note that it will reveal some information (source addresses, etc).

However, on a positive note, despite eating your CPU, the problem should not cause any slowdown of the service. The reason it’s at 100% CPU is that it polls at full speed, never waiting for a new event to come. The biggest problem, in fact, is that it makes it harder to observe the real activity and the available CPU margin.

It’s really interesting that this started to happen after a kernel change. I can hardly see why, but maybe the way a specific event or error is reported has triggered a different code path in haproxy and revealed a bug.

Hi Willy, thank you for looking at this. Please see the show info output below; I’ll PM the show sess all again.

Thanks again for your help

James

Name: HAProxy
Version: 1.7.10-a7dcc3b
Release_date: 2018/01/02
Nbproc: 1
Process_num: 1
Pid: 15315
Uptime: 0d 23h15m23s
Uptime_sec: 83723
Memmax_MB: 0
PoolAlloc_MB: 46
PoolUsed_MB: 12
PoolFailed: 0
Ulimit-n: 16365
Maxsock: 16365
Maxconn: 8000
Hard_maxconn: 8000
CurrConns: 3401
CumConns: 9782784
CumReq: 23369840
MaxSslConns: 0
CurrSslConns: 3356
CumSslConns: 8375074
Maxpipes: 0
PipesUsed: 0
PipesFree: 0
ConnRate: 156
ConnRateLimit: 0
MaxConnRate: 644
SessRate: 156
SessRateLimit: 0
MaxSessRate: 644
SslRate: 143
SslRateLimit: 0
MaxSslRate: 631
SslFrontendKeyRate: 39
SslFrontendMaxKeyRate: 195
SslFrontendSessionReuse_pct: 73
SslBackendKeyRate: 0
SslBackendMaxKeyRate: 0
SslCacheLookups: 6411951
SslCacheMisses: 88307
CompressBpsIn: 0
CompressBpsOut: 0
CompressBpsRateLim: 0
ZlibMemUsage: 0
MaxZlibMemUsage: 0
Tasks: 3750
Run_queue: 1
Idle_pct: 39
node:

Excellent, thank you! I found the problematic one; now we have to find why/how it entered this state. It’s the first one in your show sess, and I’m reproducing it anonymized below :-)

0x3aa61c0: [10/Jan/2018:23:28:03.338249] id=11356572 proto=tcpv4 source=XX.XX.XX.XXX:20639
flags=0x1ce, conn_retries=2, srv_conn=0x2548ea0, pend_pos=(nil)
frontend=xxxx-xxxxxxxx-xxx (id=8 mode=tcp), listener=? (id=1) addr=XXX.XXX.XXX.XX:25
backend=xxxx-xxxxxxxx-xxx (id=51 mode=tcp) addr=XXX.XXX.X.X:33522
server=XXXXXXXX (id=3) addr=XXX.XXX.X.XX:2507
task=0x34bc840 (state=0x04 nice=0 calls=-310869292 exp=, running age=13h40m)
si[0]=0x3aa63f8 (state=CLO flags=0x00 endp0=CONN:0x44e0010 exp=, et=0x000)
si[1]=0x3aa6420 (state=EST flags=0x10 endp1=CONN:0x6d98380 exp=, et=0x000)
co0=0x44e0010 ctrl=tcpv4 xprt=RAW data=STRM target=LISTENER:0x232ccf0
flags=0x802eb300 fd=946 fd.state=22 fd.cache=0 updt=0
co1=0x6d98380 ctrl=tcpv4 xprt=RAW data=STRM target=SERVER:0x2548ea0
flags=0x0020b300 fd=1086 fd.state=22 fd.cache=0 updt=0
req=0x3aa61d0 (f=0x80a020 an=0x0 pipe=0 tofwd=-1 total=0)
an_exp= rex= wex=?
buf=0x6ee290 data=0x6ee2a4 o=0 p=0 req.next=0 i=0 size=0
res=0x3aa6210 (f=0x8000a020 an=0x0 pipe=0 tofwd=0 total=0)
an_exp= rex=? wex=
buf=0x6ee290 data=0x6ee2a4 o=0 p=0 rsp.next=0 i=0 size=0

exp= is the reason why the poller loops. It wakes up because the expiration time was reached, but for some reason it is not processed, and it goes back to sleep again. Its state is unusual: closed with the client, yet still established with the server.

I’ll have to inspect all the flags now. In the meantime, you can kill it this way on the CLI:

shutdown session 0x3aa61c0

Since I don’t know why the timeout condition is ignored, there’s a small possibility that the shutdown will be ignored as well.

Thanks, I’ve issued the shutdown, but unfortunately it has been ignored. The session is still there.

Interestingly (maybe), that’s not an HTTP session; it’s a connection to a mail server.

Thanks again

OK. I notice that among the ~3400 connections in this dump, only 2 faced a redispatch, indicating a failure to first establish a connection to the server (this could possibly result from the kernel upgrade). The fact that the faulty connection was one of them, and that the client side was closed, makes me think there might be a race if the client closes during the redispatch, or something like that.

Actually, I’ve just taken a look at another box in the cluster. This is the one I was (and still am) running 1.5.18 on when the problem occurred originally. I switched over to the 1.7.10 box by changing the VRRP priorities, so HAProxy hasn’t actually been restarted on it yet.

Anyway, its CPU is still at 100% and it has 11 sessions still open, all of them TCP connections to mail servers.

James, I suspect there’s a bug in the way we deal with client aborts upon redispatch. At least I’m not seeing anything preventing us from retrying a connection that was just aborted. Could you please try to apply the following patch? (Sorry for the format, it’s not very convenient in a web form.)

 diff --git a/src/stream.c b/src/stream.c
 index 2298111..0b643b6 100644
 --- a/src/stream.c
 +++ b/src/stream.c
 @@ -746,7 +746,7 @@ static void sess_establish(struct stream *s)
 static int check_req_may_abort(struct channel *req, struct stream *s)
 {
        return ((req->flags & (CF_READ_ERROR)) ||
 -               ((req->flags & CF_SHUTW_NOW) &&  /* empty and client aborted */
 +               ((req->flags & (CF_SHUTW_NOW|CF_SHUTW)) &&  /* empty and client aborted */
                 (channel_is_empty(req) || s->be->options & PR_O_ABRT_CLOSE)));
 }

So basically change this line:
((req->flags & CF_SHUTW_NOW) && /* empty and client aborted */

To this?:
((req->flags & (CF_SHUTW_NOW|CF_SHUTW)) && /* empty and client aborted */

I’ve done this on my dev and QA servers. Will roll this out to production tonight and report back.

Thanks again

Yes, that’s exactly the change. Until now it only aborts when there is a pending connection close. But since SMTP is a banner protocol, the buffer is empty and the SHUTW_NOW immediately turns into SHUTW. So if this happens during a redispatch, I don’t see a situation where we properly take care of it. That might explain why it was not reported earlier, as banner protocols are less often used.
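For clarity, this is how the complete check_req_may_abort() function should read once the patch above is applied (reconstructed from the diff, with a comment added here to summarize the change):

 /* check_req_may_abort() after the patch: the pending connection attempt may
  * now be aborted not only when a close is pending (CF_SHUTW_NOW) but also
  * when the write side has already been shut down (CF_SHUTW), which is what
  * happens with banner protocols such as SMTP where the request buffer stays
  * empty. */
 static int check_req_may_abort(struct channel *req, struct stream *s)
 {
        return ((req->flags & (CF_READ_ERROR)) ||
                ((req->flags & (CF_SHUTW_NOW|CF_SHUTW)) &&  /* empty and client aborted */
                 (channel_is_empty(req) || s->be->options & PR_O_ABRT_CLOSE)));
 }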

Great, thanks. That makes sense

I’ve rolled this out to our production servers now and all seems OK.

Will report back once it’s been left to soak for a few hours.

Thanks again

This looks to be resolved now.

Thank you again for your help on this.

Excellent, I’ll merge it now.

Thanks!

Willy, why do you think James ran into this problem only now? He was probably using SMTP before.

I was indeed using SMTP before (and have been for a long time).

If I understand correctly, James started seeing this problem because connection establishment to the backend servers began to fail, which it previously did not. This could have been triggered by the kernel upgrade.