High CPU usage when CLOSE_WAIT connections exist



I’m using https://github.com/mesosphere/marathon-lb to configure HAProxy. Because it changes the configuration whenever our microservice instances change, it creates a lot of haproxy processes. For example:

# pgrep haproxy -a
690 haproxy -p /tmp/haproxy.pid -f /marathon-lb/haproxy.cfg -D -sf 70155
3108 haproxy -p /tmp/haproxy.pid -f /marathon-lb/haproxy.cfg -D -sf 72735
7540 haproxy -p /tmp/haproxy.pid -f /marathon-lb/haproxy.cfg -D -sf 77012
8297 haproxy -p /tmp/haproxy.pid -f /marathon-lb/haproxy.cfg -D -sf 77058
9651 haproxy -p /tmp/haproxy.pid -f /marathon-lb/haproxy.cfg -D -sf 78452
10690 haproxy -p /tmp/haproxy.pid -f /marathon-lb/haproxy.cfg -D -sf 79475
15639 haproxy -p /tmp/haproxy.pid -f /marathon-lb/haproxy.cfg -D -sf 84966
15760 haproxy -p /tmp/haproxy.pid -f /marathon-lb/haproxy.cfg -D -sf 85082
16574 haproxy -p /tmp/haproxy.pid -f /marathon-lb/haproxy.cfg -D -sf 85637
16923 haproxy -p /tmp/haproxy.pid -f /marathon-lb/haproxy.cfg -D -sf 86235
17022 haproxy -p /tmp/haproxy.pid -f /marathon-lb/haproxy.cfg -D -sf 86278
17672 haproxy -p /tmp/haproxy.pid -f /marathon-lb/haproxy.cfg -D -sf 86375
18060 haproxy -p /tmp/haproxy.pid -f /marathon-lb/haproxy.cfg -D -sf 87011
18620 haproxy -p /tmp/haproxy.pid -f /marathon-lb/haproxy.cfg -D -sf 87398
19470 haproxy -p /tmp/haproxy.pid -f /marathon-lb/haproxy.cfg -D -sf 87809
20350 haproxy -p /tmp/haproxy.pid -f /marathon-lb/haproxy.cfg -D -sf 88653
52146 haproxy -p /tmp/haproxy.pid -f /marathon-lb/haproxy.cfg -D -sf 20253
52339 haproxy -p /tmp/haproxy.pid -f /marathon-lb/haproxy.cfg -D -sf 20744
53367 haproxy -p /tmp/haproxy.pid -f /marathon-lb/haproxy.cfg -D -sf 20934
53468 haproxy -p /tmp/haproxy.pid -f /marathon-lb/haproxy.cfg -D -sf 21957
53710 haproxy -p /tmp/haproxy.pid -f /marathon-lb/haproxy.cfg -D -sf 22058
54324 haproxy -p /tmp/haproxy.pid -f /marathon-lb/haproxy.cfg -D -sf 22295
54967 haproxy -p /tmp/haproxy.pid -f /marathon-lb/haproxy.cfg -D -sf 23482
55476 haproxy -p /tmp/haproxy.pid -f /marathon-lb/haproxy.cfg -D -sf 23537
55796 haproxy -p /tmp/haproxy.pid -f /marathon-lb/haproxy.cfg -D -sf 24162
55987 haproxy -p /tmp/haproxy.pid -f /marathon-lb/haproxy.cfg -D -sf 24199
56180 haproxy -p /tmp/haproxy.pid -f /marathon-lb/haproxy.cfg -D -sf 24546
56246 haproxy -p /tmp/haproxy.pid -f /marathon-lb/haproxy.cfg -D -sf 24582
56519 haproxy -p /tmp/haproxy.pid -f /marathon-lb/haproxy.cfg -D -sf 24879
57201 haproxy -p /tmp/haproxy.pid -f /marathon-lb/haproxy.cfg -D -sf 25317
57546 haproxy -p /tmp/haproxy.pid -f /marathon-lb/haproxy.cfg -D -sf 25568
57774 haproxy -p /tmp/haproxy.pid -f /marathon-lb/haproxy.cfg -D -sf 26080
60398 haproxy -p /tmp/haproxy.pid -f /marathon-lb/haproxy.cfg -D -sf 30723
60783 haproxy -p /tmp/haproxy.pid -f /marathon-lb/haproxy.cfg -D -sf 31118
89062 haproxy -p /tmp/haproxy.pid -f /marathon-lb/haproxy.cfg -D -sf 56479
89509 haproxy -p /tmp/haproxy.pid -f /marathon-lb/haproxy.cfg -D -sf 57252
89784 haproxy -p /tmp/haproxy.pid -f /marathon-lb/haproxy.cfg -D -sf 57531
90949 haproxy -p /tmp/haproxy.pid -f /marathon-lb/haproxy.cfg -D -sf 58144
91675 haproxy -p /tmp/haproxy.pid -f /marathon-lb/haproxy.cfg -D -sf 59389
93436 haproxy -p /tmp/haproxy.pid -f /marathon-lb/haproxy.cfg -D -sf 59422
93677 haproxy -p /tmp/haproxy.pid -f /marathon-lb/haproxy.cfg -D -sf 61172

An old haproxy process exits once all of its connections have finished. But sometimes a process is left with only connections in the CLOSE_WAIT state, and it then consumes 100% of the CPU.

# lsof -i | awk '{if($1 == "haproxy") print $2 " " $10}' | sort -u
10690 (ESTABLISHED)
15760 (ESTABLISHED)
16574 (ESTABLISHED)
16923 (ESTABLISHED)
17022 (ESTABLISHED)
17672 (ESTABLISHED)
18060 (ESTABLISHED)
18620 (ESTABLISHED)
19470 (ESTABLISHED)
20350 (CLOSE_WAIT)
20350 (ESTABLISHED)
20350 (LISTEN)
3108 (ESTABLISHED)
52146 (ESTABLISHED)
52339 (ESTABLISHED)
53367 (ESTABLISHED)
53468 (ESTABLISHED)
53710 (ESTABLISHED)
54324 (ESTABLISHED)
54967 (ESTABLISHED)
55476 (ESTABLISHED)
55796 (ESTABLISHED)
55987 (ESTABLISHED)
56180 (ESTABLISHED)
56246 (ESTABLISHED)
56519 (ESTABLISHED)
57201 (ESTABLISHED)
57546 (ESTABLISHED)
57774 (ESTABLISHED)
60398 (CLOSE_WAIT)
60783 (CLOSE_WAIT)
690 (ESTABLISHED)
7540 (ESTABLISHED)
8297 (ESTABLISHED)
89062 (ESTABLISHED)
89509 (ESTABLISHED)
89784 (ESTABLISHED)
90949 (ESTABLISHED)
91675 (ESTABLISHED)
93436 (ESTABLISHED)
93677 (ESTABLISHED)

Here, processes 60398 and 60783 have only CLOSE_WAIT connections.

# lsof -i | awk '{if($2 == "60398") print $2 " " $9 " " $10}'
60398 mesos-lb-3.mydomain:35819->leia-5.mydomain:31302 (CLOSE_WAIT)
# lsof -i | awk '{if($2 == "60783") print $2 " " $9 " " $10}'
60783 mesos-lb-3.mydomain:37419->leia-8.mydomain:31682 (CLOSE_WAIT)
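
To avoid picking these processes out by eye, the check above can be sketched as a small filter. This is only a rough sketch: `only_close_wait` is a name I made up, and it assumes its input is in the same "PID (STATE)" format that the lsof/awk pipeline above prints.

```shell
# Hypothetical helper: reads "PID (STATE)" pairs on stdin and prints every PID
# whose sockets are all in CLOSE_WAIT (no ESTABLISHED, LISTEN, etc.).
only_close_wait() {
  awk '
    NF == 0 { next }                          # skip blank lines
    { seen[$1] = 1 }                          # every PID that has a socket
    $2 != "(CLOSE_WAIT)" { other[$1] = 1 }    # PID has some other socket state
    END { for (pid in seen) if (!(pid in other)) print pid }
  ' | sort -n
}

# Usage on a live host (same pipeline as above, commented out here):
# lsof -i | awk '{if($1 == "haproxy") print $2 " " $10}' | sort -u | only_close_wait
```

Fed with the listing above, it should print only 60398 and 60783, since 20350 also has ESTABLISHED and LISTEN sockets.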

The strace output of both shows the same result:

poll(0x7fe774e2e010, 0, 0)              = 0 (Timeout)
poll(0x7fe774e2e010, 0, 0)              = 0 (Timeout)
poll(0x7fe774e2e010, 0, 0)              = 0 (Timeout)
poll(0x7fe774e2e010, 0, 0)              = 0 (Timeout)
poll(0x7fe774e2e010, 0, 0)              = 0 (Timeout)
poll(0x7fe774e2e010, 0, 0)              = 0 (Timeout)
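
A poll() call with zero descriptors and a zero timeout returns immediately, so a continuous stream of them looks like a busy loop, which would explain the CPU usage. To put a number on it, something like the sketch below could count those calls in captured strace output. `count_busy_polls` is a made-up name, and the pattern assumes the lines look exactly like the ones above.

```shell
# Hypothetical helper: counts strace lines that are poll() calls with
# nfds == 0 and timeout == 0, reading the strace output from stdin.
count_busy_polls() {
  # grep -c prints the count; "|| true" hides its non-zero exit when the count is 0.
  grep -Ec 'poll\([^,]*, 0, 0\)' || true
}

# Usage on a live host (commented out; strace writes to stderr and needs
# permission to attach to the process):
# timeout 5 strace -p 60398 2>&1 | count_busy_polls
```

Dividing the count by the capture time gives a rough calls-per-second figure for each stuck process.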

Running tcpdump -vv port 35819 showed nothing. The same goes for port 37419.

About the environment:

# haproxy -vv
HA-Proxy version 1.6.9 2016/08/30
Copyright 2000-2016 Willy Tarreau <willy@haproxy.org>

Build options :
  TARGET  = custom
  CPU     = x86_64
  CC      = gcc
  CFLAGS  = -g -fno-strict-aliasing -Wdeclaration-after-statement
  OPTIONS = USE_LINUX_SPLICE=1 USE_LINUX_TPROXY=1 USE_LIBCRYPT=1 USE_ZLIB=1 USE_POLL=default USE_DL=1 USE_OPENSSL=1 USE_LUA=1 USE_PCRE=1 USE_PCRE_JIT=1

Default settings :
  maxconn = 2000, bufsize = 16384, maxrewrite = 1024, maxpollevents = 200

Encrypted password support via crypt(3): yes
Built with zlib version : 1.2.8
Compression algorithms supported : identity("identity"), deflate("deflate"), raw-deflate("deflate"), gzip("gzip")
Built with OpenSSL version : OpenSSL 1.0.2j  26 Sep 2016
Running on OpenSSL version : OpenSSL 1.0.2j  26 Sep 2016
OpenSSL library supports TLS extensions : yes
OpenSSL library supports SNI : yes
OpenSSL library supports prefer-server-ciphers : yes
Built with PCRE version : 8.39 2016-06-14
PCRE library supports JIT : yes
Built with Lua version : Lua 5.3.3
Built with transparent proxy support using: IP_TRANSPARENT IPV6_TRANSPARENT IP_FREEBIND

Available polling systems :
       poll : pref=200,  test result OK
     select : pref=150,  test result OK
Total: 2 (2 usable), will use poll.

It’s running in Docker, and the image is based on debian:stretch.
The docker host info is:

# lsb_release -a
No LSB modules are available.
Distributor ID:	Ubuntu
Description:	Ubuntu 14.04.5 LTS
Release:	14.04
Codename:	trusty

# uname -a
Linux mesos-lb-4.mydomain 3.13.0-98-generic #145-Ubuntu SMP Sat Oct 8 20:13:07 UTC 2016 x86_64 x86_64 x86_64 GNU/Linux

# docker version
Client:
 Version:      1.10.3
 API version:  1.22
 Go version:   go1.5.3
 Git commit:   20f81dd
 Built:        Thu Mar 10 15:54:52 2016
 OS/Arch:      linux/amd64

Server:
 Version:      1.10.3
 API version:  1.22
 Go version:   go1.5.3
 Git commit:   20f81dd
 Built:        Thu Mar 10 15:54:52 2016
 OS/Arch:      linux/amd64

Any idea what’s happening?
Thanks for your attention.