Hi lukastribus,
this weekend the described problem occurred again (on native hardware), after we experienced a higher-than-usual load for our circumstances. The HAProxy process no longer responded to any requests this morning, so I ran strace, collected some data, and then restarted the service. The remaining two HAProxies (VMs) were not affected. All three systems have the same configuration except for the CPU mapping, because ndmzlb5 has 8 cores instead of 4.
The load peak occurred on Friday at ~ 20:04. From then on HAProxy kept a higher CPU utilization of around 19% and stopped handling requests on Saturday at around 23:53. From Saturday 23:53 until this morning there are no HAProxy log entries at all. The operating system itself and other processes were not affected.
CPU Load on affected HAProxy:
Load outgoing internet connection through HAProxy:
CPU Load on the remaining HAProxies:
Output of ps on ndmzlb5:
Mo 11. Feb 07:58:02 CET 2019
07:58:02 up 9 days, 21:56, 1 user, load average: 8,01, 8,04, 8,05
USER PID %CPU %MEM VSZ RSS TTY STAT START TIME COMMAND
root 2 0.0 0.0 0 0 ? S Feb01 0:04 [kthreadd]
root 3 0.0 0.0 0 0 ? S Feb01 0:40 \_ [ksoftirqd/0]
root 5 0.0 0.0 0 0 ? S< Feb01 0:00 \_ [kworker/0:0H]
root 7 0.0 0.0 0 0 ? S Feb01 0:08 \_ [migration/0]
root 8 0.0 0.0 0 0 ? S Feb01 0:00 \_ [rcu_bh]
root 9 0.4 0.0 0 0 ? S Feb01 70:15 \_ [rcu_sched]
root 10 0.0 0.0 0 0 ? S< Feb01 0:00 \_ [lru-add-drain]
root 11 0.0 0.0 0 0 ? S Feb01 0:07 \_ [watchdog/0]
root 12 0.0 0.0 0 0 ? S Feb01 0:02 \_ [watchdog/1]
root 13 0.0 0.0 0 0 ? S Feb01 0:06 \_ [migration/1]
root 14 0.0 0.0 0 0 ? S Feb01 0:38 \_ [ksoftirqd/1]
root 16 0.0 0.0 0 0 ? S< Feb01 0:00 \_ [kworker/1:0H]
root 17 0.0 0.0 0 0 ? S Feb01 0:02 \_ [watchdog/2]
root 18 0.0 0.0 0 0 ? S Feb01 0:06 \_ [migration/2]
root 19 0.0 0.0 0 0 ? S Feb01 0:44 \_ [ksoftirqd/2]
root 21 0.0 0.0 0 0 ? S< Feb01 0:00 \_ [kworker/2:0H]
root 22 0.0 0.0 0 0 ? S Feb01 0:02 \_ [watchdog/3]
root 23 0.0 0.0 0 0 ? S Feb01 0:06 \_ [migration/3]
root 24 0.0 0.0 0 0 ? S Feb01 0:42 \_ [ksoftirqd/3]
root 26 0.0 0.0 0 0 ? S< Feb01 0:00 \_ [kworker/3:0H]
root 27 0.0 0.0 0 0 ? S Feb01 0:02 \_ [watchdog/4]
root 28 0.0 0.0 0 0 ? S Feb01 0:08 \_ [migration/4]
root 29 0.0 0.0 0 0 ? S Feb01 0:17 \_ [ksoftirqd/4]
root 31 0.0 0.0 0 0 ? S< Feb01 0:00 \_ [kworker/4:0H]
root 32 0.0 0.0 0 0 ? S Feb01 0:02 \_ [watchdog/5]
root 33 0.0 0.0 0 0 ? S Feb01 0:04 \_ [migration/5]
root 34 0.0 0.0 0 0 ? S Feb01 0:19 \_ [ksoftirqd/5]
root 36 0.0 0.0 0 0 ? S< Feb01 0:00 \_ [kworker/5:0H]
root 37 0.0 0.0 0 0 ? S Feb01 0:02 \_ [watchdog/6]
root 38 0.0 0.0 0 0 ? S Feb01 0:06 \_ [migration/6]
root 39 0.0 0.0 0 0 ? S Feb01 0:20 \_ [ksoftirqd/6]
root 41 0.0 0.0 0 0 ? S< Feb01 0:00 \_ [kworker/6:0H]
root 42 0.0 0.0 0 0 ? S Feb01 0:02 \_ [watchdog/7]
root 43 0.0 0.0 0 0 ? S Feb01 0:08 \_ [migration/7]
root 44 0.0 0.0 0 0 ? S Feb01 0:44 \_ [ksoftirqd/7]
root 46 0.0 0.0 0 0 ? S< Feb01 0:00 \_ [kworker/7:0H]
root 48 0.0 0.0 0 0 ? S Feb01 0:00 \_ [kdevtmpfs]
root 49 0.0 0.0 0 0 ? S< Feb01 0:00 \_ [netns]
root 50 0.0 0.0 0 0 ? S Feb01 0:00 \_ [khungtaskd]
root 51 0.0 0.0 0 0 ? S< Feb01 0:00 \_ [writeback]
root 52 0.0 0.0 0 0 ? S< Feb01 0:00 \_ [kintegrityd]
root 53 0.0 0.0 0 0 ? S< Feb01 0:00 \_ [bioset]
root 54 0.0 0.0 0 0 ? S< Feb01 0:00 \_ [bioset]
root 55 0.0 0.0 0 0 ? S< Feb01 0:00 \_ [bioset]
root 56 0.0 0.0 0 0 ? S< Feb01 0:00 \_ [kblockd]
root 57 0.0 0.0 0 0 ? S< Feb01 0:00 \_ [md]
root 58 0.0 0.0 0 0 ? S< Feb01 0:00 \_ [edac-poller]
root 59 0.0 0.0 0 0 ? S< Feb01 0:00 \_ [watchdogd]
root 72 0.0 0.0 0 0 ? S Feb01 0:00 \_ [kswapd0]
root 73 0.0 0.0 0 0 ? SN Feb01 0:00 \_ [ksmd]
root 74 0.0 0.0 0 0 ? SN Feb01 0:04 \_ [khugepaged]
root 75 0.0 0.0 0 0 ? S< Feb01 0:00 \_ [crypto]
root 83 0.0 0.0 0 0 ? S< Feb01 0:00 \_ [kthrotld]
root 85 0.0 0.0 0 0 ? S< Feb01 0:00 \_ [kmpath_rdacd]
root 86 0.0 0.0 0 0 ? S< Feb01 0:00 \_ [kaluad]
root 87 0.0 0.0 0 0 ? S< Feb01 0:00 \_ [kpsmoused]
root 89 0.0 0.0 0 0 ? S< Feb01 0:00 \_ [ipv6_addrconf]
root 102 0.0 0.0 0 0 ? S< Feb01 0:00 \_ [deferwq]
root 137 0.0 0.0 0 0 ? S Feb01 0:20 \_ [kauditd]
root 948 0.0 0.0 0 0 ? S Feb01 0:00 \_ [scsi_eh_0]
root 963 0.0 0.0 0 0 ? S< Feb01 0:00 \_ [scsi_tmf_0]
root 1169 0.0 0.0 0 0 ? S< Feb01 0:00 \_ [ata_sff]
root 1866 0.0 0.0 0 0 ? S Feb01 0:00 \_ [scsi_eh_1]
root 1892 0.0 0.0 0 0 ? S< Feb01 0:00 \_ [scsi_tmf_1]
root 1914 0.0 0.0 0 0 ? S Feb01 0:00 \_ [scsi_eh_2]
root 1929 0.0 0.0 0 0 ? S< Feb01 0:00 \_ [scsi_tmf_2]
root 3867 0.0 0.0 0 0 ? S< Feb01 0:00 \_ [ttm_swap]
root 4288 0.0 0.0 0 0 ? S< Feb01 0:00 \_ [kdmflush]
root 4291 0.0 0.0 0 0 ? S< Feb01 0:00 \_ [bioset]
root 4300 0.0 0.0 0 0 ? S< Feb01 0:00 \_ [kdmflush]
root 4303 0.0 0.0 0 0 ? S< Feb01 0:00 \_ [bioset]
root 4324 0.0 0.0 0 0 ? S< Feb01 0:00 \_ [bioset]
root 4327 0.0 0.0 0 0 ? S< Feb01 0:00 \_ [xfsalloc]
root 4329 0.0 0.0 0 0 ? S< Feb01 0:00 \_ [xfs_mru_cache]
root 4335 0.0 0.0 0 0 ? S< Feb01 0:00 \_ [xfs-buf/dm-1]
root 4336 0.0 0.0 0 0 ? S< Feb01 0:00 \_ [xfs-data/dm-1]
root 4337 0.0 0.0 0 0 ? S< Feb01 0:00 \_ [xfs-conv/dm-1]
root 4338 0.0 0.0 0 0 ? S< Feb01 0:00 \_ [xfs-cil/dm-1]
root 4339 0.0 0.0 0 0 ? S< Feb01 0:00 \_ [xfs-reclaim/dm-]
root 4341 0.0 0.0 0 0 ? S< Feb01 0:00 \_ [xfs-log/dm-1]
root 4342 0.0 0.0 0 0 ? S< Feb01 0:00 \_ [xfs-eofblocks/d]
root 4347 0.0 0.0 0 0 ? S Feb01 12:12 \_ [xfsaild/dm-1]
root 4351 0.0 0.0 0 0 ? S< Feb01 0:00 \_ [kworker/1:1H]
root 6280 0.0 0.0 0 0 ? S< Feb01 0:00 \_ [kdmflush]
root 6281 0.0 0.0 0 0 ? S< Feb01 0:00 \_ [kworker/6:1H]
root 6290 0.0 0.0 0 0 ? S< Feb01 0:00 \_ [bioset]
root 6334 0.0 0.0 0 0 ? S< Feb01 0:00 \_ [kvm-irqfd-clean]
root 6782 0.0 0.0 0 0 ? S< Feb01 0:00 \_ [xfs-buf/sda1]
root 6792 0.0 0.0 0 0 ? S< Feb01 0:00 \_ [xfs-data/sda1]
root 6805 0.0 0.0 0 0 ? S< Feb01 0:00 \_ [xfs-conv/sda1]
root 6831 0.0 0.0 0 0 ? S< Feb01 0:00 \_ [xfs-cil/sda1]
root 6832 0.0 0.0 0 0 ? S< Feb01 0:00 \_ [xfs-reclaim/sda]
root 6833 0.0 0.0 0 0 ? S< Feb01 0:00 \_ [xfs-log/sda1]
root 6854 0.0 0.0 0 0 ? S< Feb01 0:00 \_ [xfs-eofblocks/s]
root 6869 0.0 0.0 0 0 ? S Feb01 0:00 \_ [xfsaild/sda1]
root 8447 0.0 0.0 0 0 ? S< Feb01 0:00 \_ [xfs-buf/dm-2]
root 8448 0.0 0.0 0 0 ? S< Feb01 0:00 \_ [xfs-data/dm-2]
root 8449 0.0 0.0 0 0 ? S< Feb01 0:00 \_ [xfs-conv/dm-2]
root 8450 0.0 0.0 0 0 ? S< Feb01 0:00 \_ [xfs-cil/dm-2]
root 8451 0.0 0.0 0 0 ? S< Feb01 0:00 \_ [xfs-reclaim/dm-]
root 8452 0.0 0.0 0 0 ? S< Feb01 0:00 \_ [xfs-log/dm-2]
root 8453 0.0 0.0 0 0 ? S< Feb01 0:00 \_ [xfs-eofblocks/d]
root 8454 0.0 0.0 0 0 ? S Feb01 0:00 \_ [xfsaild/dm-2]
root 8458 0.0 0.0 0 0 ? S< Feb01 0:00 \_ [kworker/5:1H]
root 8569 0.0 0.0 0 0 ? S< Feb01 0:00 \_ [kworker/7:1H]
root 8652 0.0 0.0 0 0 ? S< Feb01 0:03 \_ [kworker/0:1H]
root 8851 0.0 0.0 0 0 ? S< Feb01 0:00 \_ [kworker/2:1H]
root 9574 0.0 0.0 0 0 ? S< Feb01 0:00 \_ [kworker/3:1H]
root 11155 0.0 0.0 0 0 ? S< Feb01 0:00 \_ [kworker/4:1H]
root 32140 0.0 0.0 0 0 ? S 04:10 0:03 \_ [kworker/0:1]
root 340 0.0 0.0 0 0 ? S 04:24 0:00 \_ [kworker/u64:2]
root 4521 0.0 0.0 0 0 ? S 05:54 0:00 \_ [kworker/4:2]
root 7512 0.0 0.0 0 0 ? S 06:58 0:00 \_ [kworker/2:0]
root 7777 0.0 0.0 0 0 ? S 07:04 0:00 \_ [kworker/2:1]
root 8415 0.0 0.0 0 0 ? S 07:18 0:00 \_ [kworker/5:1]
root 8535 0.0 0.0 0 0 ? S 07:22 0:00 \_ [kworker/4:1]
root 8987 0.0 0.0 0 0 ? S 07:30 0:00 \_ [kworker/3:0]
root 9007 0.0 0.0 0 0 ? S 07:32 0:00 \_ [kworker/u64:1]
root 9405 0.0 0.0 0 0 ? S 07:38 0:00 \_ [kworker/3:2]
root 9422 0.0 0.0 0 0 ? S 07:39 0:00 \_ [kworker/1:0]
root 9423 0.0 0.0 0 0 ? S 07:39 0:00 \_ [kworker/1:3]
root 9459 0.0 0.0 0 0 ? S 07:40 0:00 \_ [kworker/6:0]
root 9687 0.0 0.0 0 0 ? S 07:45 0:00 \_ [kworker/1:2]
root 9690 0.0 0.0 0 0 ? S 07:45 0:00 \_ [kworker/6:2]
root 9709 0.0 0.0 0 0 ? S 07:47 0:00 \_ [kworker/7:1]
root 9866 0.0 0.0 0 0 ? S 07:48 0:00 \_ [kworker/5:2]
root 9919 0.0 0.0 0 0 ? S 07:50 0:00 \_ [kworker/6:1]
root 9920 0.0 0.0 0 0 ? S 07:50 0:00 \_ [kworker/0:2]
root 10061 0.0 0.0 0 0 ? S 07:52 0:00 \_ [kworker/7:0]
root 10115 0.0 0.0 0 0 ? S 07:54 0:00 \_ [kworker/2:2]
root 10148 0.0 0.0 0 0 ? S 07:56 0:00 \_ [kworker/0:0]
root 10289 0.0 0.0 0 0 ? S 07:57 0:00 \_ [kworker/1:1]
root 1 0.0 0.0 193908 6996 ? Ss Feb01 8:06 /usr/lib/systemd/systemd --switched-root --system --deserialize 22
root 4420 0.0 0.0 114604 69104 ? Ss Feb01 2:57 /usr/lib/systemd/systemd-journald
root 4447 0.0 0.0 201080 8272 ? Ss Feb01 0:00 /usr/sbin/lvmetad -f
root 4451 0.0 0.0 47944 5356 ? Ss Feb01 0:00 /usr/lib/systemd/systemd-udevd
root 8482 0.0 0.0 62044 1260 ? S<sl Feb01 0:37 /sbin/auditd
dbus 8504 0.0 0.0 66524 2648 ? Ssl Feb01 4:04 /usr/bin/dbus-daemon --system --address=systemd: --nofork --nopidfile --systemd-activation
root 8506 0.0 0.0 21664 1320 ? Ss Feb01 1:53 /usr/sbin/irqbalance --foreground
root 8509 0.0 0.0 26376 1780 ? Ss Feb01 2:07 /usr/lib/systemd/systemd-logind
polkitd 8511 0.0 0.0 612972 13088 ? Ssl Feb01 1:01 /usr/lib/polkit-1/polkitd --no-debug
avahi 8512 0.0 0.0 60184 2204 ? Ss Feb01 0:13 avahi-daemon: running [ndmzlb5.local]
avahi 8514 0.0 0.0 60060 392 ? S Feb01 0:00 \_ avahi-daemon: chroot helper
root 8519 0.0 0.0 126276 1688 ? Ss Feb01 0:08 /usr/sbin/crond -n
root 8560 0.0 0.0 110084 856 tty1 Ss+ Feb01 0:00 /sbin/agetty --noclear tty1 linux
root 8568 0.0 0.0 358632 29420 ? Ssl Feb01 0:23 /usr/bin/python -Es /usr/sbin/firewalld --nofork --nopid
root 8571 0.0 0.0 563576 9524 ? Ssl Feb01 1:06 /usr/sbin/NetworkManager --no-daemon
root 9053 0.0 0.0 573812 19168 ? Ssl Feb01 1:06 /usr/bin/python2 -Es /usr/sbin/tuned -l -P
root 9054 0.0 0.0 112756 4356 ? Ss Feb01 0:17 /usr/sbin/sshd -D
root 10291 2.0 0.0 154564 5672 ? Ss 07:57 0:00 \_ sshd: root@pts/0
root 10294 0.1 0.0 115432 2104 pts/0 Ss 07:57 0:00 \_ -bash
root 10310 0.0 0.0 113172 1392 pts/0 S+ 07:58 0:00 \_ /bin/bash /usr/sbin/collectData
root 10313 0.0 0.0 155488 2012 pts/0 R+ 07:58 0:00 \_ ps axuf
root 9058 0.4 0.0 434312 38068 ? Ssl Feb01 70:41 /usr/sbin/rsyslogd -n
www 9060 0.0 0.0 19856 1420 ? Ss Feb01 0:26 /usr/local/sbin/lighttpd -D -f /etc/lighttpd/lighttpd.conf
root 9512 0.0 0.0 89544 2168 ? Ss Feb01 0:14 /usr/libexec/postfix/master -w
postfix 9550 0.0 0.0 89716 4104 ? S Feb01 0:03 \_ qmgr -l -t unix -u
postfix 7073 0.0 0.0 89648 4080 ? S 06:49 0:00 \_ pickup -l -t unix -u
root 23922 0.0 0.0 82560 9500 ? Ss Feb08 0:00 /usr/local/sbin/haproxy -Ws -f /etc/haproxy/haproxy.cfg -p /run/haproxy.pid
haproxy 23923 147 0.0 618196 67260 ? Rsl Feb08 5832:36 \_ /usr/local/sbin/haproxy -Ws -f /etc/haproxy/haproxy.cfg -p /run/haproxy.pid
23923
UID PID SPID PPID C SZ RSS PSR STIME TTY STAT TIME CMD
haproxy 23923 23923 23922 13 154549 67260 0 Feb08 ? Rsl 522:25 /usr/local/sbin/haproxy -Ws -f /etc/haproxy/haproxy.cfg -p /run/haproxy.pid
haproxy 23923 23924 23922 14 154549 67260 1 Feb08 ? Rsl 586:30 /usr/local/sbin/haproxy -Ws -f /etc/haproxy/haproxy.cfg -p /run/haproxy.pid
haproxy 23923 23925 23922 12 154549 67260 2 Feb08 ? Rsl 501:42 /usr/local/sbin/haproxy -Ws -f /etc/haproxy/haproxy.cfg -p /run/haproxy.pid
haproxy 23923 23926 23922 12 154549 67260 3 Feb08 ? Rsl 503:07 /usr/local/sbin/haproxy -Ws -f /etc/haproxy/haproxy.cfg -p /run/haproxy.pid
haproxy 23923 23927 23922 12 154549 67260 4 Feb08 ? Rsl 505:13 /usr/local/sbin/haproxy -Ws -f /etc/haproxy/haproxy.cfg -p /run/haproxy.pid
haproxy 23923 23928 23922 54 154549 67260 5 Feb08 ? Rsl 2163:31 /usr/local/sbin/haproxy -Ws -f /etc/haproxy/haproxy.cfg -p /run/haproxy.pid
haproxy 23923 23929 23922 13 154549 67260 6 Feb08 ? Rsl 529:43 /usr/local/sbin/haproxy -Ws -f /etc/haproxy/haproxy.cfg -p /run/haproxy.pid
haproxy 23923 23930 23922 13 154549 67260 7 Feb08 ? Rsl 520:21 /usr/local/sbin/haproxy -Ws -f /etc/haproxy/haproxy.cfg -p /run/haproxy.pid
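The per-thread listing above shows one thread (SPID 23928) with roughly four times the accumulated CPU time of its siblings. A quick way to rank threads is to convert the TIME column to seconds and sort; the sketch below runs on a canned excerpt of the listing so the pipeline itself can be checked offline (against a live process one would presumably feed it from something like `ps -L -o spid,time -p <pid>`):

```shell
# Rank threads by accumulated CPU time: convert ps's MM:SS TIME column
# to seconds, then sort descending. Input is an excerpt of the output above.
printf '%s\n' '23923 522:25' '23928 2163:31' '23924 586:30' \
  | awk '{ split($2, t, ":"); print $1, t[1] * 60 + t[2] }' \
  | sort -k2,2nr
# first line printed: 23928 129811  (the hot thread)
```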
The connection table lists quite a lot of connections in state CLOSE-WAIT (I replaced IP addresses with names and removed lines, otherwise the post would have been too long; the table had about 350 connections in state CLOSE-WAIT):
State Recv-Q Send-Q Local Address:Port Peer Address:Port
LISTEN 2 128 *:80 *:* users:(("haproxy",pid=23923,fd=7))
skmem:(r0,rb87380,t0,tb16384,f0,w0,o0,bl0,d1)
LISTEN 129 128 *:8080 *:* users:(("haproxy",pid=23923,fd=5))
skmem:(r0,rb87380,t0,tb16384,f0,w0,o0,bl0,d0)
LISTEN 0 128 *:22 *:* users:(("sshd",pid=9054,fd=3))
skmem:(r0,rb87380,t0,tb16384,f0,w0,o0,bl0,d1)
LISTEN 0 128 127.0.0.1:3000 *:* users:(("lighttpd",pid=9060,fd=3))
skmem:(r0,rb87380,t0,tb16384,f0,w0,o0,bl0,d0)
LISTEN 0 100 127.0.0.1:25 *:* users:(("master",pid=9512,fd=13))
skmem:(r0,rb87380,t0,tb16384,f0,w0,o0,bl0,d0)
LISTEN 129 128 *:443 *:* users:(("haproxy",pid=23923,fd=8)) timer:(keepalive,042ms,0)
skmem:(r0,rb87380,t0,tb16384,f0,w0,o0,bl0,d1591)
SYN-RECV 0 0 ndmzlb5:443 external-gateway1:47189 timer:(on,14sec,4)
SYN-RECV 0 0 ndmzlb5:443 external-gateway2:45381 timer:(on,14sec,4)
SYN-RECV 0 0 ndmzlb5:443 external-gateway2:45261 timer:(on,21sec,5)
SYN-RECV 0 0 ndmzlb5:443 external-gateway1:47074 timer:(on,21sec,5)
CLOSE-WAIT 122 0 ndmzlb5:443 <external-user-ip>:59554 users:(("haproxy",pid=23923,fd=1315))
skmem:(r2304,rb369280,t0,tb87040,f1792,w0,o0,bl0,d1332)
CLOSE-WAIT 1 0 ndmzlb5:58462 backend-server1:30080 users:(("haproxy",pid=23923,fd=1001))
skmem:(r768,rb367360,t0,tb87040,f3328,w0,o0,bl0,d0)
... [ REMOVED LINES DUE TO LENGTH ] ...
CLOSE-WAIT 180 0 ndmzlb5:443 <external-user-ip>:46412 users:(("haproxy",pid=23923,fd=145))
skmem:(r2304,rb369280,t0,tb92160,f1792,w0,o0,bl0,d821)
CLOSE-WAIT 46173 0 ndmzlb5:46730 backend-server2:30080 users:(("haproxy",pid=23923,fd=37))
skmem:(r78336,rb367360,t0,tb87040,f3584,w0,o0,bl0,d1)
CLOSE-WAIT 1 0 ndmzlb5:443 <external-user-ip>:56896 users:(("haproxy",pid=23923,fd=413))
skmem:(r768,rb369280,t0,tb46080,f3328,w0,o0,bl0,d824)
CLOSE-WAIT 135 0 ndmzlb5:8080 monitoring-system:49698
skmem:(r2304,rb369280,t0,tb87040,f1792,w0,o0,bl0,d0)
CLOSE-WAIT 1859 0 ndmzlb5:443 <external-user-ip>:53581 users:(("haproxy",pid=23923,fd=896))
skmem:(r4608,rb369280,t0,tb165888,f3584,w0,o0,bl0,d821)
CLOSE-WAIT 518 0 ndmzlb5:443 external-gateway1:49977
skmem:(r2304,rb369280,t0,tb87040,f1792,w0,o0,bl0,d1591)
CLOSE-WAIT 396 0 ndmzlb5:443 <external-user-ip>:33194 users:(("haproxy",pid=23923,fd=680))
skmem:(r2304,rb369280,t0,tb46080,f1792,w0,o0,bl0,d822)
CLOSE-WAIT 1 0 ndmzlb5:443 <external-user-ip>:37770 users:(("haproxy",pid=23923,fd=35))
skmem:(r768,rb369280,t0,tb46080,f3328,w0,o0,bl0,d1591)
CLOSE-WAIT 761 0 ndmzlb5:443 <external-user-ip>:52049 users:(("haproxy",pid=23923,fd=131))
skmem:(r4608,rb369280,t0,tb50688,f3584,w0,o0,bl0,d823)
CLOSE-WAIT 1 0 ndmzlb5:52984 backend-server3:30080 users:(("haproxy",pid=23923,fd=285))
skmem:(r768,rb367360,t0,tb87040,f3328,w0,o0,bl0,d0)
CLOSE-WAIT 518 0 ndmzlb5:443 external-gateway1:51424
skmem:(r2304,rb369280,t0,tb87040,f1792,w0,o0,bl0,d1591)
CLOSE-WAIT 135 0 ndmzlb5:8080 monitoring-system:33088
skmem:(r2304,rb369280,t0,tb87040,f1792,w0,o0,bl0,d0)
CLOSE-WAIT 1 0 ndmzlb5:37162 backend-server3:30080 users:(("haproxy",pid=23923,fd=60))
skmem:(r768,rb367360,t0,tb87040,f3328,w0,o0,bl0,d0)
CLOSE-WAIT 135 0 ndmzlb5:8080 monitoring-system:42794
skmem:(r2304,rb369280,t0,tb87040,f1792,w0,o0,bl0,d0)
CLOSE-WAIT 413 0 ndmzlb5:443 <external-user-ip>:58701 users:(("haproxy",pid=23923,fd=1042))
skmem:(r2304,rb369280,t0,tb50688,f1792,w0,o0,bl0,d822)
CLOSE-WAIT 146 0 ndmzlb5:8080 monitoring-system:39492
skmem:(r2304,rb369280,t0,tb87040,f1792,w0,o0,bl0,d0)
CLOSE-WAIT 518 0 ndmzlb5:443 external-gateway2:49168
skmem:(r2304,rb369280,t0,tb87040,f1792,w0,o0,bl0,d1591)
CLOSE-WAIT 135 0 ndmzlb5:8080 monitoring-system:35044
skmem:(r2304,rb369280,t0,tb87040,f1792,w0,o0,bl0,d0)
CLOSE-WAIT 518 0 ndmzlb5:443 external-gateway2:50850
skmem:(r2304,rb369280,t0,tb87040,f1792,w0,o0,bl0,d1591)
CLOSE-WAIT 135 0 ndmzlb5:8080 monitoring-system:60970
skmem:(r2304,rb369280,t0,tb87040,f1792,w0,o0,bl0,d0)
CLOSE-WAIT 1 0 ndmzlb5:37186 backend-server3:30080 users:(("haproxy",pid=23923,fd=43))
skmem:(r768,rb367360,t0,tb87040,f3328,w0,o0,bl0,d0)
CLOSE-WAIT 518 0 ndmzlb5:443 external-gateway1:51177
skmem:(r2304,rb369280,t0,tb87040,f1792,w0,o0,bl0,d1591)
LISTEN 0 128 :::22 :::* users:(("sshd",pid=9054,fd=4))
skmem:(r0,rb87380,t0,tb16384,f0,w0,o0,bl0,d0)
LISTEN 0 100 ::1:25 :::* users:(("master",pid=9512,fd=14))
skmem:(r0,rb87380,t0,tb16384,f0,w0,o0,bl0,d0)
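For what it's worth, the pile-up is easy to quantify by counting sockets per state; on the live box this would be fed from `ss -ant` (a sketch, demonstrated on canned lines so the pipeline itself can be verified):

```shell
# Count sockets per TCP state. With `ss -ant` the first column is the
# state and the first line is the header, which NR > 1 skips.
printf '%s\n' 'State Recv-Q' 'CLOSE-WAIT 122' 'CLOSE-WAIT 1' 'LISTEN 0' \
  | awk 'NR > 1 { count[$1]++ } END { for (s in count) print s, count[s] }' \
  | sort
# prints:
# CLOSE-WAIT 2
# LISTEN 1
```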
I attached strace to the haproxy process for about a minute, which produced a file 171 MB in size, but it contains only repetitive entries like the following until the restart of the HAProxy process:
23929 07:59:03.354485 <... sched_yield resumed> ) = 0
23923 07:59:03.358101 sched_yield( <unfinished ...>
23930 07:59:03.358128 sched_yield( <unfinished ...>
23929 07:59:03.358144 <... sched_yield resumed> ) = 0
23925 07:59:03.358172 sched_yield( <unfinished ...>
23924 07:59:03.358194 <... sched_yield resumed> ) = 0
23923 07:59:03.358213 <... sched_yield resumed> ) = 0
23930 07:59:03.358250 <... sched_yield resumed> ) = 0
23929 07:59:03.358271 sched_yield( <unfinished ...>
23924 07:59:03.358300 sched_yield( <unfinished ...>
23923 07:59:03.358322 sched_yield( <unfinished ...>
23929 07:59:03.358362 <... sched_yield resumed> ) = 0
23927 07:59:03.358392 sched_yield( <unfinished ...>
23926 07:59:03.358415 sched_yield( <unfinished ...>
23924 07:59:03.358433 <... sched_yield resumed> ) = 0
23923 07:59:03.358453 <... sched_yield resumed> ) = 0
23930 07:59:03.358498 sched_yield( <unfinished ...>
23929 07:59:03.358518 sched_yield( <unfinished ...>
23927 07:59:03.358546 <... sched_yield resumed> ) = 0
23926 07:59:03.358570 <... sched_yield resumed> ) = 0
23924 07:59:03.358589 sched_yield( <unfinished ...>
23923 07:59:03.358608 --- SIGTERM {si_signo=SIGTERM, si_code=SI_USER, si_pid=23922, si_uid=0} ---
23930 07:59:03.364157 <... sched_yield resumed> ) = ? <unavailable>
23929 07:59:03.364227 +++ killed by SIGTERM +++
23930 07:59:03.364239 +++ killed by SIGTERM +++
23928 07:59:03.364247 +++ killed by SIGTERM +++
23927 07:59:03.364256 +++ killed by SIGTERM +++
23925 07:59:03.364264 +++ killed by SIGTERM +++
23924 07:59:03.364273 +++ killed by SIGTERM +++
23926 07:59:03.371111 +++ killed by SIGTERM +++
23923 07:59:03.371155 +++ killed by SIGTERM +++
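The trace really is dominated by that one syscall; a simple occurrence count over the raw log confirms it (`haproxy.strace` is a placeholder for the capture file; shown here on canned lines so the pipeline is checkable):

```shell
# Count sched_yield occurrences in an strace log. With a real capture:
#   grep -c 'sched_yield' haproxy.strace
printf '%s\n' \
  '23929 07:59:03.354485 <... sched_yield resumed> ) = 0' \
  '23923 07:59:03.358101 sched_yield( <unfinished ...>' \
  '23930 07:59:03.358128 sched_yield( <unfinished ...>' \
  | grep -c 'sched_yield'
# prints: 3
```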
I also checked our backend systems and found that one of them logged a reset of its network interface a couple of seconds before HAProxy stopped responding:
Feb 10 23:53:02 backend-server3 kernel: e1000 0000:02:02.0 eno16780032: Reset adapter
Feb 10 23:53:02 backend-server3 kernel: e1000: eno16780032 NIC Link is Up 1000 Mbps Full Duplex, Flow Control: None
Feb 10 23:53:02 backend-server3 NetworkManager[864]: <info> [1549838222.3110] device (eno16780032): link connected
On Friday there was no reset of a network interface on any of the backend hosts or on the HAProxy machine, yet the CPU utilization never dropped below ~ 19%. I think the interface reset on Saturday is linked to the hang of HAProxy on ndmzlb5; maybe the process got stuck in a loop and can't return from it.
Any idea what I can do next or what causes this behaviour?