Haproxy slowing down after several days of uptime

Summary of issue: After several days of uptime (roughly 5), the haproxy service starts to slow down in ways that are not easily detected. By that I mean the system load average is typical, system memory has over 1G free (which is probably a bad sign, actually), and tcp_mem and related buffers all have space available.

The service that haproxy fronts keeps track of how many bytes are transferred per day. A crash in this number is the only obvious symptom. As soon as haproxy is reloaded, the numbers skyrocket back up. We haven’t identified any metrics (we run the whole munin-node suite plus some additional checks, such as TCP retransmits) that would indicate a problem with the server or a race condition. Nonetheless, the event repeats and is resolved by a reload or restart.

What the log shows: the number of failed back-end health checks jumps dramatically (due to timeouts) during the issue, but there are no issues with the backends themselves, and reloading haproxy resolves it.

root@:/tmp# grep -ic warning haproxy.log.2
825
root@:/tmp# grep -ic warning haproxy.log.1
19
root@:/tmp# grep -ic warning haproxy.log
28
root@:/tmp# grep -ic warning haproxy.log.3
14
root@:/tmp# grep -ic warning haproxy.log.4
22
Mar  6 01:55:47  haproxy[30453]: [WARNING] 064/015547 (30579) : Health check for server nodes/www3.xxx.com failed, reason: Layer7 timeout, check duration: 2000ms, status: 2/3 UP.
Mar  6 01:55:47  haproxy[30453]: [WARNING] 064/015547 (30579) : Health check for server nodes/www4.xxx.com failed, reason: Layer7 timeout, check duration: 2000ms, status: 2/3 UP.
Mar  6 01:55:47  haproxy[30453]: [WARNING] 064/015547 (30579) : Health check for server nodes/www6.xxx.com failed, reason: Layer7 timeout, check duration: 2000ms, status: 2/3 UP.
Mar  6 01:55:47  haproxy[30453]: [WARNING] 064/015547 (30579) : Health check for server nodes/www5.xxx.com failed, reason: Layer7 timeout, check duration: 2001ms, status: 2/3 UP.
Mar  6 01:55:49  haproxy[30453]: [WARNING] 064/015549 (30579) : Health check for server nodes/www3.xxx.com succeeded, reason: Layer7 check passed, code: 200, check duration: 43ms, status: 3/3 UP.
Mar  6 01:55:49  haproxy[30453]: [WARNING] 064/015549 (30579) : Health check for server nodes/www4.xxx.com succeeded, reason: Layer7 check passed, code: 200, check duration: 89ms, status: 3/3 UP.
Mar  6 01:55:49  haproxy[30453]: [WARNING] 064/015549 (30579) : Health check for server nodes/www6.xxx.com succeeded, reason: Layer7 check passed, code: 200, check duration: 495ms, status: 3/3 UP.
Mar  6 01:55:51  haproxy[30453]: [WARNING] 064/015551 (30579) : Health check for server nodes/www5.xxx.com failed, reason: Layer7 timeout, check duration: 2000ms, status: 1/3 UP.
Mar  6 01:55:53  haproxy[30453]: [WARNING] 064/015553 (30579) : Health check for server nodes/www3.xxx.com failed, reason: Layer7 timeout, check duration: 2002ms, status: 2/3 UP.
Mar  6 01:55:53  haproxy[30453]: [WARNING] 064/015553 (30579) : Health check for server nodes/www4.xxx.com failed, reason: Layer7 timeout, check duration: 2001ms, status: 2/3 UP.
Mar  6 01:55:53  haproxy[30453]: [WARNING] 064/015553 (30579) : Health check for server nodes/www6.xxx.com failed, reason: Layer7 timeout, check duration: 2000ms, status: 2/3 UP.
Mar  6 01:55:53  haproxy[30453]: [WARNING] 064/015553 (30579) : Health check for server nodes/www5.xxx.com succeeded, reason: Layer7 check passed, code: 200, check duration: 177ms, status: 3/3 UP.
Mar  6 01:55:55  haproxy[30453]: [WARNING] 064/015555 (30579) : Health check for server nodes/www3.xxx.com succeeded, reason: Layer7 check passed, code: 200, check duration: 132ms, status: 3/3 UP.
Mar  6 01:55:55  haproxy[30453]: [WARNING] 064/015555 (30579) : Health check for server nodes/www4.xxx.com succeeded, reason: Layer7 check passed, code: 200, check duration: 395ms, status: 3/3 UP.
Mar  6 01:55:57  haproxy[30453]: [WARNING] 064/015557 (30579) : Health check for server nodes/www6.xxx.com failed, reason: Layer7 timeout, check duration: 2001ms, status: 1/3 UP.
Mar  6 01:55:57  haproxy[30453]: [WARNING] 064/015557 (30579) : Health check for server nodes/www5.xxx.com failed, reason: Layer7 timeout, check duration: 2000ms, status: 2/3 UP.
Mar  6 01:55:59  haproxy[30453]: [WARNING] 064/015559 (30579) : Health check for server nodes/www3.xxx.com failed, reason: Layer7 timeout, check duration: 2000ms, status: 2/3 UP.
Mar  6 01:55:59  haproxy[30453]: [WARNING] 064/015559 (30579) : Health check for server nodes/www4.xxx.com failed, reason: Layer7 timeout, check duration: 2001ms, status: 2/3 UP.
Mar  6 01:56:00  haproxy[30453]: [WARNING] 064/015600 (30579) : Health check for server nodes/www5.xxx.com succeeded, reason: Layer7 check passed, code: 200, check duration: 632ms, status: 3/3 UP.
Mar  6 01:56:00  haproxy[30453]: [WARNING] 064/015600 (30579) : Health check for server nodes/www6.xxx.com succeeded, reason: Layer7 check passed, code: 200, check duration: 915ms, status: 3/3 UP.
Mar  6 01:56:01  haproxy[30453]: [WARNING] 064/015601 (30579) : Health check for server nodes/www4.xxx.com succeeded, reason: Layer7 check passed, code: 200, check duration: 308ms, status: 3/3 UP.
Mar  6 01:56:01  haproxy[30453]: [WARNING] 064/015601 (30579) : Health check for server nodes/www3.xxx.com succeeded, reason: Layer7 check passed, code: 200, check duration: 753ms, status: 3/3 UP.
Mar  6 01:56:05  haproxy[30453]: [WARNING] 064/015605 (30579) : Health check for server nodes/www4.xxx.com failed, reason: Layer7 timeout, check duration: 2001ms, status: 2/3 UP.
Mar  6 01:56:05  haproxy[30453]: [WARNING] 064/015605 (30579) : Health check for server nodes/www3.xxx.com failed, reason: Layer7 timeout, check duration: 2000ms, status: 2/3 UP.
Mar  6 01:56:06  haproxy[30453]: [WARNING] 064/015606 (30579) : Health check for server nodes/www5.xxx.com failed, reason: Layer7 timeout, check duration: 2000ms, status: 2/3 UP.
Mar  6 01:56:06  haproxy[30453]: [WARNING] 064/015606 (30579) : Health check for server nodes/www6.xxx.com failed, reason: Layer7 timeout, check duration: 2001ms, status: 2/3 UP.
Mar  6 01:56:07  haproxy[30453]: [WARNING] 064/015607 (30579) : Health check for server nodes/www4.xxx.com succeeded, reason: Layer7 check passed, code: 200, check duration: 121ms, status: 3/3 UP.
Mar  6 01:56:08  haproxy[30453]: [WARNING] 064/015608 (30579) : Health check for server nodes/www6.xxx.com succeeded, reason: Layer7 check passed, code: 200, check duration: 96ms, status: 3/3 UP.
Mar  6 01:56:09  haproxy[30453]: [WARNING] 064/015609 (30579) : Health check for server nodes/www5.xxx.com succeeded, reason: Layer7 check passed, code: 200, check duration: 401ms, status: 3/3 UP.
Mar  6 01:56:09  haproxy[30453]: [WARNING] 064/015609 (30579) : Health check for server nodes/www3.xxx.com failed, reason: Layer7 timeout, check duration: 2000ms, status: 1/3 UP.
Mar  6 01:56:12  haproxy[30453]: [WARNING] 064/015612 (30579) : Health check for server nodes/www3.xxx.com succeeded, reason: Layer7 check passed, code: 200, check duration: 60ms, status: 3/3 UP.
Mar  6 01:56:16  haproxy[30453]: [WARNING] 064/015616 (30579) : Health check for server nodes/www4.xxx.com failed, reason: Layer7 timeout, check duration: 2001ms, status: 2/3 UP.
Mar  6 01:56:20  haproxy[30453]: [WARNING] 064/015620 (30579) : Health check for server nodes/www4.xxx.com failed, reason: Layer7 timeout, check duration: 2000ms, status: 1/3 UP.
Mar  6 01:56:22  haproxy[30453]: [WARNING] 064/015622 (30579) : Health check for server nodes/www4.xxx.com succeeded, reason: Layer7 check passed, code: 200, check duration: 56ms, status: 3/3 UP.
Mar  6 01:56:23  haproxy[30453]: [WARNING] 064/015623 (30579) : Health check for server nodes/www3.xxx.com failed, reason: Layer7 timeout, check duration: 2000ms, status: 2/3 UP.
Mar  6 01:56:25  haproxy[30453]: [WARNING] 064/015625 (30579) : Health check for server nodes/www3.xxx.com succeeded, reason: Layer7 check passed, code: 200, check duration: 541ms, status: 3/3 UP.
Mar  6 01:56:27  haproxy[30453]: [WARNING] 064/015627 (30579) : Health check for server nodes/www5.xxx.com failed, reason: Layer7 timeout, check duration: 2001ms, status: 2/3 UP.
Mar  6 01:56:29  haproxy[30453]: [WARNING] 064/015629 (30579) : Health check for server nodes/www4.xxx.com failed, reason: Layer7 timeout, check duration: 2000ms, status: 2/3 UP.
Mar  6 01:56:31  haproxy[30453]: [WARNING] 064/015631 (30579) : Health check for server nodes/www5.xxx.com succeeded, reason: Layer7 check passed, code: 200, check duration: 1368ms, status: 3/3 UP.
Mar  6 01:56:31  haproxy[30453]: [WARNING] 064/015631 (30579) : Health check for server nodes/www4.xxx.com succeeded, reason: Layer7 check passed, code: 200, check duration: 48ms, status: 3/3 UP.
Mar  6 01:56:49  haproxy[30453]: [WARNING] 064/015649 (30579) : Health check for server nodes/www4.xxx.com failed, reason: Layer7 timeout, check duration: 2000ms, status: 2/3 UP.
Mar  6 01:56:53  haproxy[30453]: [WARNING] 064/015653 (30579) : Health check for server nodes/www4.xxx.com failed, reason: Layer7 timeout, check duration: 2000ms, status: 1/3 UP.
Mar  6 01:56:54  haproxy[30453]: [WARNING] 064/015654 (30579) : Health check for server nodes/www3.xxx.com failed, reason: Layer7 timeout, check duration: 2001ms, status: 2/3 UP.
Mar  6 01:56:54  haproxy[30453]: [WARNING] 064/015654 (30579) : Health check for server nodes/www6.xxx.com failed, reason: Layer7 timeout, check duration: 2001ms, status: 2/3 UP.
Mar  6 01:56:55  haproxy[30453]: [WARNING] 064/015655 (30579) : Health check for server nodes/www4.xxx.com succeeded, reason: Layer7 check passed, code: 200, check duration: 222ms, status: 3/3 UP.
Mar  6 01:56:57  haproxy[30453]: [WARNING] 064/015657 (30579) : Health check for server nodes/www3.xxx.com succeeded, reason: Layer7 check passed, code: 200, check duration: 298ms, status: 3/3 UP.
Mar  6 01:56:57  haproxy[30453]: [WARNING] 064/015657 (30579) : Health check for server nodes/www6.xxx.com succeeded, reason: Layer7 check passed, code: 200, check duration: 308ms, status: 3/3 UP.
Mar  6 01:57:04  haproxy[30453]: [WARNING] 064/015704 (30579) : Health check for server nodes/www4.xxx.com failed, reason: Layer7 timeout, check duration: 2000ms, status: 2/3 UP.
Mar  6 01:57:05  haproxy[30453]: [WARNING] 064/015705 (30579) : Health check for server nodes/www3.xxx.com failed, reason: Layer7 timeout, check duration: 2000ms, status: 2/3 UP.
Mar  6 01:57:05  haproxy[30453]: [WARNING] 064/015705 (30579) : Health check for server nodes/www6.xxx.com failed, reason: Layer7 timeout, check duration: 2019ms, status: 2/3 UP.
Mar  6 01:57:07  haproxy[30453]: [WARNING] 064/015707 (30579) : Health check for server nodes/www3.xxx.com succeeded, reason: Layer7 check passed, code: 200, check duration: 45ms, status: 3/3 UP.
Mar  6 01:57:07  haproxy[30453]: [WARNING] 064/015707 (30579) : Health check for server nodes/www6.xxx.com succeeded, reason: Layer7 check passed, code: 200, check duration: 420ms, status: 3/3 UP.
Mar  6 01:57:08  haproxy[30453]: [WARNING] 064/015708 (30579) : Health check for server nodes/www4.xxx.com failed, reason: Layer7 timeout, check duration: 2000ms, status: 1/3 UP.
Mar  6 01:57:10  haproxy[30453]: [WARNING] 064/015710 (30579) : Health check for server nodes/www4.xxx.com succeeded, reason: Layer7 check passed, code: 200, check duration: 95ms, status: 3/3 UP.
Mar  6 01:57:13  haproxy[30453]: [WARNING] 064/015713 (30579) : Health check for server nodes/www3.xxx.com failed, reason: Layer7 timeout, check duration: 2000ms, status: 2/3 UP.
Mar  6 01:57:13  haproxy[30453]: [WARNING] 064/015713 (30579) : Health check for server nodes/www5.xxx.com failed, reason: Layer7 timeout, check duration: 2000ms, status: 2/3 UP.
Mar  6 01:57:13  haproxy[30453]: [WARNING] 064/015713 (30579) : Health check for server nodes/www6.xxx.com failed, reason: Layer7 timeout, check duration: 2000ms, status: 2/3 UP.
Mar  6 01:57:14  haproxy[30453]: [WARNING] 064/015714 (30579) : Health check for server nodes/www4.xxx.com failed, reason: Layer7 timeout, check duration: 2000ms, status: 2/3 UP.
Mar  6 01:57:15  haproxy[30453]: [WARNING] 064/015715 (30579) : Health check for server nodes/www5.xxx.com succeeded, reason: Layer7 check passed, code: 200, check duration: 45ms, status: 3/3 UP.
Mar  6 01:57:15  haproxy[30453]: [WARNING] 064/015715 (30579) : Health check for server nodes/www3.xxx.com succeeded, reason: Layer7 check passed, code: 200, check duration: 59ms, status: 3/3 UP.
Mar  6 01:57:15  haproxy[30453]: [WARNING] 064/015715 (30579) : Health check for server nodes/www6.xxx.com succeeded, reason: Layer7 check passed, code: 200, check duration: 62ms, status: 3/3 UP.
Mar  6 01:57:17  haproxy[30453]: [WARNING] 064/015717 (30579) : Health check for server nodes/www4.xxx.com succeeded, reason: Layer7 check passed, code: 200, check duration: 382ms, status: 3/3 UP.
Mar  6 01:57:23  haproxy[30453]: [WARNING] 064/015723 (30579) : Health check for server nodes/www5.xxx.com failed, reason: Layer7 timeout, check duration: 2000ms, status: 2/3 UP.
Mar  6 01:57:23  haproxy[30453]: [WARNING] 064/015723 (30579) : Health check for server nodes/www3.xxx.com failed, reason: Layer7 timeout, check duration: 2000ms, status: 2/3 UP.
Mar  6 01:57:23  haproxy[30453]: [WARNING] 064/015723 (30579) : Health check for server nodes/www6.xxx.com failed, reason: Layer7 timeout, check duration: 2000ms, status: 2/3 UP.
Mar  6 01:57:23  haproxy[30453]: [WARNING] 064/015723 (30579) : Health check for server nodes/www4.xxx.com failed, reason: Layer7 timeout, check duration: 2001ms, status: 2/3 UP.
Mar  6 01:57:27  haproxy[30453]: [WARNING] 064/015727 (30579) : Health check for server nodes/www5.xxx.com failed, reason: Layer7 timeout, check duration: 2000ms, status: 1/3 UP.
Mar  6 01:57:27  haproxy[30453]: [WARNING] 064/015727 (30579) : Health check for server nodes/www3.xxx.com failed, reason: Layer7 timeout, check duration: 2000ms, status: 1/3 UP.
Mar  6 01:57:27  haproxy[30453]: [WARNING] 064/015727 (30579) : Health check for server nodes/www6.xxx.com failed, reason: Layer7 timeout, check duration: 2000ms, status: 1/3 UP.
Mar  6 01:57:27  haproxy[30453]: [WARNING] 064/015727 (30579) : Health check for server nodes/www4.xxx.com failed, reason: Layer7 timeout, check duration: 2000ms, status: 1/3 UP.
Mar  6 01:57:29  haproxy[30453]: [WARNING] 064/015729 (30579) : Health check for server nodes/www5.xxx.com succeeded, reason: Layer7 check passed, code: 200, check duration: 214ms, status: 3/3 UP.
Mar  6 01:57:29  haproxy[30453]: [WARNING] 064/015729 (30579) : Health check for server nodes/www4.xxx.com succeeded, reason: Layer7 check passed, code: 200, check duration: 196ms, status: 3/3 UP.
Mar  6 01:57:29  haproxy[30453]: [WARNING] 064/015729 (30579) : Health check for server nodes/www6.xxx.com succeeded, reason: Layer7 check passed, code: 200, check duration: 425ms, status: 3/3 UP.
Mar  6 01:57:29  haproxy[30453]: [WARNING] 064/015729 (30579) : Health check for server nodes/www3.xxx.com succeeded, reason: Layer7 check passed, code: 200, check duration: 591ms, status: 3/3 UP.
Mar  6 01:57:40  haproxy[30453]: [WARNING] 064/015740 (30579) : Health check for server nodes/www4.xxx.com failed, reason: Layer7 timeout, check duration: 2000ms, status: 2/3 UP.
Mar  6 01:57:42  haproxy[30453]: [WARNING] 064/015742 (30579) : Health check for server nodes/www4.xxx.com succeeded, reason: Layer7 check passed, code: 200, check duration: 50ms, status: 3/3 UP.
Mar  6 01:57:46  haproxy[30453]: [WARNING] 064/015746 (30579) : Health check for server nodes/www3.xxx.com failed, reason: Layer7 timeout, check duration: 2000ms, status: 2/3 UP.
Mar  6 01:57:50  haproxy[30453]: [WARNING] 064/015750 (30579) : Health check for server nodes/www3.xxx.com failed, reason: Layer7 timeout, check duration: 2000ms, status: 1/3 UP.
Mar  6 01:57:52  haproxy[30453]: [WARNING] 064/015752 (30579) : Health check for server nodes/www3.xxx.com succeeded, reason: Layer7 check passed, code: 200, check duration: 75ms, status: 3/3 UP.
Mar  6 01:58:00  haproxy[30453]: [WARNING] 064/015800 (30579) : Health check for server nodes/www3.xxx.com failed, reason: Layer7 timeout, check duration: 2000ms, status: 2/3 UP.
Mar  6 01:58:01  haproxy[30453]: [WARNING] 064/015801 (30579) : Health check for server nodes/www6.xxx.com failed, reason: Layer7 timeout, check duration: 2004ms, status: 2/3 UP.
Mar  6 01:58:02  haproxy[30453]: [WARNING] 064/015802 (30579) : Health check for server nodes/www4.xxx.com failed, reason: Layer7 timeout, check duration: 2001ms, status: 2/3 UP.
Mar  6 01:58:02  haproxy[30453]: [WARNING] 064/015802 (30579) : Health check for server nodes/www5.xxx.com failed, reason: Layer7 timeout, check duration: 2001ms, status: 2/3 UP.
Mar  6 01:58:02  haproxy[30453]: [WARNING] 064/015802 (30579) : Health check for server nodes/www3.xxx.com succeeded, reason: Layer7 check passed, code: 200, check duration: 70ms, status: 3/3 UP.
Mar  6 01:58:03  haproxy[30453]: [WARNING] 064/015803 (30579) : Health check for server nodes/www6.xxx.com succeeded, reason: Layer7 check passed, code: 200, check duration: 85ms, status: 3/3 UP.
Mar  6 01:58:04  haproxy[30453]: [WARNING] 064/015804 (30579) : Health check for server nodes/www4.xxx.com succeeded, reason: Layer7 check passed, code: 200, check duration: 74ms, status: 3/3 UP.
Mar  6 01:58:04  haproxy[30453]: [WARNING] 064/015804 (30579) : Health check for server nodes/www5.xxx.com succeeded, reason: Layer7 check passed, code: 200, check duration: 80ms, status: 3/3 UP.
Mar  6 01:58:09  haproxy[30453]: [WARNING] 064/015809 (30579) : Health check for server nodes/www3.xxx.com failed, reason: Layer7 timeout, check duration: 2000ms, status: 2/3 UP.
Mar  6 01:58:09  haproxy[30453]: [WARNING] 064/015809 (30579) : Health check for server nodes/www6.xxx.com failed, reason: Layer7 timeout, check duration: 2000ms, status: 2/3 UP.
Mar  6 01:58:10  haproxy[30453]: [WARNING] 064/015810 (30579) : Health check for server nodes/www4.xxx.com failed, reason: Layer7 timeout, check duration: 2001ms, status: 2/3 UP.
Mar  6 01:58:10  haproxy[30453]: [WARNING] 064/015810 (30579) : Health check for server nodes/www5.xxx.com failed, reason: Layer7 timeout, check duration: 2000ms, status: 2/3 UP.
Mar  6 01:58:13  haproxy[30453]: [WARNING] 064/015813 (30579) : Health check for server nodes/www3.xxx.com failed, reason: Layer7 timeout, check duration: 2001ms, status: 1/3 UP.
Mar  6 01:58:13  haproxy[30453]: [WARNING] 064/015813 (30579) : Health check for server nodes/www6.xxx.com failed, reason: Layer7 timeout, check duration: 2001ms, status: 1/3 UP.
Mar  6 01:58:14  haproxy[30453]: [WARNING] 064/015814 (30579) : Health check for server nodes/www4.xxx.com failed, reason: Layer7 timeout, check duration: 2000ms, status: 1/3 UP.
Mar  6 01:58:14  haproxy[30453]: [WARNING] 064/015814 (30579) : Health check for server nodes/www5.xxx.com failed, reason: Layer7 timeout, check duration: 2000ms, status: 1/3 UP.
Mar  6 01:58:15  haproxy[30453]: [WARNING] 064/015815 (30579) : Health check for server nodes/www3.xxx.com succeeded, reason: Layer7 check passed, code: 200, check duration: 68ms, status: 3/3 UP.
Mar  6 01:58:15  haproxy[30453]: [WARNING] 064/015815 (30579) : Health check for server nodes/www6.xxx.com succeeded, reason: Layer7 check passed, code: 200, check duration: 121ms, status: 3/3 UP.
Mar  6 01:58:18  haproxy[30453]: [WARNING] 064/015818 (30579) : Health check for server nodes/www4.xxx.com failed, reason: Layer7 timeout, check duration: 2000ms, status: 0/2 DOWN.
Mar  6 01:58:18  haproxy[30453]: [WARNING] 064/015818 (30579) : Server nodes/www4.xxx.com is DOWN. 3 active and 0 backup servers left. 1 sessions active, 0 requeued, 0 remaining in queue.
Mar  6 01:58:18  haproxy[30453]: [WARNING] 064/015818 (30579) : Health check for server nodes/www5.xxx.com failed, reason: Layer7 timeout, check duration: 2000ms, status: 0/2 DOWN.
Mar  6 01:58:18  haproxy[30453]: [WARNING] 064/015818 (30579) : Server nodes/www5.xxx.com is DOWN. 2 active and 0 backup servers left. 6 sessions active, 0 requeued, 0 remaining in queue.
Mar  6 01:58:21  haproxy[30453]: [WARNING] 064/015821 (30579) : Health check for server nodes/www4.xxx.com succeeded, reason: Layer7 check passed, code: 200, check duration: 1444ms, status: 1/2 DOWN.
Mar  6 01:58:22  haproxy[30453]: [WARNING] 064/015822 (30579) : Health check for server nodes/www6.xxx.com failed, reason: Layer7 timeout, check duration: 2001ms, status: 2/3 UP.
Mar  6 01:58:22  haproxy[30453]: [WARNING] 064/015822 (30579) : Health check for server nodes/www3.xxx.com failed, reason: Layer7 timeout, check duration: 2000ms, status: 2/3 UP.
Mar  6 01:58:25  haproxy[30453]: [WARNING] 064/015825 (30579) : Health check for server nodes/www4.xxx.com failed, reason: Layer7 timeout, check duration: 2000ms, status: 0/2 DOWN.
Mar  6 01:58:26  haproxy[30453]: [WARNING] 064/015826 (30579) : Health check for server nodes/www6.xxx.com succeeded, reason: Layer7 check passed, code: 200, check duration: 1864ms, status: 3/3 UP.
Mar  6 01:58:26  haproxy[30453]: [WARNING] 064/015826 (30579) : Health check for server nodes/www5.xxx.com succeeded, reason: Layer7 check passed, code: 200, check duration: 1941ms, status: 1/2 DOWN.
Mar  6 01:58:26  haproxy[30453]: [WARNING] 064/015826 (30579) : Health check for server nodes/www3.xxx.com succeeded, reason: Layer7 check passed, code: 200, check duration: 1950ms, status: 3/3 UP.
Mar  6 01:58:27  haproxy[30453]: [WARNING] 064/015827 (30579) : Health check for server nodes/www4.xxx.com succeeded, reason: Layer7 check passed, code: 200, check duration: 55ms, status: 1/2 DOWN.
Mar  6 01:58:29  haproxy[30453]: [WARNING] 064/015829 (30579) : Health check for server nodes/www5.xxx.com succeeded, reason: Layer7 check passed, code: 200, check duration: 848ms, status: 3/3 UP.
Mar  6 01:58:29  haproxy[30453]: [WARNING] 064/015829 (30579) : Server nodes/www5.xxx.com is UP. 3 active and 0 backup servers online. 0 sessions requeued, 0 total in queue.
Mar  6 01:58:29  haproxy[30453]: [WARNING] 064/015829 (30579) : Health check for server nodes/www4.xxx.com succeeded, reason: Layer7 check passed, code: 200, check duration: 167ms, status: 3/3 UP.
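
For anyone wanting more than a raw warning count, here is a minimal Python sketch that tallies failed and succeeded checks per server. The regular expression is modeled only on the health-check lines in the excerpt above and is an assumption; other HAProxy versions or syslog formats may need a different pattern. The function name summarize_checks is made up for illustration.

```python
import re
from collections import Counter

# Pattern modeled on the health-check lines in the excerpt above (assumption:
# other HAProxy versions or log formats may word these lines differently).
CHECK_RE = re.compile(
    r"Health check for server (?P<server>\S+) (?P<result>failed|succeeded), "
    r"reason: (?P<reason>[^,]+), .*?check duration: (?P<ms>\d+)ms"
)

def summarize_checks(lines):
    """Count failed and succeeded health checks per server."""
    failed, succeeded = Counter(), Counter()
    for line in lines:
        m = CHECK_RE.search(line)
        if not m:
            continue  # skips lines such as "Server ... is DOWN"
        bucket = failed if m.group("result") == "failed" else succeeded
        bucket[m.group("server")] += 1
    return failed, succeeded

# A few lines copied from the log excerpt above.
sample = [
    "Mar  6 01:55:47  haproxy[30453]: [WARNING] 064/015547 (30579) : Health check for server nodes/www3.xxx.com failed, reason: Layer7 timeout, check duration: 2000ms, status: 2/3 UP.",
    "Mar  6 01:55:47  haproxy[30453]: [WARNING] 064/015547 (30579) : Health check for server nodes/www4.xxx.com failed, reason: Layer7 timeout, check duration: 2000ms, status: 2/3 UP.",
    "Mar  6 01:55:49  haproxy[30453]: [WARNING] 064/015549 (30579) : Health check for server nodes/www3.xxx.com succeeded, reason: Layer7 check passed, code: 200, check duration: 43ms, status: 3/3 UP.",
    "Mar  6 01:58:18  haproxy[30453]: [WARNING] 064/015818 (30579) : Server nodes/www4.xxx.com is DOWN. 3 active and 0 backup servers left. 1 sessions active, 0 requeued, 0 remaining in queue.",
]

failed, succeeded = summarize_checks(sample)
for server in sorted(set(failed) | set(succeeded)):
    print(server, "failed:", failed[server], "succeeded:", succeeded[server])
```

Feeding it each rotated log file in turn would show whether the failures are spread evenly across the four servers or concentrated on one.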

We even have a script that runs every 15 minutes, uploading and downloading 100MB, and it doesn’t seem to show this issue.

Below are the haproxy compile info and all of the configs; I know that doesn’t help much.

 haproxy -vvv
HA-Proxy version 2.2.8-7bf78d7 2021/01/13 - https://haproxy.org/
Status: long-term supported branch - will stop receiving fixes around Q2 2025.
Known bugs: http://www.haproxy.org/bugs/bugs-2.2.8.html
Running on: Linux 4.15.0-130-generic #134-Ubuntu SMP Tue Jan 5 20:46:26 UTC 2021 x86_64
Build options :
  TARGET  = linux-glibc
  CPU     = generic
  CC      = gcc
  CFLAGS  = -O2 -g -Wall -Wextra -Wdeclaration-after-statement -fwrapv -Wno-unused-label -Wno-sign-compare -Wno-unused-parameter -Wno-clobbered -Wno-missing-field-initializers -Wno-stringop-overflow -Wtype-limits -Wshift-negative-value -Wshift-overflow=2 -Wduplicated-cond -Wnull-dereference
  OPTIONS = USE_PCRE=1 USE_OPENSSL=1 USE_LUA=1 USE_ZLIB=1 USE_SYSTEMD=1
  DEBUG   =

Feature list : +EPOLL -KQUEUE +NETFILTER +PCRE -PCRE_JIT -PCRE2 -PCRE2_JIT +POLL -PRIVATE_CACHE +THREAD -PTHREAD_PSHARED +BACKTRACE -STATIC_PCRE -STATIC_PCRE2 +TPROXY +LINUX_TPROXY +LINUX_SPLICE +LIBCRYPT +CRYPT_H +GETADDRINFO +OPENSSL +LUA +FUTEX +ACCEPT4 -CLOSEFROM +ZLIB -SLZ +CPU_AFFINITY +TFO +NS +DL +RT -DEVICEATLAS -51DEGREES -WURFL +SYSTEMD -OBSOLETE_LINKER +PRCTL +THREAD_DUMP -EVPORTS

Default settings :
  bufsize = 16384, maxrewrite = 1024, maxpollevents = 200

Built with multi-threading support (MAX_THREADS=64, default=2).
Built with OpenSSL version : OpenSSL 1.1.1  11 Sep 2018
Running on OpenSSL version : OpenSSL 1.1.1  11 Sep 2018
OpenSSL library supports TLS extensions : yes
OpenSSL library supports SNI : yes
OpenSSL library supports : TLSv1.0 TLSv1.1 TLSv1.2 TLSv1.3
Built with Lua version : Lua 5.3.3
Built with network namespace support.
Built with zlib version : 1.2.11
Running on zlib version : 1.2.11
Compression algorithms supported : identity("identity"), deflate("deflate"), raw-deflate("deflate"), gzip("gzip")
Built with transparent proxy support using: IP_TRANSPARENT IPV6_TRANSPARENT IP_FREEBIND
Built with PCRE version : 8.39 2016-06-14
Running on PCRE version : 8.39 2016-06-14
PCRE library supports JIT : no (USE_PCRE_JIT not set)
Encrypted password support via crypt(3): yes
Built with gcc compiler version 7.5.0

Available polling systems :
      epoll : pref=300,  test result OK
       poll : pref=200,  test result OK
     select : pref=150,  test result OK
Total: 3 (3 usable), will use epoll.

Available multiplexer protocols :
(protocols marked as <default> cannot be specified using 'proto' keyword)
            fcgi : mode=HTTP       side=BE        mux=FCGI
       <default> : mode=HTTP       side=FE|BE     mux=H1
              h2 : mode=HTTP       side=FE|BE     mux=H2
       <default> : mode=TCP        side=FE|BE     mux=PASS

Available services : none

Available filters :
        [SPOE] spoe
        [COMP] compression
        [TRACE] trace
        [CACHE] cache
        [FCGI] fcgi-app

sysctl adjustments

net.core.wmem_default=41943040
net.core.rmem_default=41943040
net.core.wmem_max    =512000000
net.core.rmem_max    =512000000
net.core.somaxconn=60000
net.netfilter.nf_conntrack_log_invalid = 255
net.ipv4.netfilter.ip_conntrack_log_invalid = 255
net.ipv4.tcp_max_syn_backlog = 100000
net.core.somaxconn = 100000
net.core.netdev_max_backlog = 100000
net.ipv4.ip_local_port_range=1024 65535
net.ipv4.icmp_ratelimit=0
net.ipv6.icmp.ratelimit=0

net.ipv4.tcp_rmem=131072 41943040 1094304000
net.ipv4.tcp_wmem=131072 41943040 3094304000
net.ipv4.tcp_mem=41943040 123943040 419430400
vm.swappiness=90
net.ipv4.ip_local_port_range=1024 60999
net.ipv4.tcp_fin_timeout = 30
net.ipv4.tcp_synack_retries = 3
net.ipv4.tcp_max_tw_buckets = 1440000
net.ipv4.tcp_tw_reuse = 1
net.core.somaxconn = 60000
net.ipv4.tcp_max_orphans = 262144
net.ipv4.tcp_keepalive_time=300
net.ipv4.tcp_keepalive_probes=3

net.core.optmem_max=1020000
net.ipv4.tcp_slow_start_after_idle=0
net.ipv4.tcp_fack=1
net.ipv4.tcp_congestion_control=cubic
net.ipv4.tcp_frto=0
./sanitizeconfig.sh
global
        maxconn         100000
        log /dev/log    local0
        chroot /var/lib/haproxy
        stats socket /run/haproxy/admin.sock mode 777 level admin expose-fd listeners
        stats timeout 30s
        user haproxy
        group haproxy
        daemon
        nbproc 1
        nbthread 6
        hard-stop-after 24h


        tune.h2.initial-window-size 4096000

defaults
        log     global
        mode    http
        option forwardfor
        option redispatch
        option log-separate-errors

        timeout client 8h
        timeout server 8h
        timeout connect 60s

        log-format "%ci:%cp [%tr] %ft %b/%s %TR/%Tw/%Tc/%Tr/%Ta %ST %B %CC %CS %tsc %ac/%fc/%bc/%sc/%rc %sq/%bq %hr %hs %{+Q}r %sslc"

        errorfile 400 /etc/haproxy/errors/400.http
        errorfile 403 /etc/haproxy/errors/403.http
        errorfile 408 /etc/haproxy/errors/408.http
        errorfile 500 /etc/haproxy/errors/500.http
        errorfile 502 /etc/haproxy/errors/502.http
        errorfile 503 /etc/haproxy/errors/503.http
        errorfile 504 /etc/haproxy/errors/504.http

listen statsend
        bind :9000
        mode http
        stats enable
        stats hide-version
        stats scope .
        stats realm Haproxy\ Statistics
        stats uri /haproxy-stats?stats


cache objects
        total-max-size 1024
        max-object-size 2560000
        max-age 86400

frontend cp.xxx.com
     bind *:80
     bind *:443 ssl crt /etc/apache2/sites-available/xxx.com/ssl/le/ssl-certs.pem ssl-min-ver TLSv1.2  alpn h2,h2c,http/1.1
     maxconn 50000
     compression algo gzip
     compression type text/html text/plain text/javascript application/javascript application/xml text/css
     option forwardfor
     option http-keep-alive
     timeout client     8h
     timeout http-keep-alive 60s
     timeout http-request 60s
     timeout client-fin 60s
     http-request cache-use objects
     http-response cache-store objects
     http-request set-header X-Forwarded-Port %[dst_port]
     http-request add-header X-Forwarded-Proto https if { ssl_fc }
     option http-server-close
     capture request header Referrer len 64
     capture request header Content-Length len 10
     capture request header User-Agent len 64
     http-request add-header  Strict-Transport-Security  max-age=15768000
     http-request redirect scheme https unless { ssl_fc }
     default_backend nodes


backend nodes
    mode http
    hash-type consistent
    option redispatch
    fullconn 40000



    option httpchk GET /index.php
    http-check expect status 200




    retry-on empty-response conn-failure

    option log-health-checks

    balance leastconn

    cookie WSRVID insert indirect nocache maxidle 30m maxlife 24h


    server www3.xxx.com 127.0.0.5:443 check check-ssl ssl force-tlsv13 verify none alpn h2 cookie s3 maxconn 10000 allow-0rtt sni ssl_fc_sni check-alpn http/1.1
    server www4.xxx.com 127.0.0.6:443 check check-ssl ssl force-tlsv13 verify none alpn h2 cookie s4 maxconn 10000 allow-0rtt sni ssl_fc_sni check-alpn http/1.1
    server www5.xxx.com 127.0.0.7:443 check check-ssl ssl verify none force-tlsv13 sni ssl_fc_sni allow-0rtt alpn h2 cookie s5 maxconn 10000 check-alpn http/1.1
    server www6.xxx.com 127.0.0.8:443 check check-ssl ssl verify none force-tlsv13 sni ssl_fc_sni allow-0rtt alpn h2 cookie s6 maxconn 10000 check-alpn http/1.1

All of your valuable advice is deeply appreciated. Even writing this up for the community put me on the track of the timeouts.

Thanks in advance!

Periodically check what your socket numbers look like, for example with ss -a | awk '{print $1}' | sort | uniq -c (copied from a recent thread). Check for dmesg entries (specifically conntrack). Also check adjacent firewalls (they may rate limit active connections that are freed up when you reload).
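
If you would rather log the socket tally from a script, something like the following Python sketch could work. It assumes the iproute2 ss tool is installed and that ss -tan prints the TCP state in the first column after a header line; the function names and the sample text are made up for illustration.

```python
import subprocess
from collections import Counter

def count_states(ss_output):
    """Tally the first column of `ss -tan` style output, skipping the header."""
    counts = Counter()
    for line in ss_output.splitlines()[1:]:
        fields = line.split()
        if fields:
            counts[fields[0]] += 1
    return counts

def current_states():
    # Assumption: `ss` is on PATH and `-tan` lists all TCP sockets numerically.
    out = subprocess.run(
        ["ss", "-tan"], capture_output=True, text=True, check=True
    ).stdout
    return count_states(out)

# Made-up sample text in the assumed layout, only to show the parsing.
SAMPLE = """State      Recv-Q Send-Q Local Address:Port  Peer Address:Port
LISTEN     0      128    0.0.0.0:443         0.0.0.0:*
ESTAB      0      0      10.0.0.1:443        10.0.0.2:51000
CLOSE-WAIT 1      0      10.0.0.1:443        10.0.0.3:51001
CLOSE-WAIT 1      0      10.0.0.1:443        10.0.0.4:51002
"""

for state, n in sorted(count_states(SAMPLE).items()):
    print(state, n)
```

Running current_states() from cron and appending the result to a file gives a trend that can be compared before and after a reload.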

You can also check haproxy numbers from the stats socket (html or csv).
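
As a rough sketch of the CSV route: the snippet below sends show stat to the admin socket and parses the reply. The socket path is the one from the config posted above; the helper names are hypothetical, and the sample text is a made-up, heavily trimmed subset of the real columns, used only to exercise the parser.

```python
import csv
import io
import socket

def query_stats_socket(path, command="show stat"):
    """Send one command to the HAProxy stats socket and return the reply text."""
    with socket.socket(socket.AF_UNIX, socket.SOCK_STREAM) as s:
        s.connect(path)
        s.sendall(command.encode() + b"\n")
        chunks = []
        while True:
            data = s.recv(65536)
            if not data:  # haproxy closes the connection after replying
                break
            chunks.append(data)
    return b"".join(chunks).decode()

def parse_show_stat(text):
    """Parse `show stat` CSV (header line begins with '# ') into dicts."""
    text = text.lstrip()
    if text.startswith("# "):
        text = text[2:]
    return [row for row in csv.DictReader(io.StringIO(text)) if row.get("pxname")]

# Made-up, trimmed sample; real output has many more columns.
SAMPLE = (
    "# pxname,svname,scur,smax,status\n"
    "nodes,www3.xxx.com,12,40,UP\n"
    "nodes,BACKEND,30,95,UP\n"
)

for row in parse_show_stat(SAMPLE):
    print(row["pxname"], row["svname"], row["scur"], row["status"])

# On the haproxy host itself (path taken from the posted config):
# rows = parse_show_stat(query_stats_socket("/run/haproxy/admin.sock"))
```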

Logging or graphing all those numbers will help understand the trend and will give a clearer picture of where the anomalies are.

Thanks. We’ve started graphing our socket utilization. It turns out 2.2.8 has some bugs relating to CLOSE_WAITs in HTTP/2 that have since been fixed, but 2.2.10 hangs on us unexpectedly, so when we are feeling brave we will try 2.2.10 and track that set of issues.

After further investigation, it appears our issues are related to HTTP/2.0 support in haproxy. We’ve disabled it on both the front end and the backend and are testing to see if our problems magically vanish.

Some problems we are tracking:

  1. Large downloads (1G or more) seem to periodically stall out at 99% or so, but only for a subset of connections.

  2. In 2.2.10, we get haproxy hanging within a few hours…

  3. Throughput seems lower in http/2.0 than 1.1. As soon as we turned off http/2.0 connections, our apparent usage tripled.

  4. After several days of run-time, all of our backends start being declared down due to the 2s check timeout and then immediately (within the next two seconds) come back up, and so forth… this is what the OP was about.
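
To make the flapping in item 4 easier to reason about, here is a small Python model of the status counters seen in the log ("2/3 UP", "0/2 DOWN", and so on). It is reconstructed purely from the posted log lines, assuming rise 2 and fall 3 (which matches the "x/3 UP" and "x/2 DOWN" fields); it is not HAProxy's actual implementation, and the function name simulate is made up.

```python
def simulate(results, rise=2, fall=3):
    """Replay check results (True = passed) and return the status after each."""
    up, count = True, fall
    statuses = []
    for ok in results:
        if up:
            if ok:
                count = fall      # any success restores full health
            else:
                count -= 1
                if count == 0:    # `fall` consecutive failures: server goes DOWN
                    up, count = False, 0
        else:
            if ok:
                count += 1
                if count == rise:  # `rise` consecutive successes: server is UP
                    up, count = True, fall
            else:
                count = 0
        statuses.append(f"{count}/{fall if up else rise} {'UP' if up else 'DOWN'}")
    return statuses

# www4.xxx.com between 01:58:10 and 01:58:29 in the log:
# fail, fail, fail, pass, fail, pass, pass
www4 = [False, False, False, True, False, True, True]
for status in simulate(www4):
    print(status)
```

Under this model a server that alternates between one timeout and one pass never reaches DOWN, which fits most of the log: lots of "2/3 UP" followed by "3/3 UP", with a real DOWN only after three timeouts in a row.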

[We don’t have any other specifics on how to demonstrate this, but we’re putting this here so others may benefit from our experience if they are having unexplained slowdowns, etc.]

Best

Whenever health checks start flapping like this for us, it’s because the site is being slammed with requests and we’re hitting maxconn on the backend servers, or something similar. I know you said that a reload fixes this issue, but I’m still curious what the backend servers look like when this is happening. Do you graph things like HTTP request rate, session rate, etc. per fe/be? These stats have been super helpful for stuff like this.

What does swap usage look like when this happens? If the system is swapping, I could see issues like this popping up after several days and being resolved with a reload. You have vm.swappiness set to 90, which is the opposite of what’s recommended for servers (10 or below).
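
A quick way to record swap usage alongside your other graphs is to read /proc/meminfo. The Python sketch below assumes the usual Linux "Key: value kB" layout; the function names and the sample numbers are invented for illustration.

```python
def swap_used_kib(meminfo_text):
    """Return swap in use (in the file's kB units) from /proc/meminfo-style text."""
    values = {}
    for line in meminfo_text.splitlines():
        key, _, rest = line.partition(":")
        parts = rest.split()
        if parts:
            values[key.strip()] = int(parts[0])
    return values["SwapTotal"] - values["SwapFree"]

def read_swappiness(path="/proc/sys/vm/swappiness"):
    """Return the current vm.swappiness value (Linux only)."""
    with open(path) as f:
        return int(f.read())

# Invented sample values, only to show the arithmetic.
SAMPLE = """MemTotal:       16384000 kB
MemFree:         1100000 kB
SwapTotal:       2097148 kB
SwapFree:        1048574 kB
"""

print("swap used (kB):", swap_used_kib(SAMPLE))

# On the haproxy host:
# with open("/proc/meminfo") as f:
#     print(swap_used_kib(f.read()), read_swappiness())
```

If that number climbs over the days leading up to a slowdown and drops after a reload, swapping becomes a much stronger suspect.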

In fact, I’d recommend scrutinizing every one of your sysctl adjustments if you haven’t already. I did this recently and discovered that many of the common adjustments that get copy/pasted everywhere were no longer required or recommended on newer kernels, like the wmem/rmem stuff.

I reduced our sysctl adjustments down to the following:

vm.swappiness=10
net.netfilter.nf_conntrack_max=4194304
net.netfilter.nf_conntrack_tcp_timeout_established=14400
net.ipv4.tcp_tw_reuse=1
net.ipv4.ip_local_port_range=1024 65535
net.core.somaxconn=10240
net.core.netdev_max_backlog=10240
net.ipv4.conf.all.rp_filter=1
net.ipv4.tcp_max_syn_backlog=10240
net.ipv4.tcp_synack_retries=3
net.ipv4.tcp_syncookies=1
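
When auditing sysctl settings like the list above, it can help to compare the intended values with what the kernel is actually running. A small Python sketch, assuming key=value lines as shown and the usual mapping of dotted keys to files under /proc/sys; all function names here are hypothetical.

```python
def parse_sysctl_conf(text):
    """Parse key=value lines, ignoring blanks and comments."""
    wanted = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith(("#", ";")):
            continue
        key, _, value = line.partition("=")
        wanted[key.strip()] = " ".join(value.split())  # normalise whitespace
    return wanted


def sysctl_path(key, root="/proc/sys"):
    """Map a dotted key such as vm.swappiness to its /proc/sys file."""
    return root + "/" + key.replace(".", "/")


def read_live(key):
    """Read the running value of a key, or None if it cannot be read."""
    try:
        with open(sysctl_path(key)) as handle:
            return handle.read()
    except OSError:
        return None


def diff_sysctl(wanted, read_value=read_live):
    """Return {key: (expected, actual)} for every key that differs."""
    mismatches = {}
    for key, expected in wanted.items():
        actual = read_value(key)
        if actual is not None:
            actual = " ".join(actual.split())  # "1024\t65535\n" -> "1024 65535"
        if actual != expected:
            mismatches[key] = (expected, actual)
    return mismatches
```

Calling diff_sysctl(parse_sysctl_conf(text)) on the box prints nothing interesting when everything is applied; a None in the result usually means the key does not exist on that kernel (for example, the conntrack module is not loaded).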

We’re doing over 2.3Gbps, 1000 req/s, and over 3K average sessions per server without issue. We have a similar config, run 2.2.9, also do large downloads and uploads, and have never had an issue with HTTP/2 so I’m suspicious that something else is going on with your setup.

When haproxy is hanging, what do you observe on the box? Especially how is the CPU usage and what are the socket numbers looking like?

thanks for all of the feedback!

We disabled HTTP/2.0 and moved to 2.2.10. We were doing daily reloads and have backed that down to every 48 hours with no hangs since.

All of our problems with HAProxy thus far are possibly linked to HTTP/2.0. When we expose the backend Apache servers directly to the web (bypassing HAProxy), performance is high. Whenever we put HAProxy in front (over the 6+ months we’ve been using it), performance goes down.

Since we’ve turned off HTTP/2.0 and gone to HTTP/1.1… total throughput is way up even though max-conns is way down (we used to see 200-300 max front end/back end connections, now we are around 100). That is counterintuitive, so we are watching it.

Whatever issue we are seeing with HTTP/2.0… isn’t impacting everyone; some days we get massive performance and some days we don’t. We’ve always been reluctant to blame that, but it might be a function of maturity of the code. We had to tweak window sizes and other things to improve performance as well.

Anyway… We are graphing almost every aspect of the servers’ performance now. When the system hangs we don’t see a massive increase in socket or file handle usage (or really any) or CPU load… Even though we mess with tcp_mem settings, Linux is freely allocating more than the tcp_mem_max so I don’t even know what that’s about. Some days it’s over 1GB of RAM.
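
One possible source of confusion with tcp_mem: the three values in net.ipv4.tcp_mem (min, pressure, max) are counted in memory pages, not bytes, and so is the "mem" figure on the TCP line of /proc/net/sockstat. The Python sketch below converts both to bytes for comparison. It assumes a 4096-byte page unless told otherwise (os.sysconf("SC_PAGE_SIZE") gives the real value), and the function names are made up for illustration.

```python
def parse_tcp_mem(text):
    """Parse /proc/sys/net/ipv4/tcp_mem: three page counts."""
    low, pressure, high = (int(value) for value in text.split())
    return {"min": low, "pressure": pressure, "max": high}


def parse_sockstat_tcp_pages(text):
    """Return the "mem" value (in pages) from the TCP line of sockstat."""
    for line in text.splitlines():
        if line.startswith("TCP:"):
            fields = line.split()
            return int(fields[fields.index("mem") + 1])
    raise ValueError("no TCP line found in sockstat text")


def tcp_mem_report(tcp_mem_text, sockstat_text, page_size=4096):
    """Compare TCP buffer memory in use with the tcp_mem limits, in bytes."""
    limits = parse_tcp_mem(tcp_mem_text)
    used_pages = parse_sockstat_tcp_pages(sockstat_text)
    return {
        "used_bytes": used_pages * page_size,
        "max_bytes": limits["max"] * page_size,
        "over_pressure": used_pages > limits["pressure"],
    }
```

Note that these limits only cover kernel TCP buffer memory, so a figure seen in top or in a per-process graph is not directly comparable with them.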

We’re going to keep monitoring it and stretching our window between reloads to see if we can identify any open issues, but we probably aren’t going back to HTTP/2.0 for a while.

Thanks!

Just following up here –

After this thread… we went back to haproxy 2.2.10, reduced the number of sysctl settings, and increased the system RAM to 8GB from 4GB. Re-enabled HTTP/2.0…

Things have been running pretty smoothly for about 5 days… which is more than we were getting.

When looking at the RAM issue – which is what I’m attributing it to – our graphs show we only popped into swap when “inactive” memory went high. Even now top shows only 400MB “used” but including inactive memory and things we are at about 4.5GB.
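
A possible explanation for swap being touched when "inactive" memory is high: /proc/meminfo splits inactive memory into Inactive(file), which is page cache the kernel can generally drop, and Inactive(anon), which is process memory that can only be evicted by writing it to swap. A small Python sketch that separates the two, assuming Linux and with a hypothetical function name:

```python
def parse_meminfo(text):
    """Return {field: numeric value} from the text of /proc/meminfo."""
    values = {}
    for line in text.splitlines():
        name, sep, rest = line.partition(":")
        parts = rest.split()
        if sep and parts and parts[0].isdigit():
            values[name.strip()] = int(parts[0])
    return values


def inactive_breakdown(meminfo):
    """Express inactive anon/file memory as a percentage of MemTotal."""
    total = meminfo["MemTotal"]
    return {
        "inactive_anon_pct": 100.0 * meminfo.get("Inactive(anon)", 0) / total,
        "inactive_file_pct": 100.0 * meminfo.get("Inactive(file)", 0) / total,
        "swap_used_kb": meminfo.get("SwapTotal", 0) - meminfo.get("SwapFree", 0),
    }
```

Graphing the two percentages separately should show which kind of inactive memory was growing before the box went into swap.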

I always thought of inactive memory / cache / etc as things designed for the kernel to push out of memory … since… they are basically not part of the active processing queue.

I could be wrong, but if the memory solved our issue, this is clearly not the case for haproxy.

So provisionally, please just consider this a thank you for the advice and an apology for being hard headed about something like this!

Best,

[Graph: haproxy-tcp_monitor-week]

If you take a look at this graph, you’ll see the CLOSE_WAIT problem is not gone… it reappeared when we started under 2.2.10 with HTTP/2.0. The number is under control at 30 connections, but some of them have been left for over 4 hours since the last valid request.
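
To put a number on how long individual sockets linger, one option is to sample /proc/net/tcp periodically and remember when each CLOSE_WAIT socket was first seen. The Python sketch below does that. Its assumptions: IPv4 only, a little-endian host (x86 or typical ARM) for address decoding, and age measured from the first sample that saw the socket in CLOSE_WAIT, which is not the same as time since the last valid request. The class name CloseWaitTracker is invented for this example.

```python
import socket
import struct

CLOSE_WAIT = "08"  # state code in the "st" column of /proc/net/tcp


def decode_ipv4(hex_addr):
    """Turn "0100007F:01BB" into "127.0.0.1:443" (little-endian host)."""
    ip_hex, port_hex = hex_addr.split(":")
    ip = socket.inet_ntoa(struct.pack("<I", int(ip_hex, 16)))
    return f"{ip}:{int(port_hex, 16)}"


class CloseWaitTracker:
    def __init__(self):
        self.first_seen = {}  # (local, remote, inode) -> first sample time

    def update(self, proc_net_tcp_text, now):
        """Record one sample of /proc/net/tcp taken at time `now` (seconds)."""
        current = set()
        for line in proc_net_tcp_text.splitlines()[1:]:  # skip header
            fields = line.split()
            if len(fields) >= 10 and fields[3] == CLOSE_WAIT:
                current.add((fields[1], fields[2], fields[9]))
        # Forget sockets that have left CLOSE_WAIT since the last sample.
        self.first_seen = {
            key: seen for key, seen in self.first_seen.items() if key in current
        }
        for key in current:
            self.first_seen.setdefault(key, now)

    def older_than(self, seconds, now):
        """Return sorted (age, local, remote) for long-lived CLOSE_WAIT sockets."""
        return sorted(
            (now - seen, decode_ipv4(local), decode_ipv4(remote))
            for (local, remote, _inode), seen in self.first_seen.items()
            if now - seen >= seconds
        )
```

Calling update() from a cron job or a loop with time.time() as `now`, then older_than(4 * 3600, now), would list the sockets matching the four-hour case described above, which is useful detail to attach to a bug report.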

A connection that stays in CLOSE_WAIT is a haproxy bug, there is no other explanation.

Could you try 2.2.11 and, if that exhibits the same behavior, file a bug on GitHub?

Thanks

sure. thanks