Hi. A mobile app has the following connection chain:
mobile app (https) → AWS NLB (tcp/443) → haproxy on EC2 instance (https) → AWS ALB (http) → backend on AWS Fargate
I’ve discovered that, while sending traffic to the ALB, haproxy sometimes pauses the transfer for several seconds (or even several tens of seconds) for no apparent reason.
(I can share the full pcap dump if needed.)
It happens for only a fraction of requests, mostly during peaks in customer traffic.
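To locate the pauses in the capture I look for large gaps between consecutive packets on the haproxy → ALB connection. A minimal sketch of that check, assuming the packet timestamps have already been exported as a list of seconds (the `find_stalls` helper and the sample values are made up for illustration and are not taken from the real pcap):

```python
from typing import List, Tuple


def find_stalls(timestamps: List[float], threshold: float = 1.0) -> List[Tuple[float, float]]:
    """Return (start_time, gap) pairs where two consecutive packets
    are more than `threshold` seconds apart."""
    stalls = []
    for prev, cur in zip(timestamps, timestamps[1:]):
        gap = cur - prev
        if gap > threshold:
            stalls.append((prev, gap))
    return stalls


# Hypothetical timestamps in seconds, for illustration only.
sample = [0.000, 0.002, 0.004, 7.660, 7.661]
stalls = find_stalls(sample)
print(stalls)
```

With the sample above, the only gap reported is the one between the third and fourth packet.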
Here are the relevant haproxy and ALB log entries:
Sep 3 15:20:31 141.214.x.y:13971 [03/Sep/2024:15:20:22.467] lb-useast~ apielb_backend/api-elb3 0/1736/0/0/2/7662/9400 ---- 60/60/6/3/0 0/0 "POST https://domain.com/ls/apiLocation/location HTTP/2.0" 200 {|||||domain.com||549|Mozilla/5.0 (iPhone; CPU iPhone OS 17_6_1 like Mac OS X) AppleWebKit/605.1.15 (KHTML, like Gecko} ireq_size=958 resp_size=123 172.31.93.169:4443 3.229.x.y:443 ECDHE-RSA-AES256-GCM-SHA384 TLSv1.2
https 2024-09-03T15:20:31.868632Z app/ecs-prod/fa267007e 54.156.x.y:48588 172.31.80.84:80 7.655 0.005 0.000 200 200 1068 143 "POST https://domain.com:443/ls/apiLocation/location HTTP/1.1" "Mozilla/5.0 (iPhone; CPU iPhone OS 17_6_1 like Mac OS X) AppleWebKit/605.1.15 (KHTML, like Gecko) [Optional(VirtueTrack) 1.118]" ECDHE-RSA-AES128-GCM-SHA256 TLSv1.2 arn:aws:elasticloadbalancing:us-east-1:2770xxxxx:targetgroup/ecs-prod-fg-locationserver/dfa1459540 "Root=1-66d72938-062b422f6a970fb4" "-" "session-reused" 335 2024-09-03T15:20:24.207000Z "waf,forward" "-" "-" "172.31.80.84:80" "200" "-" "-" TID_9c7aa4736266674d823
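For anyone reading the timers: with my log-format, the haproxy field 0/1736/0/0/2/7662/9400 is %Th/%Ti/%TR/%Tw/%Tc/%Tr/%Tt (milliseconds). The three numbers after the target address in the ALB line are, as far as I understand the ALB access log format, request_processing_time, target_processing_time and response_processing_time (seconds). A small sketch that labels the values from the two lines above (the helper name is mine):

```python
# Timer names in the order used by the log-format: %Th/%Ti/%TR/%Tw/%Tc/%Tr/%Tt
TIMER_NAMES = ["Th", "Ti", "TR", "Tw", "Tc", "Tr", "Tt"]


def parse_haproxy_timers(field: str) -> dict:
    """Split the slash-separated timer field into a name -> milliseconds dict."""
    values = [int(v) for v in field.split("/")]
    if len(values) != len(TIMER_NAMES):
        raise ValueError(f"expected {len(TIMER_NAMES)} timers, got {len(values)}")
    return dict(zip(TIMER_NAMES, values))


timers = parse_haproxy_timers("0/1736/0/0/2/7662/9400")

# ALB access log timings in seconds, copied from the ALB line above.
# The field names are my reading of the ALB access log format.
alb = {
    "request_processing_time": 7.655,
    "target_processing_time": 0.005,
    "response_processing_time": 0.000,
}

print(timers)
print(alb)
# The first six haproxy timers add up to Tt for this request.
print(sum(timers[name] for name in TIMER_NAMES[:-1]) == timers["Tt"])
```

So haproxy's Tr (7662 ms) is almost the same as the ALB's request_processing_time (7.655 s), while the target itself answered in 0.005 s.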
Please help me find the cause of these pauses and fix it.
# uname -r
4.14.350-266.564.amzn2.x86_64
# haproxy -vv
HAProxy version 3.0.3-95a607c 2024/07/11 - https://haproxy.org/
Status: long-term supported branch - will stop receiving fixes around Q2 2029.
Known bugs: http://www.haproxy.org/bugs/bugs-3.0.3.html
Running on: Linux 4.14.350-266.564.amzn2.x86_64 #1 SMP Sat Aug 10 09:56:03 UTC 2024 x86_64
Build options :
TARGET = linux-glibc
CC = cc
CFLAGS = -O2 -g -fwrapv
OPTIONS = USE_THREAD=1 USE_LINUX_TPROXY=1 USE_OPENSSL=1 USE_ZLIB=1 USE_TFO=1 USE_NS=1 USE_SYSTEMD=1 USE_PROMEX=1 USE_PCRE=1 USE_PCRE_JIT=1
DEBUG =
Feature list : -51DEGREES +ACCEPT4 +BACKTRACE -CLOSEFROM +CPU_AFFINITY +CRYPT_H -DEVICEATLAS +DL -ENGINE +EPOLL -EVPORTS +GETADDRINFO -KQUEUE -LIBATOMIC +LIBCRYPT +LINUX_CAP +LINUX_SPLICE +LINUX_TPROXY -LUA -MATH -MEMORY_PROFILING +NETFILTER +NS -OBSOLETE_LINKER +OPENSSL -OPENSSL_AWSLC -OPENSSL_WOLFSSL -OT +PCRE -PCRE2 -PCRE2_JIT +PCRE_JIT +POLL +PRCTL -PROCCTL +PROMEX -PTHREAD_EMULATION -QUIC -QUIC_OPENSSL_COMPAT +RT +SHM_OPEN -SLZ +SSL -STATIC_PCRE -STATIC_PCRE2 +SYSTEMD +TFO +THREAD +THREAD_DUMP +TPROXY -WURFL +ZLIB
Default settings :
bufsize = 16384, maxrewrite = 1024, maxpollevents = 200
Built with multi-threading support (MAX_TGROUPS=16, MAX_THREADS=256, default=2).
Built with OpenSSL version : OpenSSL 1.0.2k-fips 26 Jan 2017
Running on OpenSSL version : OpenSSL 1.0.2k-fips 26 Jan 2017
OpenSSL library supports TLS extensions : yes
OpenSSL library supports SNI : yes
OpenSSL library supports : SSLv3 TLSv1.0 TLSv1.1 TLSv1.2
Built with the Prometheus exporter as a service
Built with network namespace support.
Built with zlib version : 1.2.7
Running on zlib version : 1.2.7
Compression algorithms supported : identity("identity"), deflate("deflate"), raw-deflate("deflate"), gzip("gzip")
Built with transparent proxy support using: IP_TRANSPARENT IPV6_TRANSPARENT IP_FREEBIND
Built with PCRE version : 8.32 2012-11-30
Running on PCRE version : 8.32 2012-11-30
PCRE library supports JIT : yes
Encrypted password support via crypt(3): yes
Built with gcc compiler version 7.3.1 20180712 (Red Hat 7.3.1-17)
Available polling systems :
epoll : pref=300, test result OK
poll : pref=200, test result OK
select : pref=150, test result OK
Total: 3 (3 usable), will use epoll.
Available multiplexer protocols :
(protocols marked as <default> cannot be specified using 'proto' keyword)
h2 : mode=HTTP side=FE|BE mux=H2 flags=HTX|HOL_RISK|NO_UPG
<default> : mode=HTTP side=FE|BE mux=H1 flags=HTX
h1 : mode=HTTP side=FE|BE mux=H1 flags=HTX|NO_UPG
fcgi : mode=HTTP side=BE mux=FCGI flags=HTX|HOL_RISK|NO_UPG
<default> : mode=TCP side=FE|BE mux=PASS flags=
none : mode=TCP side=FE|BE mux=PASS flags=NO_UPG
Available services : prometheus-exporter
Available filters :
[BWLIM] bwlim-in
[BWLIM] bwlim-out
[CACHE] cache
[COMP] compression
[FCGI] fcgi-app
[SPOE] spoe
[TRACE] trace
HAProxy config (meaningful parts):
global
maxconn 25000
daemon
master-worker
set-var proc.max_conn_cur int(500)
set-var proc.max_conn_rate int(2000)
set-var proc.max_http_err_rate int(200)
set-var proc.max_http_req_rate int(5000)
ssl-default-bind-ciphers ECDHE-ECDSA-AES128-GCM-SHA256:ECDHE-RSA-AES128-GCM-SHA256:ECDHE-ECDSA-AES256-GCM-SHA384:ECDHE-RSA-AES256-GCM-SHA384:ECDHE-ECDSA-CHACHA20-POLY1305:ECDHE-RSA-CHACHA20-POLY1305:DHE-RSA-AES128-GCM-SHA256:DHE-RSA-AES256-GCM-SHA384
ssl-default-server-ciphers ECDHE-ECDSA-AES128-GCM-SHA256:ECDHE-RSA-AES128-GCM-SHA256:ECDHE-ECDSA-AES256-GCM-SHA384:ECDHE-RSA-AES256-GCM-SHA384:ECDHE-ECDSA-CHACHA20-POLY1305:ECDHE-RSA-CHACHA20-POLY1305:DHE-RSA-AES128-GCM-SHA256:DHE-RSA-AES256-GCM-SHA384
defaults
mode http
log global
option forwardfor except 127.0.0.0/8
option dontlognull # don't log connections on which no data was exchanged
option splice-auto # accelerate performance with kernel tcp splicing options
option httplog # enable logging of HTTP request, session state and timers
option http-server-close # operate in http-server-close mode
option redispatch # allow switching to another backend server when the one in the cookie goes down
option contstats # enable continuous traffic statistics updates
retries 3
backlog 25000 # correlates with maxconn
timeout client 60s # was 120
timeout client-fin 15s # was 25
timeout connect 5s
timeout server 1h # was 60s
timeout tunnel 1h
timeout http-keep-alive 10s # was 1
timeout http-request 5s # was 15
timeout queue 30s
timeout tarpit 60s
timeout check 5s
default-server inter 6s rise 1 fall 3
log-format %ci:%cp\ [%t]\ %ft\ %b/%s\ %Th/%Ti/%TR/%Tw/%Tc/%Tr/%Tt\ %tsc\ %ac/%fc/%bc/%sc/%rc\ %sq/%bq\ %{+Q}r\ %ST\ "%hr"\ ireq_size=%U\ resp_size=%B\ %fi:%fp\ %si:%sp\ %sslc\ %sslv
frontend lb-useast
mode http
maxconn 8000
bind *:4080 name lb-useast_frontend_http
bind *:4443 name lb-useast_frontend ssl crt-list /mnt/s3vol-common/crt/crt-list.txt
tcp-request connection track-sc0 src
stick-table type ip size 500k expire 30s store conn_cur,conn_rate(10s),http_req_rate(10s),http_err_rate(10s)
tcp-request connection reject if { src -f /etc/haproxy/blacklist.lst }
# CVE-2023-25725 workaround
http-request deny if { fc_http_major 1 } !{ req.body_size 0 } !{ req.hdr(content-length) -m found } !{ req.hdr(transfer-encoding) -m found }
# Reject the new connection if the client already has proc.max_conn_cur opened
http-request add-header X-Haproxy-ACL %[req.fhdr(X-Haproxy-ACL,-1)]over-%[var(proc.max_conn_cur)]-active-connections, if { sc0_conn_cur,sub(proc.max_conn_cur) gt 0 }
http-request capture sc0_conn_cur len 4 if { sc0_conn_cur,sub(proc.max_conn_cur) gt 0 }
# Reject the new connection if the client has opened more than proc.max_conn_rate connections in 10 seconds
http-request add-header X-Haproxy-ACL %[req.fhdr(X-Haproxy-ACL,-1)]over-%[var(proc.max_conn_rate)]-connections-in-10-seconds, if { sc0_conn_rate,sub(proc.max_conn_rate) gt 0 }
http-request capture sc0_conn_rate len 4 if { sc0_conn_rate,sub(proc.max_conn_rate) gt 0 }
# Reject the connection if the client has passed the HTTP error rate
http-request add-header X-Haproxy-ACL %[req.fhdr(X-Haproxy-ACL,-1)]high-error-rate, if { sc0_http_err_rate(),sub(proc.max_http_err_rate) gt 0 }
http-request capture sc0_http_err_rate() len 4 if { sc0_http_err_rate(),sub(proc.max_http_err_rate) gt 0 }
# Reject the connection if the client has passed the HTTP request rate
http-request add-header X-Haproxy-ACL %[req.fhdr(X-Haproxy-ACL,-1)]high-request-rate, if { sc0_http_req_rate(),sub(proc.max_http_req_rate) gt 0 }
http-request capture sc0_http_req_rate() len 4 if { sc0_http_req_rate(),sub(proc.max_http_req_rate) gt 0 }
# Insert a unique request identifier in the headers of the request
# passed to the backend
unique-id-format %{+X}o\ %ci:%cp_%fi:%fp_%Ts_%rt:%pid
unique-id-header X-Unique-ID
declare capture request len 256
http-request capture req.fhdr(X-Haproxy-ACL) id 0
capture request header Host len 64
capture request header Referer len 164
capture request header Content-Length len 10
capture request header User-Agent len 96
#-- strip out the port part of the Host header
http-request set-header host %[hdr(host),field(1,:)]
use_backend alb_backend
backend alb_backend
# Remove the ACL header
http-request del-header ^X-Haproxy-ACL
server-template alb 5 alb.domain.com:443 ssl verify none check port 443 resolvers aws
The instance does not appear to be overloaded (sar CPU output around the time of the request):
11:40:01 AM CPU %user %nice %system %iowait %steal %idle
...
03:10:01 PM all 2.42 0.00 2.12 0.06 1.79 93.61
03:20:01 PM all 2.85 0.00 2.20 0.07 1.81 93.06
03:30:01 PM all 2.81 0.00 2.04 0.12 1.71 93.32
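To put a number on it, here is a small sketch that turns the three sar rows above into a non-idle CPU percentage (100 minus %idle). The helper name is mine. Note that these are 10-minute averages, so a burst lasting only a few seconds would not necessarily show up here.

```python
# sar rows copied from the output above.
SAR_LINES = [
    "03:10:01 PM all 2.42 0.00 2.12 0.06 1.79 93.61",
    "03:20:01 PM all 2.85 0.00 2.20 0.07 1.81 93.06",
    "03:30:01 PM all 2.81 0.00 2.04 0.12 1.71 93.32",
]


def busy_percent(line: str) -> float:
    """Return 100 - %idle for one sar CPU row (%idle is the last column)."""
    idle = float(line.split()[-1])
    return round(100.0 - idle, 2)


# Key each result by its timestamp, e.g. "03:20:01 PM".
busy = {" ".join(line.split()[:2]): busy_percent(line) for line in SAR_LINES}
print(busy)
```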