Hi,
We are attempting to use HAProxy to load balance gRPC requests (L7) across 6 app servers, which have nginx in front of the app. We are using TLS between nginx and HAProxy, and TLS with a clientside certificate between HAProxy and gRPC clients. We get very frequent retries, and some 503 timeouts, with no easily discernible cause. We cannot find any evidence of packet loss or drops, or firewall issues. Also, there are other services running on the same machines, using the same nginx and the same HAProxy instances, which do not have this issue (they are TCP load balanced however. gRPC is the only service we are doing HTTP load balancing with).
The logs look like this for 503s:
Sep 8 13:24:40 nl14s0143 haproxy[25638]: 10.119.0.48:42061 [08/Sep/2020:13:24:36.971] lb_https_grpc~ bk_grpc/srv1 0/3042/-1/-1/3057 503 237 - - SC-- 13413/967/4/0/+3 0/0 {6238257832466128252} {} "POST /Service/Thing HTTP/2.0"
and like this when requests do end up going through:
Sep 8 13:28:10 nl14s0143 haproxy[25638]: 10.119.0.50:19933 [08/Sep/2020:13:28:09.173] lb_https_grpc~ bk_grpc/srv2 0/1016/0/12/1028 200 285 - - ---- 13639/619/2/0/+1 0/0 {5313047619323712921} {} "POST /Service/Thing HTTP/2.0"
Retries happen approximately 10% of the time, while 503s happen around 1% of the time.
Unfortunately our full config is several thousand lines so I cannot post the entire thing. If I need to share some specific parts let me know.
global
log /dev/log local0
log /dev/log local1 notice
chroot /var/lib/haproxy
stats socket /run/haproxy/admin.sock mode 777 level admin expose-fd listeners
stats timeout 30s
user haproxy
group haproxy
daemon
nbproc 1
nbthread 16
cpu-map auto:1/1-16 0-15
maxconn <%= @global_max_conn.to_i %>
ca-base /etc/ssl/certs
server-state-base <%= @backend_states_dir %>
hard-stop-after 12h
ssl-default-bind-ciphers <%= @ssl_ciphers.join(':') %>
ssl-default-bind-options ssl-min-ver <%= @ssl_protocols.last %>
tune.ssl.default-dh-param 2048
defaults
mode tcp
timeout client 330s
timeout server 330s
timeout connect 320s
timeout tunnel 900s
timeout client-fin 10s
timeout server-fin 10s
timeout check 500ms
timeout queue 1s
timeout tarpit 1s
option tcp-smart-connect
option redispatch 1
http-reuse safe
load-server-state-from-file local
frontend lb_https_grpc
mode http
log global
option httplog
maxconn 90000
<%- @grpc_l7_vip.each do |vip| -%>
bind <%= vip %>:443 ssl crt /etc/haproxy/ssl/cert.com.pem alpn h2 ca-file /etc/nginx/ssl/trusted.pem verify optional
<%- end -%>
option http-use-htx
http-request set-header X-SSL %[ssl_fc]
http-request set-header X-SSL-Client-Verify %[ssl_c_verify]
http-request set-header X-SSL-Client-SHA1 %{+Q}[ssl_c_sha1]
http-request set-header X-SSL-Client-DN %{+Q}[ssl_c_s_dn]
http-request set-header X-SSL-Client-CN %{+Q}[ssl_c_s_dn(cn)]
http-request set-header X-SSL-Issuer %{+Q}[ssl_c_i_dn]
http-request set-header X-SSL-Client-Not-Before %{+Q}[ssl_c_notbefore]
http-request set-header X-SSL-Client-Not-After %{+Q}[ssl_c_notafter]
http-request deny if !{ ssl_fc_has_crt }
use_backend bk_grpc if { ssl_fc_has_crt } { ssl_fc_sni -i grpc-api.grpcbackend.com }
backend bk_grpc
mode http
balance leastconn
retry-on empty-response conn-failure response-timeout
option httpchk GET /ws-status "HTTP/1.0\r\nHost: status.grpcbackend.com"
http-check expect status 200
default-server weight 100 inter 2s fall 1 rise 2 agent-port 6667 agent-inter 3s on-error mark-down on-marked-down shutdown-sessions ssl verify required ca-file cert.pem alpn h2 maxconn 10000 check check-ssl check-sni status.grpcbackend.com send-proxy-v2-ssl port 444 check-send-proxy check-alpn http/1.0 agent-check agent-send state\n
server srv1 10.120.10.111:450 agent-addr 10.120.10.111
server srv2 10.120.12.164:450 agent-addr 10.120.12.164
server srv3 10.120.10.135:450 agent-addr 10.120.10.135
server srv4 10.120.12.237:450 agent-addr 10.120.12.237
server srv5 10.119.0.52:450 agent-addr 10.119.0.52
server srv6 10.119.0.53:450 agent-addr 10.119.0.53
We are using Debian Stretch and HAProxy 2.0 from haproxy.debian.net. Does anyone have any clues as to how to find the cause for this?