HAProxy community

Frequent connection retries and timeouts with gRPC


We are trying to use HAProxy to load balance gRPC requests at L7 across 6 app servers, each of which has nginx in front of the app. We use TLS between HAProxy and nginx, and TLS with a client certificate between the gRPC clients and HAProxy. We see very frequent retries, and some 503 errors, with no easily discernible cause. We cannot find any evidence of packet loss, drops, or firewall issues. Other services run on the same machines, using the same nginx and the same HAProxy instances, and do not have this issue; however, those are TCP load balanced, and gRPC is the only service we load balance in HTTP mode.

The logs look like this for 503s:

Sep  8 13:24:40 nl14s0143 haproxy[25638]: [08/Sep/2020:13:24:36.971] lb_https_grpc~ bk_grpc/srv1 0/3042/-1/-1/3057 503 237 - - SC-- 13413/967/4/0/+3 0/0 {6238257832466128252} {} "POST /Service/Thing HTTP/2.0"

and like this when requests do end up going through:

Sep  8 13:28:10 nl14s0143 haproxy[25638]: [08/Sep/2020:13:28:09.173] lb_https_grpc~ bk_grpc/srv2 0/1016/0/12/1028 200 285 - - ---- 13639/619/2/0/+1 0/0 {5313047619323712921} {} "POST /Service/Thing HTTP/2.0"

Retries happen on approximately 10% of requests, while 503s happen on around 1%.
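For anyone who wants to break these lines down, here is a minimal Python sketch that splits out the timer and counter fields and tallies retries and 503s. It assumes the field order of the default `option httplog` format (TR/Tw/Tc/Tr/Ta, then status, bytes, cookies, termination state, actconn/feconn/beconn/srv_conn/retries, srv_queue/backend_queue). The names `parse_line` and `summarize` are made up for illustration, and the two sample lines are the ones quoted above.

```python
import re

# Assumed layout (default "option httplog" format):
#   frontend backend/server TR/Tw/Tc/Tr/Ta status bytes req_cookie resp_cookie
#   term_state actconn/feconn/beconn/srv_conn/retries srv_queue/backend_queue
LOG_RE = re.compile(
    r"(?P<frontend>\S+) (?P<backend>[^/\s]+)/(?P<server>\S+) "
    r"(?P<TR>-?\d+)/(?P<Tw>-?\d+)/(?P<Tc>-?\d+)/(?P<Tr>-?\d+)/(?P<Ta>[+-]?\d+) "
    r"(?P<status>-?\d+) (?P<bytes>\+?\d+) \S+ \S+ (?P<term_state>\S{4}) "
    r"(?P<actconn>\d+)/(?P<feconn>\d+)/(?P<beconn>\d+)/(?P<srv_conn>\d+)/"
    r"(?P<retries>\+?\d+) (?P<srv_queue>\d+)/(?P<backend_queue>\d+)"
)

INT_FIELDS = (
    "TR", "Tw", "Tc", "Tr", "Ta", "status", "bytes",
    "actconn", "feconn", "beconn", "srv_conn", "srv_queue", "backend_queue",
)


def parse_line(line):
    """Return a dict of parsed fields, or None if the line does not match."""
    match = LOG_RE.search(line)
    if match is None:
        return None
    rec = match.groupdict()
    # A leading '+' on the retry counter means the request was redispatched.
    rec["redispatched"] = rec["retries"].startswith("+")
    rec["retries"] = int(rec["retries"])
    for key in INT_FIELDS:
        rec[key] = int(rec[key])
    return rec


def summarize(lines):
    """Count parsed requests, requests with at least one retry, and 503s."""
    total = retried = status_503 = 0
    for line in lines:
        rec = parse_line(line)
        if rec is None:
            continue  # skip lines that are not HTTP log lines
        total += 1
        if rec["retries"] > 0:
            retried += 1
        if rec["status"] == 503:
            status_503 += 1
    return {"total": total, "retried": retried, "status_503": status_503}


SAMPLE_503 = (
    'Sep  8 13:24:40 nl14s0143 haproxy[25638]: [08/Sep/2020:13:24:36.971] '
    'lb_https_grpc~ bk_grpc/srv1 0/3042/-1/-1/3057 503 237 - - SC-- '
    '13413/967/4/0/+3 0/0 {6238257832466128252} {} "POST /Service/Thing HTTP/2.0"'
)
SAMPLE_200 = (
    'Sep  8 13:28:10 nl14s0143 haproxy[25638]: [08/Sep/2020:13:28:09.173] '
    'lb_https_grpc~ bk_grpc/srv2 0/1016/0/12/1028 200 285 - - ---- '
    '13639/619/2/0/+1 0/0 {5313047619323712921} {} "POST /Service/Thing HTTP/2.0"'
)

if __name__ == "__main__":
    print(parse_line(SAMPLE_503))
    print(summarize([SAMPLE_503, SAMPLE_200]))
```

As far as the HAProxy logging documentation describes these fields, a Tc of -1 means the connection to the server was never established, and a termination state starting with SC means the connection was refused or aborted while HAProxy was still connecting. In the 503 line that goes together with a retry counter of +3, so the request was retried and redispatched before failing.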

Unfortunately our full config is several thousand lines, so I cannot post the entire thing; the relevant parts are below. If I need to share any other specific parts, let me know.

global
  log /dev/log local0
  log /dev/log local1 notice
  chroot /var/lib/haproxy
  stats socket /run/haproxy/admin.sock mode 777 level admin expose-fd listeners
  stats timeout 30s
  user haproxy
  group haproxy
  nbproc 1
  nbthread 16
  cpu-map auto:1/1-16 0-15
  maxconn <%= @global_max_conn.to_i %>
  ca-base /etc/ssl/certs
  server-state-base <%= @backend_states_dir %>
  hard-stop-after 12h
  ssl-default-bind-ciphers <%= @ssl_ciphers.join(':') %>
  ssl-default-bind-options ssl-min-ver <%= @ssl_protocols.last %>
  tune.ssl.default-dh-param 2048

defaults
  mode tcp
  timeout client 330s
  timeout server 330s
  timeout connect 320s
  timeout tunnel 900s
  timeout client-fin 10s
  timeout server-fin 10s
  timeout check 500ms
  timeout queue 1s
  timeout tarpit 1s
  option tcp-smart-connect
  option redispatch 1
  http-reuse safe
  load-server-state-from-file local

frontend lb_https_grpc
  mode http
  log global
  option httplog
  maxconn 90000
<%- @grpc_l7_vip.each do |vip| -%>
  bind <%= vip %>:443 ssl crt /etc/haproxy/ssl/cert.com.pem alpn h2 ca-file /etc/nginx/ssl/trusted.pem verify optional
<%- end -%>
  option http-use-htx

  http-request set-header X-SSL                       %[ssl_fc]
  http-request set-header X-SSL-Client-Verify         %[ssl_c_verify]
  http-request set-header X-SSL-Client-SHA1           %{+Q}[ssl_c_sha1]
  http-request set-header X-SSL-Client-DN             %{+Q}[ssl_c_s_dn]
  http-request set-header X-SSL-Client-CN             %{+Q}[ssl_c_s_dn(cn)]
  http-request set-header X-SSL-Issuer                %{+Q}[ssl_c_i_dn]
  http-request set-header X-SSL-Client-Not-Before     %{+Q}[ssl_c_notbefore]
  http-request set-header X-SSL-Client-Not-After      %{+Q}[ssl_c_notafter]

  http-request deny if !{ ssl_fc_has_crt }
  use_backend bk_grpc  if { ssl_fc_has_crt } { ssl_fc_sni -i grpc-api.grpcbackend.com }

backend bk_grpc
  mode http
  balance leastconn
  retry-on empty-response conn-failure response-timeout
  option httpchk GET /ws-status "HTTP/1.0\r\nHost: status.grpcbackend.com"
  http-check expect status 200
  default-server weight 100 inter 2s fall 1 rise 2 agent-port 6667 agent-inter 3s on-error mark-down on-marked-down shutdown-sessions ssl verify required ca-file cert.pem alpn h2 maxconn 10000 check check-ssl check-sni status.grpcbackend.com send-proxy-v2-ssl port 444 check-send-proxy check-alpn http/1.0 agent-check agent-send state\n
  server srv1 agent-addr
  server srv2 agent-addr
  server srv3 agent-addr
  server srv4 agent-addr
  server srv5 agent-addr
  server srv6 agent-addr

We are using Debian Stretch and HAProxy 2.0 from haproxy.debian.net. Does anyone have any clues as to how to find the cause of this?