Load Balancing TensorFlow GPUs using GRPC on multiple ports

skranjanin · October 27, 2020, 8:55am

Hi all,
Is there any extra setting that needs to be enabled when load balancing gRPC services running in Docker? We are trying to load balance our backend, which in this case is a set of Tesla GPUs running the same TensorFlow models on different ports, all inside Docker. The client makes gRPC calls in which it sends multiple images for the models running on the GPUs to consume. We have HAProxy between the client and the GPUs, running on a separate VM, and we hoped it would load balance requests across these GPUs. Unfortunately, we are unable to make it work: we keep getting one or the other of two errors. On the GPU console we can see that, out of 8 images for instance, only one image gets processed, and then we end up with the exception below:

status = StatusCode.UNAVAILABLE
details = "failed to connect to all addresses"
debug_error_string = "{"created":"@1603277976.225947468","description":"Failed to pick subchannel","file":"src/core/ext/filters/client_channel/client_channel.cc","file_line":3941,"referenced_errors":[{"created":"@1603277976.225941688","description":"failed to connect to all addresses","file":"src/core/ext/filters/client_channel/lb_policy/pick_first/pick_first.cc","file_line":393,"grpc_status":14}]}"

and sometimes this:

status = StatusCode.CANCELLED
details = "Received RST_STREAM with error code 8"
debug_error_string = "{"created":"@1603183282.697214100","description":"Error received from peer ipv4:IPaddress:8700","file":"src/core/lib/surface/call.cc","file_line":1056,"grpc_message":"Received RST_STREAM with error code 8","grpc_status":1}"
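
For anyone reading the second error: the number in "Received RST_STREAM with error code 8" is an HTTP/2 error code, and code 8 is CANCEL in RFC 7540. The following is only a small sketch for decoding such messages. The helper name `rst_stream_error_name` is made up for illustration and is not part of gRPC or HAProxy.

```python
import re

# HTTP/2 error codes as listed in RFC 7540, section 7.
HTTP2_ERROR_CODES = {
    0x0: "NO_ERROR",
    0x1: "PROTOCOL_ERROR",
    0x2: "INTERNAL_ERROR",
    0x3: "FLOW_CONTROL_ERROR",
    0x4: "SETTINGS_TIMEOUT",
    0x5: "STREAM_CLOSED",
    0x6: "FRAME_SIZE_ERROR",
    0x7: "REFUSED_STREAM",
    0x8: "CANCEL",
    0x9: "COMPRESSION_ERROR",
    0xA: "CONNECT_ERROR",
    0xB: "ENHANCE_YOUR_CALM",
    0xC: "INADEQUATE_SECURITY",
    0xD: "HTTP_1_1_REQUIRED",
}


def rst_stream_error_name(details):
    """Return the HTTP/2 error name found in a gRPC 'details' string.

    Hypothetical helper: returns None when the string does not mention
    an RST_STREAM error code, and "UNKNOWN" for codes outside the table.
    """
    match = re.search(r"RST_STREAM with error code (\d+)", details)
    if match is None:
        return None
    return HTTP2_ERROR_CODES.get(int(match.group(1)), "UNKNOWN")


if __name__ == "__main__":
    print(rst_stream_error_name("Received RST_STREAM with error code 8"))
```

So the stream was cancelled by the peer the client talks to (here the address on port 8700), rather than rejected with a protocol error.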

Below is the relevant part of the HAProxy configuration (here 'loadbalancerIP' and 'IPaddress' stand in for the respective IP addresses).

global
        log /dev/log    local0
        log /dev/log    local1 notice
        chroot /var/lib/haproxy
        stats socket /run/haproxy/admin.sock mode 660 level admin expose-fd listeners
        stats timeout 30s
        user haproxy
        group haproxy
        daemon
        maxconn 50000
        #only for debugging
        debug

        # Default SSL material locations
        #ca-base /etc/ssl/certs
        #crt-base /etc/ssl/private

        # See: https://ssl-config.mozilla.org/#server=haproxy&server-version=2.0.3&config=intermediate
        #ssl-default-bind-ciphers ECDHE-ECDSA-AES128-GCM-SHA256:ECDHE-RSA-AES128-GCM-SHA256:ECDHE-ECDSA-AES256-GCM-SHA384:ECDHE-RSA-AES256-GCM-SHA384:ECDHE-ECDSA-CHACHA20-POLY1305:ECDHE-RSA-CHACHA20-POLY1305:DHE-RSA-AES128-GCM-SHA256:DHE-RSA-AES256-GCM-SHA384
        #ssl-default-bind-ciphersuites TLS_AES_128_GCM_SHA256:TLS_AES_256_GCM_SHA384:TLS_CHACHA20_POLY1305_SHA256
        #ssl-default-bind-options no-sslv3 no-tls-tickets
        #ssl-default-server-options no-sslv3 no-tls-tickets

defaults
        log     global
        mode    http
        option  httplog
        option  dontlognull
        timeout connect 5000
        timeout client  50000
        timeout server  50000
        errorfile 400 /etc/haproxy/errors/400.http
        errorfile 403 /etc/haproxy/errors/403.http
        errorfile 408 /etc/haproxy/errors/408.http
        errorfile 500 /etc/haproxy/errors/500.http
        errorfile 502 /etc/haproxy/errors/502.http
        errorfile 503 /etc/haproxy/errors/503.http
        errorfile 504 /etc/haproxy/errors/504.http
        option http-use-htx
        #option logasap
        maxconn 3000
frontend loadbalancernode
        bind 'loadbalancerIP':8700 proto h2
        bind 'loadbalancerIP':8600 proto h2
        bind 'loadbalancerIP':8601 proto h2
        bind 'loadbalancerIP':8605 proto h2
        default_backend gpu_servers

backend gpu_servers
        balance leastconn
        mode http
        server server1_8700 'IPaddress':8700 proto h2 check
        server server1_8600 'IPaddress':8600 proto h2 check
        server server1_8601 'IPaddress':8601 proto h2 check
        server server1_8605 'IPaddress':8605 proto h2 check

        server server2_8700 'IPaddress':8700 proto h2 check
        server server2_8600 'IPaddress':8600 proto h2 check
        server server2_8601 'IPaddress':8601 proto h2 check
        server server2_8605 'IPaddress':8605 proto h2 check

        server server3_8700 'IPaddress':8700 proto h2 check
        server server3_8600 'IPaddress':8600 proto h2 check
        server server3_8601 'IPaddress':8601 proto h2 check
        server server3_8605 'IPaddress':8605 proto h2 check

listen stats
  bind  :30000
  mode  http
  stats enable
  stats uri /haproxy?stats
  stats hide-version
  stats refresh 60
  stats realm Haproxy-Statistics
  stats auth admin:password

Without HAProxy, the calls work perfectly fine, so I am not sure what HAProxy is adding to or removing from the client calls that it forwards to the backend servers.
Any help would be much appreciated.

Thanks,
Ranjan

skranjanin · October 29, 2020, 5:32am

Update:
Just for testing, I set up Nginx in front of the same GPUs, and it worked. So the problem definitely lies with either the configuration file or HAProxy itself when it comes to gRPC.
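
Since the first error is "failed to connect to all addresses", one basic thing worth ruling out is plain TCP reachability of the four ports (8700, 8600, 8601, 8605). It can be run from the client against the load balancer, or from the HAProxy VM against each backend. Below is a minimal sketch using only the Python standard library. The function names and the host in the usage section are placeholders, not anything from our actual setup, and a successful TCP connect says nothing about whether HTTP/2 or gRPC works on that port.

```python
import socket


def port_is_open(host, port, timeout=2.0):
    """Return True if a TCP connection to (host, port) succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts and unreachable hosts.
        return False


def check_ports(host, ports, timeout=2.0):
    """Return a dict mapping each port to its reachability on host."""
    return {port: port_is_open(host, port, timeout) for port in ports}


if __name__ == "__main__":
    # Placeholder host: replace with the load balancer or backend address.
    host = "127.0.0.1"
    for port, ok in check_ports(host, [8700, 8600, 8601, 8605]).items():
        print(f"{host}:{port} {'open' if ok else 'closed or unreachable'}")
```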
