Unable to receive gRPC error during a response streaming call using HAProxy in HTTP mode

I’m working on a gRPC application that will be served behind HAProxy using the http mode. If the server application immediately (ie. before sending any responses) aborts the response streaming call with specific error, then the client application will receive CANCELLED error instead of the one that was sent. The error details will be Received RST_STREAM with error code 8. I’m working with HAProxy 2.3.2 and grpc 1.34.0.

For each of such requests, HAProxy log entry have SD-- flags set in the session state at disconnection field, eg.

<134>Jan  9 18:09:39 1a8328663d74 haproxy[8]: 172.28.0.4:41698 [09/Jan/2021:18:09:39.346] grpc-server-fe grpc-server-fe/grpc-server-be 0/0/0/-1/+0 -1 +115 - - SD-- 1/1/0/0/0 0/0 "POST http://proxy:6000/service.Service/StreamStream HTTP/2.0"

In the HAProxy documentation those flags are defined as follows:

  • S: the TCP session was unexpectedly aborted by the server, or the server explicitly refused it.
  • D: the session was in the DATA phase.

Additionally:

SD The connection to the server died with an error during the data transfer. This usually means that haproxy has received an RST from the server or an ICMP message from an intermediate equipment while exchanging data with the server. This can be caused by a server crash or by a network issue on an intermediate equipment.

Knowing that, and having in mind that the HTTP connection was opened with an intention of data streaming, as a workaround I have tried sending an empty gRPC message before raising the error. The solution have helped partially - the error codes could have been received by the client for the most requests, but the issue still happened from time to time.

As a next step I have inspected the network traffic using wireshark. Following is a trace of the HTTP response served by the gRPC server on an event of immediate call abortion:

HyperText Transfer Protocol 2
    Stream: SETTINGS, Stream ID: 0, Length 0
        Length: 0
        Type: SETTINGS (4)
        Flags: 0x01
        0... .... .... .... .... .... .... .... = Reserved: 0x0
        .000 0000 0000 0000 0000 0000 0000 0000 = Stream Identifier: 0
    Stream: HEADERS, Stream ID: 1, Length 88, 200 OK
        Length: 88
        Type: HEADERS (1)
        Flags: 0x05
            .... ...1 = End Stream: True
            .... .1.. = End Headers: True
            .... 0... = Padded: False
            ..0. .... = Priority: False
            00.0 ..0. = Unused: 0x00
        0... .... .... .... .... .... .... .... = Reserved: 0x0
        .000 0000 0000 0000 0000 0000 0000 0001 = Stream Identifier: 1
        [Pad Length: 0]
        Header Block Fragment: 88400c636f6e74656e742d74797065106170706c69636174...
        [Header Length: 120]
        [Header Count: 4]
        Header: :status: 200 OK
        Header: content-type: application/grpc
        Header: grpc-status: 7
        Header: grpc-message: Details sent by the server
    Stream: RST_STREAM, Stream ID: 1, Length 4
        Length: 4
        Type: RST_STREAM (3)
        Flags: 0x00
        0... .... .... .... .... .... .... .... = Reserved: 0x0
        .000 0000 0000 0000 0000 0000 0000 0001 = Stream Identifier: 1
        Error: NO_ERROR (0)

So the server sends response headers with details on error with End Stream and End Headers flags set. And then closes the stream with NO_ERROR code. According to the answer provided in should gRPC server-side half-closing implicitly terminate the client? Stack Overflow question, everything is ok at this stage. I have also briefly reviewed the RFC 7540 and couldn’t find anything that is off in the terms of HTTP/2 protocol.

Quoted gRPC server HTTP response is followed by HAProxy originating TCP ACK, and next, HAProxy dispatches its response to the client.

HyperText Transfer Protocol 2
    Stream: HEADERS, Stream ID: 1, Length 75, 200 OK
        Length: 75
        Type: HEADERS (1)
        Flags: 0x05
            .... ...1 = End Stream: True
            .... .1.. = End Headers: True
            .... 0... = Padded: False
            ..0. .... = Priority: False
            00.0 ..0. = Unused: 0x00
        0... .... .... .... .... .... .... .... = Reserved: 0x0
        .000 0000 0000 0000 0000 0000 0000 0001 = Stream Identifier: 1
        [Pad Length: 0]
        Header Block Fragment: 885f106170706c69636174696f6e2f67727063000b677270...
        [Header Length: 120]
        [Header Count: 4]
        Header: :status: 200 OK
        Header: content-type: application/grpc
        Header: grpc-status: 7
        Header: grpc-message: Details sent by the server
    Stream: RST_STREAM, Stream ID: 1, Length 4
        Length: 4
        Type: RST_STREAM (3)
        Flags: 0x00
            0000 0000 = Unused: 0x00
        0... .... .... .... .... .... .... .... = Reserved: 0x0
        .000 0000 0000 0000 0000 0000 0000 0001 = Stream Identifier: 1
        Error: CANCEL (8)

It can be seen that the flags and all the contents of the HEADERS frame are in place, so the error details are passed to the client, but the code of RST_STREAM has changed to CANCEL. In effect the client recieves all the expected data, but after that it receives unexpected RST_STREAM(CANCEL) that is being mapped on gRPC CANCELLED error, as specified in the gRPC documentation.

In the course of further investigation, I have referred to HAProxy source code. I have discovered that the code is set in the h2_do_shutr function of mux_h2.c (experiments with custom HAProxy builds have proven that it is indeed this place). The involved code branch has the following comment:

a final response was already provided, we don’t want this stream anymore. This may happen when the server responds before the end of an upload and closes quickly (redirect, deny, …)

So those are the details I have managed to gather on the issue. I’m not entirely sure whether the problem lies in the gRPC core (being too messy in the terms of HTTP2 streams handling) or HAProxy (being too careless while rewriting RST_STREAM codes).

The final question is, how can I tune the configuration of HAProxy and gRPC core server to work correctly on an event of immediate call abortion? At the moment, to overcome the issue, I have prepared custom HAProxy build that responds with RST_STREAM(NO_ERROR) on the described circumstances, but this solution feels rather like technology debt.

The minimal HAProxy configuration that reproduces the issue is as follows:

global
    log stdout local0

listen grpc-server-fe
    bind *:6000 proto h2

    mode http
    log global
    option logasap
    option httplog

    timeout connect 5000ms
    timeout client 50000ms
    timeout server 50000ms

    server grpc-server-be server:6000 proto h2

I have prepared a repository with minimal example containing simple python client & server. It also contains docker-compose with network environment including configured HAProxy.