Windows Mobile and haproxy - Unable to read data from the transport connection

We have installed an haproxy server in front of our web service and API. It decrypts the requests (TLS), parses the headers, then forwards to either our internal servers or our AWS cloud service depending on request parameters (both also TLS).

The service handles about 30 million requests a day. Almost all work fine, but we have a problem with a small number of requests from (ancient) Windows Mobile handhelds. The users of these devices frequently experience problems connecting to our API, together with this device-level error message: “Unable to read data from the transport connection”.

HAProxy doesn’t log any errors when this problem happens. Because the handhelds are on client premises, we’re also unable to determine whether the failing requests are being logged by HAProxy as successful, since they’re surrounded by many genuinely successful requests.
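
One idea for narrowing this down (not tried yet): capture the User-Agent header in the HAProxy log so the handheld traffic can be picked out of the 30 million daily lines, then look at the termination-state flags on just those lines. This assumes the Windows Mobile devices send a distinctive User-Agent; the header name and capture length below are only illustrative.

frontend wms
  # Captured request headers appear between braces in each httplog line,
  # so handheld requests can be grepped out and their 4-character
  # termination-state field inspected for client-side timeouts/aborts.
  capture request header User-Agent len 64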

There are no network-level errors reported by the network interface.

Removing haproxy from the equation stops the problem. Putting it back starts the problem again. But I’m at a complete loss as to what might be happening. If anybody has seen anything like this then I’d be very glad to hear from you.

Update

Some customers tell me that this problem happens if their device is idle for more than a couple of minutes between uses. I wonder if there’s some keep-alive timeout behaviour in haproxy that trips up Windows Mobile (but is handled fine by more modern systems).
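
If I’m reading the HAProxy 1.8 docs correctly (and I may not be), when timeout http-keep-alive is not set the http-request timeout governs the idle period between requests, and when that is not set either, timeout client applies. Neither of the first two is set in our config (below), so the 50s timeout client would be what decides when an idle handheld’s connection gets closed:

defaults
        timeout connect 5000
        timeout client  50000   # with neither "timeout http-request" nor "timeout http-keep-alive"
                                # set, this 50s also acts as the keep-alive idle timeout
        timeout server  600000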

Other information

We compiled this haproxy ourselves as we needed to support SSLv3 (yes, I know about the security issues, but we have no choice but to support clients who won’t countenance replacing their 500 in-factory devices just because we tell them they should).

Output from haproxy -vv

 HA-Proxy version 1.8.19 2019/02/11
Copyright 2000-2019 Willy Tarreau <willy@haproxy.org>

Build options :
  TARGET  = linux2628
  CPU     = generic
  CC      = gcc
  CFLAGS  = -O2 -g -fno-strict-aliasing -Wdeclaration-after-statement -fwrapv -Wno-format-truncation -Wno-null-dereference -Wno-unused-label
  OPTIONS = USE_GETADDRINFO=1 USE_ZLIB=1 USE_REGPARM=1 USE_OPENSSL=1 USE_SYSTEMD=1 USE_PCRE=1 USE_PCRE_JIT=1 USE_NS=1

Default settings :
  maxconn = 2000, bufsize = 16384, maxrewrite = 1024, maxpollevents = 200

Built with OpenSSL version : OpenSSL 1.0.2r  26 Feb 2019
Running on OpenSSL version : OpenSSL 1.0.2r  26 Feb 2019
OpenSSL library supports TLS extensions : yes
OpenSSL library supports SNI : yes
OpenSSL library supports : SSLv3 TLSv1.0 TLSv1.1 TLSv1.2
Built with transparent proxy support using: IP_TRANSPARENT IPV6_TRANSPARENT IP_FREEBIND
Encrypted password support via crypt(3): yes
Built with multi-threading support.
Built with PCRE version : 8.39 2016-06-14
Running on PCRE version : 8.39 2016-06-14
PCRE library supports JIT : yes
Built with zlib version : 1.2.11
Running on zlib version : 1.2.11
Compression algorithms supported : identity("identity"), deflate("deflate"), raw-deflate("deflate"), gzip("gzip")
Built with network namespace support.

Available polling systems :
      epoll : pref=300,  test result OK
       poll : pref=200,  test result OK
     select : pref=150,  test result OK
Total: 3 (3 usable), will use epoll.

Available filters :
        [SPOE] spoe
        [COMP] compression
        [TRACE] trace

Our haproxy.conf (redacted)

global
        log /dev/log    local0
        log /dev/log    local1 notice
        chroot /var/lib/haproxy
        stats socket /run/haproxy/admin.sock mode 660 level admin expose-fd listeners
        stats timeout 30s
        user haproxy
        group haproxy
        maxconn 4000
        daemon

        # Default SSL material locations
        ca-base /etc/ssl/certs
        crt-base /etc/ssl/private
        ssl-default-bind-ciphers ECDH+AESGCM:DH+AESGCM:ECDH+AES256:DH+AES256:ECDH+AES128:DH+AES:RSA+AESGCM:RSA+AES:RSA+3DES:!aNULL:!MD5:!DSS
        ssl-default-bind-options ssl-min-ver SSLv3 

defaults
        log     global
        mode    http
        option  httplog
        option  dontlognull
        option http-buffer-request              # Needed to inspect the POST request parameters (CA/WMS)
        timeout connect 5000
        timeout client  50000
        timeout server  600000
        errorfile 400 /etc/haproxy/errors/400.http
        errorfile 403 /etc/haproxy/errors/403.http
        errorfile 408 /etc/haproxy/errors/408.http
        errorfile 500 /etc/haproxy/errors/500.http
        errorfile 502 /etc/haproxy/errors/502.http
        errorfile 503 /etc/haproxy/errors/503.http
        errorfile 504 /etc/haproxy/errors/504.http

resolvers mydns
  nameserver dns1 8.8.8.8:53
  nameserver dns2 10.200.0.101:53
  nameserver dns3 10.101.1.11:53
  resolve_retries       3
  timeout resolve       1s
  timeout retry         1s
  hold other           30s
  hold refused         30s
  hold nx              30s
  hold timeout         30s
  hold valid           10s
  hold obsolete        30s

listen  stats
        bind 0.0.0.0:8000
        mode            http
        log             global

        maxconn 10

        clitimeout      100s
        srvtimeout      100s
        contimeout      100s
        timeout queue   100s

        stats enable
        stats hide-version
        stats refresh 30s
        stats show-node
        stats auth xxx:xxx
        stats uri  /haproxy?stats


#
# We listen on a single front-end that is bound to both ports 80 and 443. The 443 bind applies the secure certificate.
#
# When a request is received, these parts of the request are scanned:
# - The URL path
# - The REFERER header (needed for some anonymous assets that are loaded at login time)
# - The request body (for POST requests)
#
# If any one of these contains any of the client id strings present in the file "/etc/haproxy/aws_clients"
# then a match is found and the AWS back-end will be used. 
# Otherwise the in-house back-end will be used - either secure or insecure depending on the request.
#

frontend wms

  maxconn 4000
# Looks for the client id in the parameters
  acl aws_path urlp_sub -i -f /etc/haproxy/aws_clients

# Looks for the client id in the URL path
  acl aws_path path_sub -i -f /etc/haproxy/aws_clients

# Looks for the client id in the Referer header
  acl aws_referer req.hdr(Referer) -i -m sub -f /etc/haproxy/aws_clients

# Looks for the client id in the Cookie header
  acl aws_cookie req.hdr(Cookie) -i -m sub -f /etc/haproxy/aws_clients

# Looks for the client id in the POST body. 
  acl aws_param req.body -i -m sub -f /etc/haproxy/aws_clients

# Determines whether this is a secure or insecure request
  acl is_ssl dst_port eq 443

  bind *:80
  bind *:443 ssl crt /etc/ssl/haproxy_pvx.pem
  mode http

# wms_cloud back-end is used if any one of the match criteria is met
  use_backend wms_cloud if aws_path or aws_referer or aws_cookie or aws_param

# Otherwise use ovh either secure or insecure
  use_backend wms_ovh_ssl if is_ssl
  default_backend wms_ovh


# Back-end definitions - cloud, ovh insecure, ovh secure

backend wms_cloud
  mode http
  option httpchk
  server wmscloud1 cloud.ourservice.net:443 check resolvers mydns ssl verify none 

backend wms_ovh
  mode http
  server wms1 10.200.0.110:80

backend wms_ovh_ssl
  fullconn 4000
  mode http
  server wms1 10.200.0.110:443 ssl verify none maxconn 4000

It’s hard to tell the root cause here if you’re unable to reproduce it in a clean environment with full logging, but I’d suggest setting the following in your defaults section:

timeout http-keep-alive 1s
timeout http-request 10s

Currently timeout client (50s) applies to both of those timeouts, which may not be ideal for a number of reasons. Closing the HTTP request earlier may permit a cleaner shutdown of the HTTP(S) connection before those handhelds go into some kind of standby mode.
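
To be explicit about where these go, the defaults section would gain just these two lines (same values as above; tune them to your traffic):

defaults
        # ... existing settings unchanged ...
        timeout http-request    10s   # time allowed for the client to send the complete request
                                      # (with option http-buffer-request this includes the body)
        timeout http-keep-alive 1s    # close idle keep-alive connections promptly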

But, that’s just a guess.

Many thanks for this, I’ll give it a try.

So the above changes made everything far worse: all client Windows Mobile devices were repeatedly falling over with connection errors.

So I’ve gone the other way and increased the timeout to 120s (from the 50s we had configured). So far we haven’t seen the problem again.
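
For anyone following along, in config terms this amounts to raising timeout client in the defaults section (the only 50s timeout in the config above):

defaults
        timeout client  120s    # was 50000 (50s); with no http-keep-alive timeout set,
                                # this also governs how long an idle connection is kept open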

That means those clients are unable to handle a graceful close from the server side, which is bad. Hopefully, with a 120s keep-alive timeout, the client closes first.