502 Bad gateway where large files are uploaded

Hi,
I get intermittent failures when uploading largish files (5M) via haproxy. I have a fairly simple setup at this stage with haproxy fronting two servers (custom) with SSL termination.
The server is gobbling data at high rate, close to 64k per read() invocation and has no trouble keeping up, however, at some point it reads zero before all data is transmitted through from the client, because haproxy has terminated the connection.

I upgraded to the latest stable release, but that did not make a difference:

HA-Proxy version 1.7.0-e59fcdd 2016/11/25
Copyright 2000-2016 Willy Tarreau willy@haproxy.org

Build options :
TARGET = linux2628
CPU = generic
CC = gcc
CFLAGS = -O2 -g -fno-strict-aliasing -Wdeclaration-after-statement
OPTIONS = USE_ZLIB=1 USE_OPENSSL=1

Default settings :
maxconn = 2000, bufsize = 16384, maxrewrite = 1024, maxpollevents = 200

Encrypted password support via crypt(3): yes
Built with zlib version : 1.2.7
Running on zlib version : 1.2.7
Compression algorithms supported : identity(“identity”), deflate(“deflate”), raw-deflate(“deflate”), gzip(“gzip”)
Built with OpenSSL version : OpenSSL 1.0.1e 11 Feb 2013
Running on OpenSSL version : OpenSSL 1.0.1e 11 Feb 2013
OpenSSL library supports TLS extensions : yes
OpenSSL library supports SNI : yes
OpenSSL library supports prefer-server-ciphers : yes
Built without PCRE support (using libc’s regex instead)
Built without Lua support
Built with transparent proxy support using: IP_TRANSPARENT IPV6_TRANSPARENT IP_FREEBIND

Available polling systems :
epoll : pref=300, test result OK
poll : pref=200, test result OK
select : pref=150, test result OK
Total: 3 (3 usable), will use epoll.

Available filters :
[COMP] compression
[TRACE] trace
[SPOE] spoe

This is a log entry for a failed upload:

Dec 8 02:01:14 localhost haproxy[1791]: 139.162.56.133:33638 [08/Dec/2016:02:01:12.807] api_farm~ bk_api/ufsrv0 0/0/1/-1/1204 502 204 - - SH-- 0/0/0/0/0 0/0 {} {} “PUT /V1/Account/Attachment/8de29bf901bcb895c3a287821 HTTP/1.1”

Config file:

global
stats socket /var/run/haproxy.sock mode 600 level admin
stats timeout 2m
log 127.0.0.1 local0
maxconn 10024
tune.ssl.default-dh-param 2048
tune.maxrewrite 1024
tune.bufsize 17408 #16k+1k
debug

defaults
log global
mode http
option http-server-close
option httplog
option tcplog
option dontlognull
retries 3
option redispatch
option contstats
option forwardfor
timeout client 60s
timeout connect 5s
timeout server 60s
timeout tunnel 3600s
timeout http-keep-alive 1s
timeout http-request 60s
timeout queue 30s
timeout tarpit 60s

frontend ufsrvapi_farm
bind *:20080 ssl crt /opt/etc/ssl/xxxxx.pem
mode http
option httplog
capture cookie session len 100
capture request header X-UFSRVCID len 64
capture response header X-UFSRVCID len 64
default_backend bk_ws
stats enable
stats uri /haproxy_ufsrvapi
stats refresh 30s

stats auth admin:admin

    default_backend bk_ufsrvapi

backend bk_ufsrvapi
mode http
balance roundrobin

    server ufsrv0 127.0.0.1:20081 weight 10 check
    server ufsrv1 127.0.0.1:20082 weight 10 check

Or is it the other way around? The application is eager to read(), the kernel has no data on this socket yet, and the application erroneously terminates the connection? How does the application wait for data before read()ing?

SH translates to this (see doc [1]):

The server aborted before sending its full HTTP response headers, or
it crashed while processing the request. Since a server aborting at
this moment is very rare, it would be wise to inspect its logs to
control whether it crashed and why. The logged request may indicate a
small set of faulty requests, demonstrating bugs in the application.
Sometimes this might also be caused by an IDS killing the connection
between haproxy and the server.

I strongly suggest to triple check that read() handling in the application.

[1] HAProxy version 1.7.14 - Configuration Manual

Thanks for your perspective, especially the link to the document. At this stage I have ruled out haproxy as being the culprit, but I am yet to figure out why I am getting zero reads. The server doesn’t crash or anything just occasionally reads zero prematurely on otherwise non blocking sockets.

Curiously, I did a tcpdump trace and the client (curl in this instance) was not initiating close when the server read zero. I did that whilst the server was under the instrumentation of valgrind to ensure there was no funny memory business affecting the internals of the server. I think I am in for some painful debugging on this one.

You probably want to run your application through strace, that will help understanding what happens on a system call level.