Strange connection issues with TCP (Redis) under load

Hello,

I have a very strange connection issue while trying to proxy Redis servers.

First, the relevant configuration extract:

defaults
    mode http
    log     global
    option  httplog
    option  dontlognull
    option  dontlog-normal
    option  http-ignore-probes
    timeout connect 5000
    timeout client  50000
    timeout server  50000
    maxconn 25000

frontend redis
    bind *:6362
    mode tcp
    use_backend redis
    timeout client 3h
    option clitcpka

backend redis
    mode tcp
    timeout server 3h
    option srvtcpka
    
    server cache01 cache01:6381 check

Under heavy load, connections are randomly dropped, but no limits are hit when I check the HAProxy stats (~50 open connections).
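
For reference, this is roughly how I read the stats; a minimal sketch, assuming a stats socket such as “stats socket /var/run/haproxy.sock” is configured in the global section (not shown in the extract above):

    # Sketch: query HAProxy's stats socket from Python. Assumes a
    # "stats socket /var/run/haproxy.sock" line in the global section,
    # which is not part of the configuration extract above.
    import socket

    s = socket.socket(socket.AF_UNIX, socket.SOCK_STREAM)
    s.connect("/var/run/haproxy.sock")
    s.sendall(b"show stat\n")
    # "show stat" returns per-proxy CSV counters, including the current
    # session count (scur), which stays around ~50 here.
    print(s.makefile().read())
    s.close()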

(To add some extra detail: it’s a Celery ‘client’ instance using Redis as a result backend, and it fails to submit new tasks.)
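
The client side looks roughly like this (a minimal sketch; the hostname, database numbers and task name are made up, port 6362 being the HAProxy frontend above):

    # Sketch of the Celery client setup; hostname, DB numbers and task
    # name are made up. Port 6362 is the HAProxy frontend defined above.
    from celery import Celery

    app = Celery(
        "tasks",
        broker="redis://haproxy-host:6362/0",
        backend="redis://haproxy-host:6362/1",  # Redis result backend, proxied
    )

    # Submitting work goes through the proxy; this is what fails under load.
    result = app.send_task("tasks.add", args=(2, 2))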

The HAProxy logs show this:

Jan 28 11:40:30 localhost haproxy[17014]: 127.0.0.1:37852 [28/Jan/2019:11:08:59.155] redis redis/cache01 1/0/1891613 2421222 SD 133/98/97/97/0 0/0
Jan 28 11:40:41 localhost haproxy[17014]: 127.0.0.1:43558 [28/Jan/2019:11:40:30.805] redis redis/cache01 1/0/10482 2605056 CD 133/98/97/97/0 0/0

Extra remark: the two disconnects (the SD and CD log lines above) always seem to happen together. If I decode the termination states correctly, SD means the server aborted the connection during the data phase, and CD means the client did.

Extra details: the client application receives a socket.timeout (Python).
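
Roughly where this surfaces (a sketch; Celery uses redis-py underneath, and the timeout value, host and key name here are made up):

    # Sketch: where the timeout shows up on the client side.
    # Timeout value, host and key name are made up.
    import socket
    import redis

    r = redis.Redis(host="haproxy-host", port=6362, socket_timeout=30)
    try:
        r.get("celery-task-meta-<task-id>")
    except (socket.timeout, redis.exceptions.TimeoutError):
        # The application-level socket.timeout reported under load.
        raise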

There are no issues with a direct connection: this only happens with HAProxy in between (but we need it ^^).

The issue happens on different systems / Linux kernels: we can reproduce it locally on a single “basic” Arch Linux machine or in Debian stable containers on a “powerful” server.

Having no idea which side was responsible (Celery ↔ HAProxy or HAProxy ↔ Redis), we started dumping network packets:

For the first connection (37852, closed at 11:40:30), we have the following timeline:

11:08:59 celery → ha: Connection opened. ha → redis opened a few milliseconds later
11:09:00 ha → celery: At the TCP level, the window seems to be full (on the Celery side)
11:09:00 celery → ha: Celery starts to send TCP packets with a window of zero
11:09:10 After a few keep-alives from ha to celery, Celery starts sending data again, but its window is still set to 0. There is no more data traffic from ha to celery anywhere in the rest of the dump.

11:11:05 redis → ha: At the TCP level, the window seems to be full (on the HAProxy side)
11:11:05 ha → redis: HAProxy starts to send TCP packets with a window of zero
11:11:11 After a few keep-alives, HAProxy starts sending data again, but its window is still set to 0. No data traffic from redis to celery will be seen again.

11:21:17 redis → ha: At the TCP level, the window seems to be full (on the Redis side)
11:21:17 Redis starts to send TCP packets with a window of zero as well
11:22:35 Redis closes the connection

11:36:45 ha → redis: At the TCP level, the window seems to be full (on the HAProxy side). At this point no more traffic is exchanged
11:38:50 Last TCP ack
11:40:30 Celery closes the connection

Remark 1: There is data “lost” in HAProxy’s buffers: the traffic redis → ha → celery is no longer sent after 11:09:00, while redis → ha only stops at 11:11:05 (so there are ~2 minutes of traffic sitting in HAProxy’s buffers).

Remark 2: Why didn’t HAProxy close the front connection when the back connection was lost? I’m 99% sure that I matched the two connections correctly in the tcpdump (correct timing, and the same data while everything was still working).
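
As an illustration of the mechanism (not our actual code): any client that keeps writing but never reads will produce exactly this pattern. A minimal sketch, with host/port made up:

    # Sketch: a client that keeps sending but never calls recv().
    # Replies pile up in the local kernel receive buffer, the advertised
    # TCP window shrinks to 0, and the peer falls back to zero-window
    # probes: the same pattern as in the capture above.
    # Host/port are made up.
    import socket
    import time

    s = socket.create_connection(("haproxy-host", 6362))
    while True:
        # Raw RESP encoding of "GET foo"; we never read the replies.
        s.sendall(b"*2\r\n$3\r\nGET\r\n$3\r\nfoo\r\n")
        time.sleep(0.001)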

For the second connection, the issue is quite similar:

Timeline:

11:40:30 Connection opened
11:40:33 Last message from ha to celery (window full), but no pause in the data from celery to ha
11:40:41 Celery closes the connection
11:40:41 ha closes the connection to redis

This second case makes more sense (no pause in the data flow; once the front is closed, the back is closed immediately).

Some screenshots of the tcpdumps, in case I missed something (captions only):

Case 1: SYNs; Celery window full; backend HAProxy window full; Redis window full and close; frontend HAProxy window full and close.

Case 2: SYNs; Celery window full; Celery close; Redis close.

I don’t understand why this happens only with HAProxy in between. There is load on the system, but it seems unlikely that the Celery client is unable to find time to drain its receive buffer for ~30 minutes, especially when it does find time to send data (exactly the never-reading pattern sketched above).

This could be a client issue (Celery uses redis-py), but I was wondering whether it could be an HAProxy configuration issue, and/or whether someone has any hints for tracking it down.

Thanks for your help!

When you connect directly, do you see all those zero-window messages as well? Are there any indications in the logs of the client and the server (other than plain socket timeouts)?

I suggest double-checking iptables and conntrack. Even if this is all happening on loopback, that does not mean conntrack doesn’t intervene.

Also, please provide the output of haproxy -vv (and state whether or not you used different releases).

Hello,

My apologies, I forgot to update this thread.

In the end it was indeed a Celery problem: the client was not consuming its buffer, which led to these issues. HAProxy in between added an extra buffer, creating this strange behaviour, but everything has been “fixed” with changes on the Celery side.

For reference: https://github.com/celery/celery/issues/5320#issuecomment-460295593

Sorry for the useless thread - it was not HAProxy’s fault at all ^^’
