High number of retries


#1

Hello Everyone!

I just started logging all haproxy web requests to elasticsearch and have been running some analysis on the long running requests. I came across a high number of requests that are being retried - meaning they hit the “timeout connect 10s” limit and then succeeded on a later attempt.

Our HAProxy box runs on a 1gb/s nic and has at most 30mb/s coming out of it. We have about 400 r/s coming into haproxy and split pretty evenly to our 4 webservers. Looking at the stats we have .8/s retry rate. That seems awfully high to me… I understand that number will never be zero but almost 1 request a second is having to retry.

I’ve been monitoring/logging tons of networking metrics on our webserver side and they seem fine. No queueing on their side, normal cpu (<30%), and around 30mb/s also on a 1gb/s nic.

I’ve set the timeout lower to “3100” and the rate stayed the same.

My main question is what are the normal causes of the high retry rate? Nic being saturated? Our servers sit right next to each other so latency shouldn’t be an issue.

The only other thing I have thought of is we use websockets with haproxy and have a high number of connections (about 6000) at this writing.

Any help or direction in digging into this would be appreciated!
Paul


#2

Retries really only happen because the TCP connection can’t be established.

I would take a look at the following things:

  • packet loss between haproxy and the backend server
  • stateful firewall (iptables/conntrack) on the haproxy box (check rules and dmesg)
  • stateful firewall (iptables/conntrack) on the backend servers (check rules and dmesg)
  • stateful firewalls between haproxy and the backend servers
  • any TCP backlog overflows in your backend (are syn cookies enabled? check all netstat counters like TcpExtTCPBacklogDrop)
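For the backlog counters in the last point, a minimal Python sketch that reads them straight from /proc/net/netstat may help when netstat -s prints descriptive text instead of counter names. It assumes a Linux box and the usual layout of that file (a line of names followed by a line of values, both with the same prefix). The function names and the list of watched counters are illustrative only, and a counter your kernel does not export is simply skipped.

```python
def parse_proc_netstat(text):
    """Parse the name-line / value-line pairs used by /proc/net/netstat."""
    counters = {}
    lines = [ln for ln in text.splitlines() if ln.strip()]
    # Lines come in pairs: "TcpExt: NameA NameB" then "TcpExt: 1 2".
    for header, values in zip(lines[0::2], lines[1::2]):
        prefix, names = header.split(":", 1)
        _, nums = values.split(":", 1)
        for name, num in zip(names.split(), nums.split()):
            counters[prefix.strip() + name] = int(num)
    return counters


# Counters related to a full or overflowing listen backlog (illustrative list).
WATCHED = ("TcpExtListenOverflows", "TcpExtListenDrops", "TcpExtTCPBacklogDrop")


def watched_counters(counters, names=WATCHED):
    """Return only the watched counters that this kernel actually exports."""
    return {n: counters[n] for n in names if n in counters}


def read_watched(path="/proc/net/netstat"):
    """Read the live counters on a Linux host."""
    with open(path) as fh:
        return watched_counters(parse_proc_netstat(fh.read()))
```

Calling read_watched() twice on a backend, a minute or so apart, shows whether any of these values is still increasing while the retries happen.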

Depending on your haproxy configuration, release and backend capability you may be using http-keep-alive or not; this will have a big impact and it (keep-alive) may be able to hide the underlying problem if used.


#3

Thanks @lukastribus for replying.

  • Packet loss is really low but we do have tcp retransmissions that seem to coincide with the requests that are not connecting quickly in haproxy.
  • Firewalls are all fine, and dmesg’s are clean.
  • tcp_syncookies is enabled.
  • TcpExtTCPBacklogDrop - That doesn’t exist in our netstat -s output. We are running Ubuntu 16.04 here. Is there something similar I should be looking for there?
  • We are running default http-keep-alive, and setting the timeout to: timeout http-keep-alive 1s. What do you mean by hiding the underlying problem?

I have a feeling the timeouts are due to the retransmissions… is there a way to force haproxy to lower that retransmission time?

Edit: would setting the timeout connect to something like 100ms be ok for our LAN network? I understand it would just mask any issues with the network?


#4

More data:

We are seeing TCP retransmissions during the times haproxy has issues connecting. Based on a wireshark dump I see that it’s doing its first TCP retransmit at 1 second, and then another at 3 seconds. So my 3100ms connect timeout seems to be right on.
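The 1 second and 3 second marks are what you would expect if the kernel starts with a 1 second retransmission timer for the SYN and doubles it after each retransmit. A rough Python sketch of that schedule, assuming exactly that behaviour (the initial value and the doubling are assumptions that happen to match the capture, and the function names are made up for illustration):

```python
def syn_send_times(initial_rto_s=1.0, retransmits=5):
    """Seconds after the first SYN at which each SYN is sent.

    Assumes a 1 s initial timer that doubles after every retransmit.
    """
    times = [0.0]
    rto = initial_rto_s
    for _ in range(retransmits):
        times.append(times[-1] + rto)
        rto *= 2
    return times


def syns_within(timeout_connect_s, initial_rto_s=1.0, retransmits=5):
    """How many SYNs go out before a given connect timeout expires."""
    schedule = syn_send_times(initial_rto_s, retransmits)
    return sum(1 for t in schedule if t < timeout_connect_s)


print(syn_send_times(retransmits=3))  # [0.0, 1.0, 3.0, 7.0]
print(syns_within(3.1))               # 3
print(syns_within(0.1))               # 1
```

Under these assumptions a 3100ms connect timeout leaves room for the original SYN plus the two retransmits, while a 100ms timeout gives up after the first SYN and leaves the rest to haproxy's own retries.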

I’m checking with our datacenter to see if they have anything to say… at this point I’m not sure what I can set to fix a shitty network…


#5

Yes, you can play with timeout connect and retries.

However, depending on the type of packet loss issue, you may be making the problem worse. For example: if the network has short moments where it drops all/most of the traffic (as opposed to small and constant packet loss), then you need to make sure that timeout connect * retries is higher than the duration of the network stall. This could be the case if a network policer intervenes, for example.

Otherwise you may run out of retries and then you will actually return an error to the client.
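As a quick sanity check of that rule, here is a small Python sketch. It uses only the simplified timeout connect * retries model from this post and ignores any extra delay haproxy may add between attempts; the function names and the 400 ms stall are hypothetical.

```python
def retry_window_ms(timeout_connect_ms, retries):
    """Approximate time haproxy keeps trying, per the simplified model."""
    return timeout_connect_ms * retries


def survives_stall(timeout_connect_ms, retries, stall_ms):
    """True if the retry window outlasts a network stall of stall_ms."""
    return retry_window_ms(timeout_connect_ms, retries) > stall_ms


# Hypothetical 400 ms stall with a 100 ms connect timeout:
print(survives_stall(100, 3, 400))  # False, retries run out after 300 ms
print(survives_stall(100, 5, 400))  # True, 500 ms window covers the stall
```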

So you probably want to reduce the “timeout connect”, but I also suggest you increase retries at this point in order to not break the request.

Regarding keep-alive, I just mean that if you did not use http-keep-alive, the problem would be worse in that you have more TCP handshakes and teardowns, so you would also have a higher number of connection attempts and therefore retries. The more efficient http-keep-alive is with your backend, the less this problem will affect retries.

So:

  • lower “timeout connect”. On a LAN, or let’s say a <1 ms network, 100 ms should be fine.
  • increase retries to avoid side effects (otherwise the request will fail after 300 ms)
  • consider increasing timeout http-keep-alive to something like 5 - 10 seconds, so that keep-alive is used even more; also:
  • check whether any of the http-reuse options work for you to further increase keep-alive efficiency

These points may reduce the impact of the issue until you can find and fix the root cause. They will definitely increase the actual number of retries though, as you’d retry faster and more often.


#6

Thanks for the advice. I set the timeout connect to 100ms, retries up to 5. After setting that we are still getting the same rate of retries, but obviously they fail faster and the 2nd attempt does connect. Sooooooo it’s masking the issue at this point pretty well.

I did update the timeout which didn’t reduce the retries.

I’m going to investigate the http-reuse options and validate that it will work for our setup.

Also I have our data center spinning up a new server to make sure it’s not some weird hardware issue.

Thanks, I’ll keep this post updated as I find more info.