Intermittent 504 errors and sR-- after upgrade to 1.7.10

Hi,

Since upgrading from 1.7.9 to 1.7.10 I’m getting intermittent 504 errors with an sR-- session termination state. Downgrading back to 1.7.9 fixes the issue. The log for those requests looks like:
%tsc sR--
%TR 0
%Tw 0
%Tc 5
%Tr -1
%Ta 5
%Th 0
%Ti 5874
%B 194
%ac 2
%fc 1
%bc 1
%sc 0
%rc 0
%sq 0
%bq 0

From the point of view of the app on the backend, the request was processed and the answer sent back in 13ms.

Full haproxy config:

global
  node apne-proxy-1c
  log-send-hostname apne-proxy-1c
  log         127.0.0.1 local5
  chroot      /var/lib/haproxy
  pidfile     /var/run/haproxy.pid
  maxconn     4000
  user        haproxy
  group       haproxy
  daemon

  # Turn on stats unix socket
  stats socket /var/lib/haproxy/stats mode 750 level admin
  stats timeout 30s
  server-state-base /var/state/haproxy/
  server-state-file global

  # Tune TLS
  tune.ssl.default-dh-param 2048
  ssl-default-bind-ciphers ECDHE-RSA-AES256-GCM-SHA384:ECDHE-RSA-AES256-SHA384:ECDHE-RSA-AES256-SHA:ECDHE-RSA-AES128-GCM-SHA256:ECDHE-RSA-AES128-SHA
  ssl-default-bind-options no-sslv3 no-tls-tickets
  ssl-default-server-options no-tls-tickets
  # The backend servers have self-signed certs
  ssl-server-verify none

  # Tune compression
  maxcompcpuusage 50
  tune.comp.maxlevel 3

  # Tune buffering to accommodate voice + client_cert
  tune.bufsize 32768
  tune.maxrewrite 4096

defaults
  mode     http
#  log      global
  log         127.0.0.1 format rfc5424 local4
  option   httplog
  log-format §apne-proxy-1c§%ci§%cp§%tr§%HM§%[capture.req.hdr(0)]§%HP§%HQ§%HV§%CC§%CS§%[capture.req.hdr(2)]§%[capture.res.hdr(0)]§%ST§%[capture.req.hdr(1)]§%ID§%sslc§%sslv§%[ssl_fc_sni]§%ft§%b§%s§%tsc§%TR,%Tw,%Tc,%Tr,%Ta,%Th,%Ti,%Ts,%B,%ac,%fc,%bc,%sc,%rc,%sq,%bq
  option   log-separate-errors
  option   dontlognull
  option   http-keep-alive
  option   forwardfor
  # Slow POST protection
  option   http-buffer-request
  http-reuse  safe
  load-server-state-from-file global
  retries  3
  timeout  http-request    10s
  timeout  queue           1m
  timeout  connect         10s
  timeout  client          1m
  timeout  server          1m
  timeout  http-keep-alive 10s
  timeout  check           10s
  maxconn  3000
  # Enable compression
  compression algo gzip
  compression type text/html text/plain text/css text/javascript application/javascript application/json application/x-javascript application/xhtml+xml application/xml application/rss+xml application/atom+xml image/svg+xml

peers mypeers
  peer apne-proxy-1b 10.29.0.75:1024
  peer apne-proxy-1c 10.29.1.247:1024

backend idx-backend
  balance roundrobin

  # The sessions to the admin portal are sticky
  stick-table type string len 32 size 1M expire 600s peers mypeers
  stick on req.cook(idx)
  stick store-response res.cook(idx)

  # Health Check
  option  httpchk GET /DaonIdentityX/Monitor/View
  http-check  expect rstring Overall\ Status:\ (OK|WARN)


  # Servers
  server apne-idx-1b apne-idx-1b.local:8443 ssl check inter 3000
  server apne-idx-1c apne-idx-1c.local:8443 ssl check inter 3000
  server apne-idx-1b-02 apne-idx-1b-02.local:8443 ssl check inter 3000
  server apne-idx-1c-02 apne-idx-1c-02.local:8443 ssl check inter 3000


frontend  idx-direct-server-auth-frontend-apne
  # Regular TLS
  bind  10.29.1.247:443 ssl crt /etc/haproxy/tls/example.com.pem

  # tcp current connections, tcp frequency counter, http frequency counter, http err frequency counter
  stick-table type ip size 1m expire 30s store gpc0,conn_cur,conn_rate(3s),http_req_rate(10s),http_err_rate(30s) peers mypeers

  # Track the connections
  tcp-request connection track-sc1 src

  # No tcp restrictions
  acl no_tcp_restrictions src -f /etc/haproxy/lists/no_tcp_restrictions.txt
  # Allowed IPs
  acl trusted_ips src -f /etc/haproxy/lists/trusted_ips.txt

  # No TCP connection restrictions from inside (RP servers and stuff)
  tcp-request connection accept if no_tcp_restrictions or trusted_ips

  # Reject over 50 concurrent connections per ip
  tcp-request connection reject if { src_conn_cur ge 50 }
  # Reject if more than 30 connections in 3 seconds
  tcp-request connection reject if { src_conn_rate ge 30 }

  # No TCP content restrictions from inside (RP servers and stuff)
  tcp-request content accept if no_tcp_restrictions or trusted_ips

  # Reject more than 30 http requests in 10 seconds
  tcp-request content reject if { src_http_req_rate ge 30 }
  # Reject if more than 10 http errors in 30 seconds
  tcp-request content reject if { src_http_err_rate ge 10 }

  # Cert pinning
  rspadd Strict-Transport-Security:\ max-age=31536000;\ includeSubDomains
  rspadd Public-Key-Pins:\ pin-sha256=\"enWPCYJtvn0yn99pyDJyBRNyw/iBUohykJd9EcFs+Js=\";\ pin-sha256=\"Sv0HHas+e1t1pseiclwN8WZ1i0h4w1/UxerXa2MKG0o=\";\ max-age=15768000;\ includeSubDomains

  # Hide the server info
  # 1.7.x bug
  #rspidel ^Server:.*$
  rspirep ^Server:.*$ Server:\ Hidden

  # Delay to catch L7 info
  tcp-request inspect-delay 5s
  # Capture the host header
  capture request header Host len 64
  # Capture the user agent header
  capture request header User-Agent len 128
  # Capture the SOAP action header
  capture request header SOAPAction len 128
  # Capture the REST correlation ID  header
  capture response header X-Idx-RequestReference len 64
  # Capture the idx cookie
  capture cookie idx len 32

  # Monitor uri
  acl site_dead nbsrv(idx-backend) eq 0
  monitor-uri /test
  monitor fail  if site_dead

  # Check the host
  acl correct_host hdr_reg(host) -i ^apne.example.com(:443)?$
  tcp-request content reject unless correct_host

  # IdX endpoints
  acl admin_portal path_reg ^(/[a-zA-Z0-9-]+)?/IdentityXServices
  acl monitor path_beg /DaonIdentityX/Monitor
  acl favicon path /favicon.ico

  # Allow the Admin portal
  http-request allow if admin_portal
  # Allow the monitor page from trusted IPs only
  http-request allow if monitor trusted_ips
  # Allow the favicon
  http-request allow if favicon

  # The default deny
  http-request deny

  # Backend
  default_backend idx-backend

I’m not sure whether I’m doing something wrong and was relying on a bug that got fixed in 1.7.10, or whether a bug was introduced in 1.7.10 :slight_smile:

Thanks,
Victor

I assume only a small percentage of requests are affected? Can you reproduce this easily or do you see this only with production traffic? Can you provide a tcpdump of a backend session with this problem?
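
Something like this on the proxy host should capture a backend session (the interface name is a placeholder, and the host filter assumes one of the backend servers from your config):

$ tcpdump -i eth0 -s0 -w backend.pcap host apne-idx-1b.local and port 8443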

If this is hard to reproduce, are you able to revert some of the patches on top of 1.7.10?

If yes, try reverting them one at a time, in this sequence:

Or, if you are familiar with git bisecting, that would also be a possibility.

Hi Lukas,

Yes, only a small percentage of requests is affected, and only in production, so it’s quite tricky to reproduce.
I can build a package with the reverted patches. Do you want me to test after each revert, or do you mean to revert all of them and test?

Thanks,
Victor

I meant just reverting a single patch for each build. Actually though, we’d probably better start with a proper git bisect right away, because I’m not really confident the changes above have anything to do with this problem, and I would probably just be wasting your time.

Git bisecting will take some time as well; you will probably have to test about ~5 builds until we get to the change causing this issue. To start the bisect you’d clone haproxy-1.7 from the git repositories, start the bisect, and mark 1.7.9 as “good” and 1.7.10 as “bad”. After that, you build haproxy from source, try it, and once you know the status you just type “git bisect good” or “git bisect bad”, and git will pick the best next commit to try.

$ git clone http://git.haproxy.org/git/haproxy-1.7.git/ haproxy-1.7
$ cd haproxy-1.7
~/haproxy-1.7$ git bisect start
~/haproxy-1.7$ git bisect good v1.7.9
~/haproxy-1.7$ git bisect bad v1.7.10
Bisecting: 27 revisions left to test after this (roughly 5 steps)
[e6ee48f711cdc9f56b6f97d8804307570ea757fe] MINOR: server: Handle weight increase in consistent hash.
~/haproxy-1.7$ make [...]
[...]
~/haproxy-1.7$ git bisect good   # or: git bisect bad

But before you go down that road, can you try to disable compression? The filter infrastructure that compression uses may be the root cause of this issue, and we may be able to shortcut the bisect if we know this only affects compression.
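
If I read your config correctly, that just means commenting out the two compression directives in your defaults section:

defaults
  # Compression disabled while we debug the sR-- issue
  #compression algo gzip
  #compression type text/html text/plain text/css [...]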

Sure, I’ve disabled compression. Will monitor the logs for 504/sR--.
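
For the record, I’m watching for it with something like this (the log path comes from my rsyslog setup, so it will differ elsewhere):

$ tail -f /var/log/haproxy/errors.log | grep -F 'sR--'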

The weekend is not the best time to test this, but still: no sR-- since I disabled compression on Friday morning.

Hi,

I’m still trying to understand where the bug could come from, but it puzzles me. There is only one way to get a 504 error: a read timeout is reached while parsing the response headers. In that case, %Tr is always “-1”, so I’d be tempted to think this is what you observe. But then the session termination state should be sH-- and not sR--.

I need more information to move on. The “R” stands for an error caught while HAProxy is waiting for a complete, valid REQUEST from the client, so it should not be possible to get a 504 in this case. And I definitely don’t understand the link with the compression filter.

As Lukas said, it may be a good idea to find the commit responsible for your problem with a bisect. It would also be interesting to isolate the client request and the server response when the problem happens.

I’m also seeing this problem: intermittent 504 responses with an sR-- code since upgrading to 1.7.10. Are there any Ubuntu Trusty packages available for 1.7.9 so that I can try downgrading? The PPA from https://haproxy.debian.net/ removes old packages, so only 1.7.10 is currently available.

No, I’m not aware of old binary packages.
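
Building 1.7.9 from source is straightforward though. Roughly (the make flags here are illustrative; check which features the Debian package enables and match them):

$ wget http://www.haproxy.org/download/1.7/src/haproxy-1.7.9.tar.gz
$ tar xzf haproxy-1.7.9.tar.gz && cd haproxy-1.7.9
$ make TARGET=linux2628 USE_OPENSSL=1 USE_ZLIB=1 USE_PCRE=1
$ sudo make install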

As Christopher said, we need more information in order to fix this. Please confirm whether disabling compression fixes the issue, and please try to bisect it. Unless we know the offending commit or are able to reproduce the issue, a fix will be next to impossible.

Unfortunately I can’t test this in prod, and so far I’ve been unable to reproduce it in a simulated environment…
@gravitystorm, do you have any unusual settings in your haproxy config: compression, header manipulation, etc.?

Not sure if this helps in any way: https://serverfault.com/questions/876021/intermittent-504-errors-with-haproxy

Thanks @Victor for the link. It’s interesting, because the problem seems to exist even without compression, and increasing the timeouts seems to hide the bug. I will try to figure out why and how this is related.
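
For anyone who needs a stopgap in the meantime, the workaround from that thread is presumably just a larger server-side timeout in the defaults section (value illustrative; this hides the bug, it does not fix it):

defaults
  timeout server 5m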

Ok guys, I think I found the bug. The fix was pushed in 1.8 to address an HTTP/2 issue; because its impact on 1.7 was not clear, it was not backported. But clearly, you found a way to hit the bug.

The fix is here: http://git.haproxy.org/?p=haproxy.git;a=commit;h=c5a9d5bf

I will see with Willy about backporting it soon. Many thanks!

Great, looking forward to 1.7.11. I’m ok with building 1.7.10 + the backported patch. Thanks!

@Victor, thanks. It would be nice to validate the fix, yes.

Hi @Victor. Did you have any time to test the patch and validate the fix?

Sorry Christopher, I assumed the patch was for 1.8 and would need a bit of porting to 1.7…
I’ve just built a fresh rpm with the patch and will deploy it soon. I guess we’ll need another few quiet days before we can hopefully declare victory.
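
For reference, fetching and applying it went roughly like this (a sketch: gitweb’s a=patch view of the commit Christopher linked should return the raw patch, and some fuzz or light porting is possible on the 1.7 tree):

$ curl -o fix.patch 'http://git.haproxy.org/?p=haproxy.git;a=patch;h=c5a9d5bf'
$ cd haproxy-1.7.10
$ patch -p1 < fix.patch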

No pb. Feedback is important and really valuable on this kind of issue :slight_smile: I’d rather wait a few more days than push the wrong fix.

Thanks

Ok, the patched 1.7.10 is up and running. I’ll monitor for sR--.

Hi @Victor. Any news? What about sR--?

Hi Christopher, no sR-- since the upgrade to the patched 1.7.10, so I guess the patch is OK.
Thanks for the help!