1.7.11 Compression Issue? Parsing errors on response

Hey

We recently upgraded from 1.7.10 to 1.7.11 and we started to notice some sporadic issues in some of the responses in our application. Specifically some searches via Solr started to fail with what looks to be somewhat garbled responses.

We were at a loss at what was causing this until we took HAProxy out of the equation and issue immediately disappeared. Introducing it back in and then switching off compression also sorted the issue.

So I was just wondering if there was any changes to the compression behavior in 1.7.11? We just use gzip compression and not changed anything on our end. Until this is sorted we’ll revert back to 1.7.10.

Cheers,
R

Can you provide the output of haproxy -vv from both 1.7.10 and the failing 1.7.11, as well as the configuration (without confidential data).

Hey @lukastribus. Here’s the outputs and config:

1.7.11:

HA-Proxy version 1.7.11 2018/04/30
Copyright 2000-2018 Willy Tarreau <willy@haproxy.org>
Build options :
  TARGET  = linux2628
  CPU     = generic
  CC      = gcc
  CFLAGS  = -O2 -g -fno-strict-aliasing -Wdeclaration-after-statement -fwrapv -fno-strict-overflow
  OPTIONS = USE_LINUX_TPROXY=1 USE_ZLIB=1 USE_REGPARM=1 USE_OPENSSL=1 USE_LUA=1 USE_PCRE=1

Default settings :
  maxconn = 2000, bufsize = 16384, maxrewrite = 1024, maxpollevents = 200

Encrypted password support via crypt(3): yes
Built with zlib version : 1.2.7
Running on zlib version : 1.2.7
Compression algorithms supported : identity("identity"), deflate("deflate"), raw-deflate("deflate"), gzip("gzip")
Built with OpenSSL version : OpenSSL 1.0.2k-fips  26 Jan 2017
Running on OpenSSL version : OpenSSL 1.0.2k-fips  26 Jan 2017
OpenSSL library supports TLS extensions : yes
OpenSSL library supports SNI : yes
OpenSSL library supports prefer-server-ciphers : yes
Built with PCRE version : 8.32 2012-11-30
Running on PCRE version : 8.32 2012-11-30
PCRE library supports JIT : no (USE_PCRE_JIT not set)
Built with Lua version : Lua 5.3.4
Built with transparent proxy support using: IP_TRANSPARENT IPV6_TRANSPARENT IP_FREEBIND

Available polling systems :
	  epoll : pref=300,  test result OK
	   poll : pref=200,  test result OK
	 select : pref=150,  test result OK
Total: 3 (3 usable), will use epoll.

Available filters :
		[COMP] compression
		[TRACE] trace
		[SPOE] spoe

1.7.10

HA-Proxy version 1.7.10-a7dcc3b 2018/01/02
Copyright 2000-2018 Willy Tarreau <willy@haproxy.org>

Build options :
  TARGET  = linux2628
  CPU     = generic
  CC      = gcc
  CFLAGS  = -O2 -g -fno-strict-aliasing -Wdeclaration-after-statement -fwrapv
  OPTIONS = USE_LINUX_TPROXY=1 USE_ZLIB=1 USE_REGPARM=1 USE_OPENSSL=1 USE_LUA=1 USE_PCRE=1

Default settings :
  maxconn = 2000, bufsize = 16384, maxrewrite = 1024, maxpollevents = 200

Encrypted password support via crypt(3): yes
Built with zlib version : 1.2.7
Running on zlib version : 1.2.7
Compression algorithms supported : identity("identity"), deflate("deflate"), raw-deflate("deflate"), gzip("gzip")
Built with OpenSSL version : OpenSSL 1.0.2k-fips  26 Jan 2017
Running on OpenSSL version : OpenSSL 1.0.2k-fips  26 Jan 2017
OpenSSL library supports TLS extensions : yes
OpenSSL library supports SNI : yes
OpenSSL library supports prefer-server-ciphers : yes
Built with PCRE version : 8.32 2012-11-30
Running on PCRE version : 8.32 2012-11-30
PCRE library supports JIT : no (USE_PCRE_JIT not set)
Built with Lua version : Lua 5.3.4
Built with transparent proxy support using: IP_TRANSPARENT IPV6_TRANSPARENT IP_FREEBIND

Available polling systems :
	  epoll : pref=300,  test result OK
	   poll : pref=200,  test result OK
	 select : pref=150,  test result OK
Total: 3 (3 usable), will use epoll.

Available filters :
		[COMP] compression
		[TRACE] trace
		[SPOE] spoe

Sanitized Config:

global
    log 127.0.0.1 local2 info
    maxconn 2000
    tune.ssl.default-dh-param 2048
    tune.maxrewrite 16384
    tune.bufsize 32768
    user admin
    group admin
    stats socket /etc/haproxy/haproxysock level admin

defaults
    log global
    mode http
    compression algo gzip
    compression type text/html text/plain text/javascript application/javascript application/xml text/css image/png font/woff font/woff2
    default-server inter 10s init-addr last,libc,none resolve-prefer ipv4
    option httplog
    option dontlognull
    option redispatch
    option forceclose
    option forwardfor
    retries 3
    timeout connect  5000
    timeout client  160000
    timeout server  160000

resolvers serverdns
    nameserver dnsmasq 127.0.0.1:53

frontend em
    bind *:80
    bind *:443 ssl crt /etc/haproxy/server.lab.local.pem

    acl https ssl_fc
    reqadd X-Forwarded-Proto:\ https if https
    redirect scheme https if !https

    default_backend server_backend

backend server_backend
    mode http
    cookie EXTERNALSERVERID insert indirect nocache
    option httpchk /health
    http-request set-header X-Forwarded-Port %[dst_port]
    http-request set-header X-Forwarded-Proto https if { ssl_fc }

    server server1 server.lab.local:8280 cookie server1 check resolvers serverdns ssl ca-file /etc/haproxy/chain.crt

listen stats
    bind *:1936
    mode http
    balance
    timeout client 160000
    timeout connect 5000
    timeout server 160000
    stats uri /
    stats realm HAProxy\ Statistics
    stats admin if TRUE

Can you find out what transaction is affected and maybe even provide a simpler reproducer?

Hey

Yeah this is a really weird one. This was the error we were seeing:

Caused by: org.apache.solr.client.solrj.impl.HttpSolrClient$RemoteSolrException: Error from server at https://server.localdomain.local:443/solr/knowledge: parsing error
       at org.apache.solr.client.solrj.impl.HttpSolrClient.executeMethod(HttpSolrClient.java:553)
       at org.apache.solr.client.solrj.impl.HttpSolrClient.request(HttpSolrClient.java:240)

And an example of the offending response was as follows:

2018-05-25 10:01:36,994 TRACE  << ")PUBLISHED[0xe0]1published_date_dt[0x9][0x0][0x0][0x1]c[0x94]][0xa5][0xf3][0xff][0xc]3UserED.80028_666_-1[0xff][0xf][0x81]?[0xa7][0x1]Our rights are extended and limited by the tools we use. The Internet has magnified our capability for free speech, but has pared down the reasonable expectation of privacy. And we, of course, have [0xff][0x10]#1.0[0xff][0x11](Authored[0xff][0x12]?[\r]0qWq+bj0RHU7ETE/JUD0AuhmHKeMgedfd2lhkgWw7OE=[0xff][0x13][0x8]@E|[0x91][0xa2][0xff][0x1]6iflkdlywww9p8etfi24wk8[0xff][0x2][0xc][0x83]a`[0x8]@=[0xe2]o[0x81][0xb][0xb3][0xff][0x3]6IFlKDlYWww9p8EtFI24wK8[0xff][0x6])KM1000114[0xff][0x7]%en-GB[0xff][0x5]2KnowledgeArticleED[0xff][0xe][0x81]/content_article[0xff][0x9][0x9][0x0][0x0][0x1]c[0x94]a[0x15][0x1b][0xff][0x4]6IFlKDlYWww9p8EtFI24wK8[0xff][0x8]"en[0xff][\r]?.TPG Buys TSL, Owner Of The TES Connec"
2018-05-25 10:01:36,994 TRACE  << "t Education Network, For Around $600M458[0xff][0x14][0x83],product_ipod.topic_extended+region_east[0xff][0xb]3UserED.80028_666_-1[0xff][\n]"

From looking at the above my thinking was that the response was coming back with new lines somehow caused by HAProxy 1.7.11 compression, hence the 2 separate log messages when that response line should’ve been all one message (notice how the last t of the string Owner Of The TES Connect appears at the beginning of the next message). The full response was like the above but over several more lines.

As for a recreation I wasn’t expecting others to do so as this involves my company’s software and a Solr search backend that returns rich text articles along with metadata. When I turned the gzip compression off the issue went away and responses were returned all on one line. So I was more looking for some more understanding of the compression algorithum in 1.7.11, specifically if there had been any changes.

As before, this works fine on 1.7.10 and it only occured for certain articles.

Cheers,
R.

There are some fixes regarding buffers and applets in the 1.7 tree after 1.7.11.

Could you try the latest snapshot in the 1.7 tree (or use git head):
http://www.haproxy.org/download/1.7/src/snapshot/haproxy-ss-20180901.tar.gz

Also, could you try the latest 1.8 stable release? I assume that only 1.7 is affected, otherwise this would have come up more often.

Unfortunately, when we have such hard to reproduce issues, some trial and error cannot be avoided and it makes fixing the bug complicated.

Hey

I’ll look to try the latest snapshot when I can althoughI’m not sure when I can. Maybe in the next week.

I did indeed try the latest 1.8 release at the time and it also didn’t have the issue. However with the 1.8.14 release coming soon (and that other issue fixed) I’ll likely be switching to that once it hit linux repos.

Leave it with me to try with the latest snapshot though, assuming that needs to be built from source.

Yes, if you can, please try latest 1.7 snapshot. I would like to know if this is something that still needs to be fixed, or if it already is fixed. Both 1.6 and 1.7 are still fully supported, so if there is a unfixed bug in there, I’d like to know, confirm and fix it.

Ok will try and so this as soon as I can

1 Like

Apologies for the delay but I had a try of this today and couldn’t get the issue to occur on either snapshot or 1.7.11 weirdly. Our environments have gone through a lot of changes since it happened back in May though so I’ll need to check what the differences are and get back to you.

Ok thanks, we will stand by for more infos from you.

Finally been able to replicate the environments we had before and installed the given latest version (shown below) and I can confirm the issue does still occur with our product.

/usr/local/sbin/haproxy -v
HA-Proxy version 1.7.11-3953bfd 2018/08/31
Copyright 2000-2018 Willy Tarreau <willy@haproxy.org>

1.7.10 continues to be fine along with the recently released 1.8.14.

Thanks, could you try the following patch (applying it to 1.7.11 or the snapshot - either way is fine):

https://pastebin.com/dl/zJvRs7sJ

It reverts commit 6917b734 (“BUG/MEDIUM: stream-int: Don’t loss write’s notifs when a stream is woken up”), which has been backported to 1.7.11 and I think it could be the one (there are not that many changes in 1.7.11, especially those with impact on compression).

Hey. I applied that patches changes to the haproxy-ss-20180901 tarball and built it but sadly it didn’t remove the problem. Solr still complaining about parsing errors.

Annoying, was hoping it would’ve been the one. So weird, especially since 1.8 is totally fine.

Cheers,
R.

Mmmh ok.

I think it would be best to use git bisect on your side here, we would be able to pinpoint the exact commit easily, but it does requires you spend some time in a few compilation/test cycles.

You’d simply clone the 1.7 repository, start a bisect, mark 1.7.10 good and 1.7.11 bad and then recompile from the suggestions given by git:

git clone http://git.haproxy.org/git/haproxy-1.7.git/
cd haproxy-1.7
git bisect start
git bisect good v1.7.10
git bisect bad v1.7.11

recompile haproxy and test (make clean before your standard make call):
make clean; make TARGET=[...]

test it, if you hit the bug either call:
git bisect bad

or if it worked fine:
git bisect good

This will iterate through the changes between 1.7.10 and 1.7.11 and in the end would give your a bad commit, which is the exact change that caused the issue.

Don’t forget to make clean before each call to make (even when you just applied a single patch).

Alrighty, will give it a go!

1 Like

Well that was wizard. What a cool Git feature!

Here’s the result:

8066ccd3984adb3ceaf63a4d80047f3da4511f13 is the first bad commit
commit 8066ccd3984adb3ceaf63a4d80047f3da4511f13
Author: Christopher Faulet <cfaulet@haproxy.com>
Date:   Fri Feb 2 15:54:15 2018 +0100

BUG/MEDIUM: http: Switch the HTTP response in tunnel mode as earlier as possible

When the body length is undefined (no Content-Length or Transfer-Encoding
headers), The reponse remains in ending mode, waiting the request is done. So,
most of time this is not a problem because the resquest is done before the
response. But when a client sends data to a server that replies without waiting
all the data, it is really not desirable to wait the end of the request to
finish the response.

This bug was introduced when the tunneling of the request and the reponse was
refactored, in commit 4be980391 ("MINOR: http: Switch requests/responses in
TUNNEL mode only by checking txn flag").

This patch should be backported in 1.8 and 1.7.

(cherry picked from commit fd04fcf5edb0a24cd29ce8f4d4dc2aa3a0e2e82c)
Signed-off-by: Willy Tarreau <w@1wt.eu>
(cherry picked from commit 8a5949f2d74c3a3a6c6da25449992c312b183ef3)
Signed-off-by: Christopher Faulet <cfaulet@haproxy.com>

:040000 040000 1d1425ab79b4f083ef563601d95f56f46d90513e deab1b91af05135cfee186bbb0a22acbf1c2d886 M      src

Great, thanks.

Pinging @capflam to see if he has any idea.

Hi,

Thanks @lukastribus for the notification :slight_smile:

@Ryuzaki, could you provide a network capture of a bad response between HAProxy and your server and between the client and HAPRoxy ? I’m a bit puzzled by the commit where the bug seems to be. Because it concerns responses with undefined body length. But for these responses, there is no compression.

Hey

It is weird I agree. I assumed it was compression as the issue did go away when I switched that config off when first investigating a few months ago. I’ll double check today if that did indeed fix the problem or if I was chasing a red herring.

A network capture may take some time. This is a company product so I’ll need to check if I’m always to share a network capture.

Cheers,
R.