Haproxy 1.8.4 segfault at 8 ip 0000000000475cb2


#1

I have a problem with haproxy in multiprocess mode, when i enable h2 with directive " alpn h2,http/1.1 " the haproxy randomly thorow segfault error and stop working . I have more than 10 servers with this setup and all have this same problem.

[1811922.431684] haproxy[19168]: segfault at 8 ip 0000000000475cb2 sp 00007ffe0d869a20 error 4 in haproxy[400000+12b000]
[1816536.073834] haproxy[46776]: segfault at 8 ip 0000000000475cb2 sp 00007ffd8bc48f40 error 4 in haproxy[400000+12b000]

Haproxy has been comiled with :

ibkadm5-1.15.1-8.el7.x86_64,libsepol-devel-2.5-6.el7.x86_64,pcre-devel-8.32-17.el7.x86_64,libselinux-devel-2.5-11.el7.x86_64,libcom_err-devel-1.42.9-10.el7.x86_64,libverto-devel-0.2.5-4.el7.x86_64,keyutils-libs-devel-1.5.8-3.el7.x86_64,krb5-devel-1.15.1-8.el7.x86_64, zlib-devel-1.2.7-17.el7.x86_64
,openssl-devel-1.0.2k-8.el7.x86_64,glibc-2.17-196.el7_4.2.x86_64

I’ ve comiled haroxy with debug symbols and enabled coredump
This is full backtrace from gdb :

bt full
#0 0x0000000000475cb2 in h2_process_mux (h2c=0x2495650) at src/mux_h2.c:2107
h2s = 0x38ff770
h2s_back = 0x2c49f30
#1 h2_send (conn=) at src/mux_h2.c:2225
flags =
h2c = 0x2495650
done =
#2 0x00000000004c65b2 in conn_fd_handler (fd=) at src/connection.c:128
conn = 0x2acb6c0
flags = 0
#3 0x00000000004d4cd3 in fd_process_cached_events () at src/fd.c:271
fd = 2119
entry = 0
e =
#4 0x0000000000486c9f in run_poll_loop () at src/haproxy.c:2425
next =
exp =
#5 run_thread_poll_loop (data=data@entry=0x1db8620) at src/haproxy.c:2460
ptif = 0x734110 <per_thread_init_list>
ptdf =
start_lock = 0
#6 0x000000000040a1db in main (argc=, argv=) at src/haproxy.c:3049
tids = 0x1db8620
threads = 0x1da2f60
i =
err =
retry =
limit = {rlim_cur = 500144, rlim_max = 500144}
errmsg = “\000\000\000\000w”, ‘\000’ <repeats 11 times>, "O\001\062\036\376\177\000\000\240fP\000\000\000\000\000\v\330P\000\000\000\000\000\v\330P\000\000\000\000\000|\000\000\000\000\000\000\000g\371\324f\177\000\000\340ӯ\001\000\000\000\000g\371\324f\177\000\000 ˯\001\000\000\000\000pk\001\000\000\000\000\000\200\000\000"
pidfd =

HA-Proxy version 1.8.4-1deb90d 2018/02/08
Copyright 2000-2018 Willy Tarreau willy@haproxy.org

Build options :
TARGET = linux2628
CPU = generic
CC = gcc
CFLAGS = -O2 -g -fno-strict-aliasing -Wdeclaration-after-statement -fwrapv -Wno-unused-label -DTCP_USER_TIMEOUT=18
OPTIONS = USE_LINUX_TPROXY=1 USE_ZLIB=1 USE_OPENSSL=1 USE_PCRE=1

Default settings :
maxconn = 2000, bufsize = 16384, maxrewrite = 1024, maxpollevents = 200

Built with OpenSSL version : OpenSSL 1.0.2k-fips 26 Jan 2017
Running on OpenSSL version : OpenSSL 1.0.2k-fips 26 Jan 2017
OpenSSL library supports TLS extensions : yes
OpenSSL library supports SNI : yes
OpenSSL library supports : SSLv3 TLSv1.0 TLSv1.1 TLSv1.2
Built with transparent proxy support using: IP_TRANSPARENT IPV6_TRANSPARENT IP_FREEBIND
Encrypted password support via crypt(3): yes
Built with multi-threading support.
Built with PCRE version : 8.32 2012-11-30
Running on PCRE version : 8.32 2012-11-30
PCRE library supports JIT : no (USE_PCRE_JIT not set)
Built with zlib version : 1.2.7
Running on zlib version : 1.2.7
Compression algorithms supported : identity(“identity”), deflate(“deflate”), raw-deflate(“deflate”), gzip(“gzip”)
Built with network namespace support.

Available polling systems :
epoll : pref=300, test result OK
poll : pref=200, test result OK
select : pref=150, test result OK
Total: 3 (3 usable), will use epoll.

Available filters :
[SPOE] spoe
[COMP] compression
[TRACE] trace


#2

Please confirm that you are using mutliprocess mode (nbproc), not threading (nbthread).

There are a few bugfixes related to threading and H2 after v1.8.4. Can you try the current 1.8 tree?

You can directly clone from git here:
http://git.haproxy.org/git/haproxy-1.8.git/

Or use the tarball with the snapshot from February 27th:
http://www.haproxy.org/download/1.8/src/snapshot/haproxy-ss-20180227.tar.gz

If it still crashes we are gonna need @willy to take a look at this stacktrace.


#3

I use multiprocess mode, this is my global config section :

global
log /dev/log local0
log /dev/log local1 notice
user haproxy
group haproxy
master-worker
stats socket /var/lib/haproxy/stats1 level admin process 1
stats socket /var/lib/haproxy/stats2 level admin process 2
stats socket /var/lib/haproxy/stats3 level admin process 3
stats socket /var/lib/haproxy/stats4 level admin process 4
stats socket /var/lib/haproxy/stats5 level admin process 5
stats socket /var/lib/haproxy/stats6 level admin process 6
stats socket /var/lib/haproxy/stats7 level admin process 7
stats socket /var/lib/haproxy/stats8 level admin process 8

ssl-default-bind-options no-sslv3 prefer-client-ciphers
ssl-default-server-options  no-sslv3
ssl-default-bind-ciphers ECDHE-RSA-AES12.....................

nbproc 8
tune.ssl.default-dh-param   4096

Ok, I try the current tree


#4

Hi, i’ve compiled current tree and problem still occurs

(gdb) bt full
#0 0x0000000000475ec2 in h2_process_mux (h2c=0x292b8f0) at src/mux_h2.c:2107
h2s = 0x31f5b90
h2s_back = 0x3e61e68
#1 h2_send (conn=) at src/mux_h2.c:2225
flags =
h2c = 0x292b8f0
done =
#2 0x00000000004c67d2 in conn_fd_handler (fd=) at src/connection.c:128
conn = 0x2dad2f0
flags = 0
#3 0x00000000004d4ef3 in fd_process_cached_events () at src/fd.c:271
fd = 2144
entry = 0
e =
#4 0x0000000000486ebf in run_poll_loop () at src/haproxy.c:2426
next =
exp =
#5 run_thread_poll_loop (data=data@entry=0x24eb700) at src/haproxy.c:2461
ptif = 0x734110 <per_thread_init_list>
ptdf =
start_lock = 0
#6 0x000000000040a1db in main (argc=, argv=) at src/haproxy.c:3050
tids = 0x24eb700
threads = 0x24610f0
i =
err =
retry =
limit = {rlim_cur = 500145, rlim_max = 500145}
errmsg = “\000\000\000\000w”, ‘\000’ <repeats 11 times>, "/G%\203\377\177\000\000iP\000\000\000\000\000\313\332P\000\000\000\000\000\313\332P\000\000\000\000\000|\000\000\000\000\000\000\000GP\b\223\177\000\000\340S\031\002\000\000\000\000`GP\b\223\177\000\000 K\031\002\000\000\000\000pk\001\000\000\000\000\000\200\000\000"
pidfd =


#5

@willy please review this trace coming from the H2 code.


#6

I’m currently out of any idea, it’s the very first time we get a report at this location and it’s unclear to me what is not valid there : h2s->cs->data_cb->send(h2s->cs);

@sawekc, could you please type this under gdb :

p *h2s
p *h2s->cs

I suspect that a pointer is corrupted but I don’t understand how. My guess would be that cs was freed without being destroyed, though I don’t see where this is possible.


#7

(gdb) p *h2s
$1 = {cs = 0x2d3d1d0, h2c = 0x292b8f0, req = {state = HTTP_MSG_RQBEFORE, status = 0, flags = 0, curr_len = 0, body_len = 0, err_pos = 0, err_state = 0}, res = {state = HTTP_MSG_BODY, status = 200, flags = 1, curr_len = 532003, body_len = 0, err_pos = 0, err_state = 0}, by_id = {node = {branches = {b = {
0x476d20 <h2_timeout_task>, 0x3c97150}}, node_p = 0x3458fd1, leaf_p = 0x47265e1, bit = 8977, pfx = 964}, key = 0}, list = {n = 0x3e61ee0, p = 0xf4f2b4f402920004}, id = -185420556, flags = 12293, mws = 1, errcode = H2_ERR_NO_ERROR, st = H2_SS_IDLE}
(gdb) p *h2s->cs
$2 = {obj_type = 208, conn = 0x292b8f0, flags = 0, data = 0x0, data_cb = 0x0, ctx = 0x0


#8

Thanks! No data layer??? I have to scratch my head to figure how this can happen now.


#9

In fact h2s->cs looks very similar to h2s! obj_type 208 doesn’t exist and precisely matches 0xd0 in the cs. Then the conn, which is the second pointer, is in fact a pointer to the h2c (assuming the h2s is correct), which is not granted yet given some other elements. But res looks valid thus it tends to imply that the first part could possibly be as well.

Then it would mean that h2s->cs is wrong and pointing to h2s instead. I’m not seeing anywhere in the code where this could accidently happen, and the h2s and conn_stream struct are sufficiently different in size not to be merged into the same pool.

I’d be interested in a few extra commands :

p h2s
p cs
x/152xb h2s
x/48xb h2s->cs

To be honnest at this point I’m starting to suspect that an H2 stream was either deleted without being removed from the list, or was inserted while it was already there, breaking the send list. I’m not seeing any matching places for now.


#10

(gdb) p h2s
$3 = (struct h2s *) 0x31f5b90
(gdb) p cs
No symbol “cs” in current context.
(gdb) x/152xb h2s
0x31f5b90: 0xd0 0xd1 0xd3 0x02 0x00 0x00 0x00 0x00
0x31f5b98: 0xf0 0xb8 0x92 0x02 0x00 0x00 0x00 0x00
0x31f5ba0: 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00
0x31f5ba8: 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00
0x31f5bb0: 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00
0x31f5bb8: 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00
0x31f5bc0: 0x1b 0x00 0xc8 0x00 0x01 0x00 0x00 0x00
0x31f5bc8: 0x23 0x1e 0x08 0x00 0x00 0x00 0x00 0x00
0x31f5bd0: 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00
0x31f5bd8: 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00
0x31f5be0: 0x20 0x6d 0x47 0x00 0x00 0x00 0x00 0x00
0x31f5be8: 0x50 0x71 0xc9 0x03 0x00 0x00 0x00 0x00
0x31f5bf0: 0xd1 0x8f 0x45 0x03 0x00 0x00 0x00 0x00
0x31f5bf8: 0xe1 0x65 0x72 0x04 0x00 0x00 0x00 0x00
0x31f5c00: 0x11 0x23 0xc4 0x03 0x00 0x00 0x00 0x00
0x31f5c08: 0xe0 0x1e 0xe6 0x03 0x00 0x00 0x00 0x00
0x31f5c10: 0x04 0x00 0x92 0x02 0xf4 0xb4 0xf2 0xf4
0x31f5c18: 0xf4 0xb4 0xf2 0xf4 0x05 0x30 0x00 0x00
0x31f5c20: 0x01 0x00 0x00 0x00 0x00 0x00 0x00 0x00
(gdb) x/48xb h2s->cs
0x2d3d1d0: 0xd0 0x41 0x8b 0x04 0x00 0x00 0x00 0x00
0x2d3d1d8: 0xf0 0xb8 0x92 0x02 0x00 0x00 0x00 0x00
0x2d3d1e0: 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00
0x2d3d1e8: 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00
0x2d3d1f0: 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00
0x2d3d1f8: 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00


#11

Thank you. So it’s clear that something happened, causing the h2s to be copied over the cs, but for now I cannot spot how this is even possible.


#12

@slawekc, even after carefully reading the code I’m not seeing anything that could explain this. However, Christopher recently fixed a bug that we initially underestimated :

commit 929b52d8a1a5d3271fa8b92b4a6e0f148e3a1d42
Author: Christopher Faulet <cfaulet@haproxy.com>
Date:   Mon Feb 26 13:43:38 2018 +0100

BUG/MINOR: h2: Set the target of dbuf_wait to h2c

In h2_get_dbuf, when the buffer allocation was failing, dbuf_wait.target was
errornously set to the connection (h2c->conn) instead of the h2 connection
descriptor (h2c).

This patch must be backported to 1.8.

It’s not immediately clear to me whether it can cause this or not, but the relation between the two is striking. This patch is already in latest 1.8, so I’d like that you test 1.8.5 once it’s out to see if we can consider this issue resolved or not.


#13

Ok, I try this patch, thx


#14

Hi, i’ve test the new version (1.8.5) and problem stilll occurs ,
Currently segfault message :

[3952008.175803] haproxy[17651]: segfault at 8 ip 0000000000475f52 sp 00007ffed26ce790 error 4 in haproxy[400000+12c000]

Backtrace:

> Core was generated by `/usr/sbin/haproxy -f /etc/haproxy/haproxy.cfg -p /run/haproxy.pid'.
> Program terminated with signal 11, Segmentation fault.
> #0  0x0000000000475f52 in h2_process_mux (h2c=0x3085880) at src/mux_h2.c:2125
> 2125                            h2s->cs->data_cb->send(h2s->cs);
> Missing separate debuginfos, use: debuginfo-install glibc-2.17-196.el7_4.2.x86_64 keyutils-libs-1.5.8-3.el7.x86_64 krb5-libs-1.15.1-8.el7.x86_64 libcom_err-1.42.9-10.el7.x86_64 libselinux-2.5-11.el7.x86_64 nss-softokn-freebl-3.28.3-8.el7_4.x86_64 openssl-libs-1.0.2k-8.el7.x86_64 pcre-8.32-17.el7.x86_64 zlib-1.2.7-17.el7.x86_64
> (gdb) bt full
> #0  0x0000000000475f52 in h2_process_mux (h2c=0x3085880) at src/mux_h2.c:2125
>         h2s = 0x29f1710
>         h2s_back = 0x3085890
> #1  h2_send (conn=<optimized out>) at src/mux_h2.c:2243
>         flags = <optimized out>
>         h2c = 0x3085880
>         done = <optimized out>
> #2  0x00000000004c68c2 in conn_fd_handler (fd=<optimized out>) at src/connection.c:128
>         conn = 0x2405dd0
>         flags = 0
> #3  0x00000000004d5303 in fd_process_cached_events () at src/fd.c:271
>         fd = 245
>         entry = 0
>         e = <optimized out>
> #4  0x0000000000486f7f in run_poll_loop () at src/haproxy.c:2426
>         next = <optimized out>
>         exp = <optimized out>
> #5  run_thread_poll_loop (data=data@entry=0x14cb190) at src/haproxy.c:2461
>         ptif = 0x734110 <per_thread_init_list>
>         ptdf = <optimized out>
>         start_lock = 0
> #6  0x000000000040a1d3 in main (argc=<optimized out>, argv=<optimized out>) at src/haproxy.c:3050
>         tids = 0x14cb190
>         threads = 0x15659d0
>         i = <optimized out>
>         err = <optimized out>
>         retry = <optimized out>
>         limit = {rlim_cur = 500145, rlim_max = 500145}
>         errmsg = "\000\000\000\000w", '\000' <repeats 11 times>, "\357%\306c\375\177\000\000\226mP\000\000\000\000\000\v\337P\000\000\000\000\000\v\337P\000\000\000\000\000|\000\000\000\000\000\000\000`'%$N\177\000\000\340\263\037\001\000\000\000\000`'%$N\177\000\000 \253\037\001\000\000\000\000pk\001\000\000\000\000\000\200\000\000"
>         pidfd = <optimized out>
> 
> 
> (gdb) p *h2s
> $1 = {cs = 0x1f11720, h2c = 0x3085880, req = {state = HTTP_MSG_RQBEFORE, status = 0, flags = 0, curr_len = 0, body_len = 0, err_pos = 0, err_state = 0}, res = {state = HTTP_MSG_BODY, status = 200, flags = 1, curr_len = 6253, body_len = 38751, err_pos = 0, err_state = 0}, by_id = {node = {branches = {b = {
>           0x456c00 <process_stream>, 0x27f98f0}}, node_p = 0x0, leaf_p = 0x0, bit = 0, pfx = 0}, key = 23}, list = {n = 0x3085908, p = 0x3085908}, id = 23, flags = 12293, mws = 12550414, errcode = H2_ERR_CANCEL, st = H2_SS_CLOSED}
> (gdb) p *h2s->cs
> $2 = {obj_type = 128, conn = 0x23353b1, flags = 0, data = 0x0, data_cb = 0x4, ctx = 0x1}

Also, i do some test, i’ve reduce weight for one of the haproxy node in the cluster and then node who less traffice than other nodes then this node have less segfault errors .


#15

Thank you for the report and sorry that the issue is still there :frowning:

I’ve identified a potential race that may happen with H2 on connection timeouts only when threads are used. I’ll work on it now but at the moment I don’t see how it could manifest itself to cause this. So we definitely have several errors for now :-/

I’m continuing to dig. Thanks!


#16

Just FYI, I found the cause and now have a fix for it. I’m now trying to verify the h2+threads race I think I identified, then to fix it, and I’ll issue 1.8.6 with all this fixed.


#17

If this can help, today I’ve got the same segfault with haproxy 1.8.4:

Apr 10 17:14:35 haproxy1 kernel: haproxy[5773]: segfault at 8 ip 00007f85025af3e8 sp 00007f84f9437a80 error 4 in haproxy[7f850249f000+16e000]
Apr 10 17:14:36 haproxy1 haproxy: [ALERT] 098/134312 (5763) : Current worker 5764 exited with code 139
Apr 10 17:14:36 haproxy1 haproxy: [ALERT] 098/134312 (5763) : exit-on-failure: killing every workers with SIGTERM
Apr 10 17:14:36 haproxy1 haproxy: [WARNING] 098/134312 (5763) : All workers exited. Exiting… (139)
Apr 10 17:14:36 haproxy1 systemd: haproxy.service: main process exited, code=exited, status=139/n/a
Apr 10 17:14:36 haproxy1 systemd: Unit haproxy.service entered failed state.
Apr 10 17:14:36 haproxy1 systemd: haproxy.service failed.

I’m using:

nbproc 1
nbthread 12
cpu-map auto:1/1-12 0-11
alpn h2,http/1.1

The haproxy info is:

HA-Proxy version 1.8.4-1deb90d 2018/02/08
Copyright 2000-2018 Willy Tarreau willy@haproxy.org

Build options :
TARGET = linux2628
CPU = generic
CC = gcc
CFLAGS = -O2 -g -fno-strict-aliasing -Wdeclaration-after-statement -fwrapv -Wno-unused-label
OPTIONS = USE_LINUX_TPROXY=1 USE_ZLIB=1 USE_REGPARM=1 USE_OPENSSL=1 USE_LUA=1 USE_SYSTEMD=1 USE_PCRE=1

Default settings :
maxconn = 2000, bufsize = 16384, maxrewrite = 1024, maxpollevents = 200

Built with OpenSSL version : OpenSSL 1.0.2k-fips 26 Jan 2017
Running on OpenSSL version : OpenSSL 1.0.2k-fips 26 Jan 2017
OpenSSL library supports TLS extensions : yes
OpenSSL library supports SNI : yes
OpenSSL library supports : SSLv3 TLSv1.0 TLSv1.1 TLSv1.2
Built with Lua version : Lua 5.3.4
Built with transparent proxy support using: IP_TRANSPARENT IPV6_TRANSPARENT IP_FREEBIND
Encrypted password support via crypt(3): yes
Built with multi-threading support.
Built with PCRE version : 8.32 2012-11-30
Running on PCRE version : 8.32 2012-11-30
PCRE library supports JIT : no (USE_PCRE_JIT not set)
Built with zlib version : 1.2.7
Running on zlib version : 1.2.7
Compression algorithms supported : identity(“identity”), deflate(“deflate”), raw-deflate(“deflate”), gzip(“gzip”)
Built with network namespace support.

Available polling systems :
epoll : pref=300, test result OK
poll : pref=200, test result OK
select : pref=150, test result OK
Total: 3 (3 usable), will use epoll.

Available filters :
[SPOE] spoe
[COMP] compression
[TRACE] trace


#18

Update haproxy to 1.8.6 , that’s version has fixed this bug.


#19

@willy - Thanks, now haproxy works fine.


#20

Many thanks for confirming!