HAProxy community

Haproxy 1.8.4 segfault at 8 ip 0000000000475cb2


#7

(gdb) p *h2s
$1 = {cs = 0x2d3d1d0, h2c = 0x292b8f0, req = {state = HTTP_MSG_RQBEFORE, status = 0, flags = 0, curr_len = 0, body_len = 0, err_pos = 0, err_state = 0}, res = {state = HTTP_MSG_BODY, status = 200, flags = 1, curr_len = 532003, body_len = 0, err_pos = 0, err_state = 0}, by_id = {node = {branches = {b = {
0x476d20 <h2_timeout_task>, 0x3c97150}}, node_p = 0x3458fd1, leaf_p = 0x47265e1, bit = 8977, pfx = 964}, key = 0}, list = {n = 0x3e61ee0, p = 0xf4f2b4f402920004}, id = -185420556, flags = 12293, mws = 1, errcode = H2_ERR_NO_ERROR, st = H2_SS_IDLE}
(gdb) p *h2s->cs
$2 = {obj_type = 208, conn = 0x292b8f0, flags = 0, data = 0x0, data_cb = 0x0, ctx = 0x0}


#8

Thanks! No data layer??? I have to scratch my head to figure out how this can happen now.


#9

In fact h2s->cs looks very similar to h2s! obj_type 208 doesn’t exist and precisely matches 0xd0, the low byte of the cs pointer. Then conn, the second pointer, is in fact a pointer to the h2c (assuming the h2s is correct), which is not guaranteed yet given some other elements. But res looks valid, which tends to imply that the first part could be valid as well.

Then it would mean that h2s->cs is wrong and pointing to h2s instead. I’m not seeing anywhere in the code where this could accidentally happen, and the h2s and conn_stream structs are sufficiently different in size not to be merged into the same pool.

I’d be interested in a few extra commands:

p h2s
p cs
x/152xb h2s
x/48xb h2s->cs

To be honest, at this point I’m starting to suspect that an H2 stream was either deleted without being removed from the list, or inserted while it was already there, breaking the send list. I’m not seeing any matching places for now.


#10

(gdb) p h2s
$3 = (struct h2s *) 0x31f5b90
(gdb) p cs
No symbol "cs" in current context.
(gdb) x/152xb h2s
0x31f5b90: 0xd0 0xd1 0xd3 0x02 0x00 0x00 0x00 0x00
0x31f5b98: 0xf0 0xb8 0x92 0x02 0x00 0x00 0x00 0x00
0x31f5ba0: 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00
0x31f5ba8: 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00
0x31f5bb0: 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00
0x31f5bb8: 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00
0x31f5bc0: 0x1b 0x00 0xc8 0x00 0x01 0x00 0x00 0x00
0x31f5bc8: 0x23 0x1e 0x08 0x00 0x00 0x00 0x00 0x00
0x31f5bd0: 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00
0x31f5bd8: 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00
0x31f5be0: 0x20 0x6d 0x47 0x00 0x00 0x00 0x00 0x00
0x31f5be8: 0x50 0x71 0xc9 0x03 0x00 0x00 0x00 0x00
0x31f5bf0: 0xd1 0x8f 0x45 0x03 0x00 0x00 0x00 0x00
0x31f5bf8: 0xe1 0x65 0x72 0x04 0x00 0x00 0x00 0x00
0x31f5c00: 0x11 0x23 0xc4 0x03 0x00 0x00 0x00 0x00
0x31f5c08: 0xe0 0x1e 0xe6 0x03 0x00 0x00 0x00 0x00
0x31f5c10: 0x04 0x00 0x92 0x02 0xf4 0xb4 0xf2 0xf4
0x31f5c18: 0xf4 0xb4 0xf2 0xf4 0x05 0x30 0x00 0x00
0x31f5c20: 0x01 0x00 0x00 0x00 0x00 0x00 0x00 0x00
(gdb) x/48xb h2s->cs
0x2d3d1d0: 0xd0 0x41 0x8b 0x04 0x00 0x00 0x00 0x00
0x2d3d1d8: 0xf0 0xb8 0x92 0x02 0x00 0x00 0x00 0x00
0x2d3d1e0: 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00
0x2d3d1e8: 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00
0x2d3d1f0: 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00
0x2d3d1f8: 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00


#11

Thank you. So it’s clear that something happened, causing the h2s to be copied over the cs, but for now I cannot spot how this is even possible.


#12

@slawekc, even after carefully reading the code I’m not seeing anything that could explain this. However, Christopher recently fixed a bug that we initially underestimated:

commit 929b52d8a1a5d3271fa8b92b4a6e0f148e3a1d42
Author: Christopher Faulet <cfaulet@haproxy.com>
Date:   Mon Feb 26 13:43:38 2018 +0100

BUG/MINOR: h2: Set the target of dbuf_wait to h2c

In h2_get_dbuf, when the buffer allocation was failing, dbuf_wait.target was
errornously set to the connection (h2c->conn) instead of the h2 connection
descriptor (h2c).

This patch must be backported to 1.8.

It’s not immediately clear to me whether it can cause this or not, but the relation between the two is striking. This patch is already in the latest 1.8, so I’d like you to test 1.8.5 once it’s out to see whether we can consider this issue resolved.


#13

OK, I’ll try this patch, thanks.


#14

Hi, I’ve tested the new version (1.8.5) and the problem still occurs.
The current segfault message:

[3952008.175803] haproxy[17651]: segfault at 8 ip 0000000000475f52 sp 00007ffed26ce790 error 4 in haproxy[400000+12c000]

Backtrace:

> Core was generated by `/usr/sbin/haproxy -f /etc/haproxy/haproxy.cfg -p /run/haproxy.pid'.
> Program terminated with signal 11, Segmentation fault.
> #0  0x0000000000475f52 in h2_process_mux (h2c=0x3085880) at src/mux_h2.c:2125
> 2125                            h2s->cs->data_cb->send(h2s->cs);
> Missing separate debuginfos, use: debuginfo-install glibc-2.17-196.el7_4.2.x86_64 keyutils-libs-1.5.8-3.el7.x86_64 krb5-libs-1.15.1-8.el7.x86_64 libcom_err-1.42.9-10.el7.x86_64 libselinux-2.5-11.el7.x86_64 nss-softokn-freebl-3.28.3-8.el7_4.x86_64 openssl-libs-1.0.2k-8.el7.x86_64 pcre-8.32-17.el7.x86_64 zlib-1.2.7-17.el7.x86_64
> (gdb) bt full
> #0  0x0000000000475f52 in h2_process_mux (h2c=0x3085880) at src/mux_h2.c:2125
>         h2s = 0x29f1710
>         h2s_back = 0x3085890
> #1  h2_send (conn=<optimized out>) at src/mux_h2.c:2243
>         flags = <optimized out>
>         h2c = 0x3085880
>         done = <optimized out>
> #2  0x00000000004c68c2 in conn_fd_handler (fd=<optimized out>) at src/connection.c:128
>         conn = 0x2405dd0
>         flags = 0
> #3  0x00000000004d5303 in fd_process_cached_events () at src/fd.c:271
>         fd = 245
>         entry = 0
>         e = <optimized out>
> #4  0x0000000000486f7f in run_poll_loop () at src/haproxy.c:2426
>         next = <optimized out>
>         exp = <optimized out>
> #5  run_thread_poll_loop (data=data@entry=0x14cb190) at src/haproxy.c:2461
>         ptif = 0x734110 <per_thread_init_list>
>         ptdf = <optimized out>
>         start_lock = 0
> #6  0x000000000040a1d3 in main (argc=<optimized out>, argv=<optimized out>) at src/haproxy.c:3050
>         tids = 0x14cb190
>         threads = 0x15659d0
>         i = <optimized out>
>         err = <optimized out>
>         retry = <optimized out>
>         limit = {rlim_cur = 500145, rlim_max = 500145}
>         errmsg = "\000\000\000\000w", '\000' <repeats 11 times>, "\357%\306c\375\177\000\000\226mP\000\000\000\000\000\v\337P\000\000\000\000\000\v\337P\000\000\000\000\000|\000\000\000\000\000\000\000`'%$N\177\000\000\340\263\037\001\000\000\000\000`'%$N\177\000\000 \253\037\001\000\000\000\000pk\001\000\000\000\000\000\200\000\000"
>         pidfd = <optimized out>
> 
> 
> (gdb) p *h2s
> $1 = {cs = 0x1f11720, h2c = 0x3085880, req = {state = HTTP_MSG_RQBEFORE, status = 0, flags = 0, curr_len = 0, body_len = 0, err_pos = 0, err_state = 0}, res = {state = HTTP_MSG_BODY, status = 200, flags = 1, curr_len = 6253, body_len = 38751, err_pos = 0, err_state = 0}, by_id = {node = {branches = {b = {
>           0x456c00 <process_stream>, 0x27f98f0}}, node_p = 0x0, leaf_p = 0x0, bit = 0, pfx = 0}, key = 23}, list = {n = 0x3085908, p = 0x3085908}, id = 23, flags = 12293, mws = 12550414, errcode = H2_ERR_CANCEL, st = H2_SS_CLOSED}
> (gdb) p *h2s->cs
> $2 = {obj_type = 128, conn = 0x23353b1, flags = 0, data = 0x0, data_cb = 0x4, ctx = 0x1}

Also, I did some testing: I reduced the weight of one of the haproxy nodes in the cluster, and that node, receiving less traffic than the others, showed fewer segfault errors.


#15

Thank you for the report and sorry that the issue is still there :frowning:

I’ve identified a potential race that may happen with H2 on connection timeouts only when threads are used. I’ll work on it now but at the moment I don’t see how it could manifest itself to cause this. So we definitely have several errors for now :-/

I’m continuing to dig. Thanks!


#16

Just FYI, I found the cause and now have a fix for it. I’m now trying to verify the h2+threads race I think I identified, then to fix it, and I’ll issue 1.8.6 with all this fixed.


#17

If this can help, today I’ve got the same segfault with haproxy 1.8.4:

Apr 10 17:14:35 haproxy1 kernel: haproxy[5773]: segfault at 8 ip 00007f85025af3e8 sp 00007f84f9437a80 error 4 in haproxy[7f850249f000+16e000]
Apr 10 17:14:36 haproxy1 haproxy: [ALERT] 098/134312 (5763) : Current worker 5764 exited with code 139
Apr 10 17:14:36 haproxy1 haproxy: [ALERT] 098/134312 (5763) : exit-on-failure: killing every workers with SIGTERM
Apr 10 17:14:36 haproxy1 haproxy: [WARNING] 098/134312 (5763) : All workers exited. Exiting… (139)
Apr 10 17:14:36 haproxy1 systemd: haproxy.service: main process exited, code=exited, status=139/n/a
Apr 10 17:14:36 haproxy1 systemd: Unit haproxy.service entered failed state.
Apr 10 17:14:36 haproxy1 systemd: haproxy.service failed.

I’m using:

nbproc 1
nbthread 12
cpu-map auto:1/1-12 0-11
alpn h2,http/1.1

The haproxy info is:

HA-Proxy version 1.8.4-1deb90d 2018/02/08
Copyright 2000-2018 Willy Tarreau willy@haproxy.org

Build options :
TARGET = linux2628
CPU = generic
CC = gcc
CFLAGS = -O2 -g -fno-strict-aliasing -Wdeclaration-after-statement -fwrapv -Wno-unused-label
OPTIONS = USE_LINUX_TPROXY=1 USE_ZLIB=1 USE_REGPARM=1 USE_OPENSSL=1 USE_LUA=1 USE_SYSTEMD=1 USE_PCRE=1

Default settings :
maxconn = 2000, bufsize = 16384, maxrewrite = 1024, maxpollevents = 200

Built with OpenSSL version : OpenSSL 1.0.2k-fips 26 Jan 2017
Running on OpenSSL version : OpenSSL 1.0.2k-fips 26 Jan 2017
OpenSSL library supports TLS extensions : yes
OpenSSL library supports SNI : yes
OpenSSL library supports : SSLv3 TLSv1.0 TLSv1.1 TLSv1.2
Built with Lua version : Lua 5.3.4
Built with transparent proxy support using: IP_TRANSPARENT IPV6_TRANSPARENT IP_FREEBIND
Encrypted password support via crypt(3): yes
Built with multi-threading support.
Built with PCRE version : 8.32 2012-11-30
Running on PCRE version : 8.32 2012-11-30
PCRE library supports JIT : no (USE_PCRE_JIT not set)
Built with zlib version : 1.2.7
Running on zlib version : 1.2.7
Compression algorithms supported : identity("identity"), deflate("deflate"), raw-deflate("deflate"), gzip("gzip")
Built with network namespace support.

Available polling systems :
epoll : pref=300, test result OK
poll : pref=200, test result OK
select : pref=150, test result OK
Total: 3 (3 usable), will use epoll.

Available filters :
[SPOE] spoe
[COMP] compression
[TRACE] trace


#18

Update haproxy to 1.8.6; that version has this bug fixed.


#19

@willy - Thanks, now haproxy works fine.


#20

Many thanks for confirming!


#21

Perfect, thanks for the info!


#22

Hi, I have this same error with version 1.9.0


#23

I suspect you’re simply facing another segfault which has nothing to do with this one, given that 1) the initial bug was fixed long ago, and 2) the 1.8 and 1.9 code are very different. Feel free to run gdb and “bt full”; that may help. We’ve already fixed some issues in the latest master, and 1.9.1 will include many of those fixes.


#24

Downgraded to 1.8.16 and the segmentation fault does not occur.

@lukastribus In fact it is a very similar crash, this is from my log:

haproxy[29004]: segfault at 0 ip 000000000045ac72 sp 00007ffdad538050 error 4 in haproxy[400000+181000]

In my case, it happens only with h2 activated in 1.9.0.

BTW, I am not looking for help troubleshooting… just publishing what I found. I already resolved it with 1.8.16 and it is working OK.


#25

Like Willy explained, this is not the same crash. This thread is about a bug that was fixed a long time ago. Just because you see “segfault” in dmesg does not mean it’s the same crash.

If you want to help troubleshooting this issue, please refer to the following thread:

Thanks


#26