So I just noticed the issue happening again, this time on the proxy instance used for MySQL/Galera rather than the Redis one. As before, connections to the unix sockets (both of them, since the MySQL proxy has two frontends) are refused, while connections to the TCP ports are accepted and work fine. Here is what I’ve gathered; I’ll keep the instance running for now in case you can think of something else to look for:
Screenshot of stats page: https://www.screencast.com/t/oXhh22uc
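For reference, the connection tests are just plain socat connect attempts, roughly like the following (the TCP port shown is a placeholder rather than the real frontend port):
# socat - UNIX-CONNECT:/run/magento/galera-read.sock
# socat - TCP:127.0.0.1:3306
On the broken instance the first is refused immediately while the second connects and works fine.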
On the container:
# ls -al /run/magento/galera-*.sock
srwxrwxrwx 1 root root 0 Apr 5 12:32 galera-read.sock
srwxrwxrwx 1 root root 0 Apr 5 12:32 galera-write.sock
# stat /run/magento/galera-read.sock
File: /run/magento/galera-read.sock
Size: 0 Blocks: 0 IO Block: 4096 socket
Device: 13h/19d Inode: 2730 Links: 1
Access: (0777/srwxrwxrwx) Uid: ( 0/ root) Gid: ( 0/ root)
Access: 2017-04-05 12:32:51.344687784 +0000
Modify: 2017-04-05 12:32:51.344687784 +0000
Change: 2017-04-05 12:32:51.344687784 +0000
Birth: -
ps aux on the container:
PID USER TIME COMMAND
6740 root 0:00 /usr/local/sbin/haproxy-master
6741 root 2:27 /usr/local/sbin/haproxy -p /run/haproxy.pid -f /etc/haproxy.cfg -Ds -sf 6663
lsof on the container:
6740 /usr/local/sbin/haproxy pipe:[1878095]
6740 /usr/local/sbin/haproxy pipe:[1878096]
6740 /usr/local/sbin/haproxy pipe:[1878097]
6740 /usr/local/sbin/haproxy socket:[1872584]
6740 /usr/local/sbin/haproxy socket:[1872643]
6740 /usr/local/sbin/haproxy anon_inode:[eventpoll]
6740 /usr/local/sbin/haproxy socket:[4861265]
6741 /usr/local/sbin/haproxy anon_inode:[eventpoll]
6741 /usr/local/sbin/haproxy socket:[1872584]
6741 /usr/local/sbin/haproxy socket:[1872643]
6741 /usr/local/sbin/haproxy socket:[4861264]
6741 /usr/local/sbin/haproxy socket:[4861265]
6741 /usr/local/sbin/haproxy socket:[4861266]
6741 /usr/local/sbin/haproxy socket:[4861267]
6741 /usr/local/sbin/haproxy socket:[4861268]
6741 /usr/local/sbin/haproxy socket:[4861269]
The container is running Alpine, so that lsof is the BusyBox version, which doesn’t tell much; I therefore also ran lsof on the host (the PIDs differ because of Docker’s PID namespace mapping):
COMMAND PID TID USER FD TYPE DEVICE SIZE/OFF NODE NAME
haproxy 26294 root cwd DIR 0,83 4096 2 /
haproxy 26294 root rtd DIR 0,83 4096 2 /
haproxy 26294 root txt REG 0,83 7322248 83 /usr/local/sbin/haproxy
haproxy 26294 root mem REG 0,83 89 /usr/lib/libpcre.so.1.2.7 (stat: No such file or directory)
haproxy 26294 root mem REG 0,83 87 /lib/libcrypto.so.1.0.0 (stat: No such file or directory)
haproxy 26294 root mem REG 0,83 84 /lib/libssl.so.1.0.0 (stat: No such file or directory)
haproxy 26294 root mem REG 0,83 73 /lib/libz.so.1.2.8 (stat: No such file or directory)
haproxy 26294 root mem REG 0,83 29 /lib/ld-musl-x86_64.so.1 (stat: No such file or directory)
haproxy 26294 root mem REG 0,83 93 /etc/localtime (path dev=8,3, inode=15349327)
haproxy 26294 root 0r FIFO 0,10 0t0 1878095 pipe
haproxy 26294 root 1w FIFO 0,10 0t0 1878096 pipe
haproxy 26294 root 2w FIFO 0,10 0t0 1878097 pipe
haproxy 26294 root 3u sock 0,8 0t0 1872584 protocol: NETLINK
haproxy 26294 root 4u sock 0,8 0t0 1872643 protocol: NETLINK
haproxy 26294 root 5u a_inode 0,11 0 8076 [eventpoll]
haproxy 26294 root 8u sock 0,8 0t0 4861265 protocol: UDP
haproxy 26295 root cwd DIR 0,83 4096 2 /
haproxy 26295 root rtd DIR 0,83 4096 2 /
haproxy 26295 root txt REG 0,83 7322248 83 /usr/local/sbin/haproxy
haproxy 26295 root mem REG 0,83 89 /usr/lib/libpcre.so.1.2.7 (stat: No such file or directory)
haproxy 26295 root mem REG 0,83 87 /lib/libcrypto.so.1.0.0 (stat: No such file or directory)
haproxy 26295 root mem REG 0,83 84 /lib/libssl.so.1.0.0 (stat: No such file or directory)
haproxy 26295 root mem REG 0,83 73 /lib/libz.so.1.2.8 (stat: No such file or directory)
haproxy 26295 root mem REG 0,83 29 /lib/ld-musl-x86_64.so.1 (stat: No such file or directory)
haproxy 26295 root mem REG 0,83 93 /etc/localtime (path dev=8,3, inode=15349327)
haproxy 26295 root 0u a_inode 0,11 0 8076 [eventpoll]
haproxy 26295 root 3u sock 0,8 0t0 1872584 protocol: NETLINK
haproxy 26295 root 4u sock 0,8 0t0 1872643 protocol: NETLINK
haproxy 26295 root 7u sock 0,8 0t0 4861264 protocol: TCP
haproxy 26295 root 8u sock 0,8 0t0 4861265 protocol: UDP
haproxy 26295 root 9u sock 0,8 0t0 4861266 protocol: UNIX
haproxy 26295 root 10u sock 0,8 0t0 4861267 protocol: TCP
haproxy 26295 root 11u sock 0,8 0t0 4861268 protocol: UNIX
haproxy 26295 root 12u sock 0,8 0t0 4861269 protocol: TCP
So I am guessing that sockets 4861266 and 4861268 are the galera-read.sock and galera-write.sock sockets, in which case they do appear to be open…
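One way to confirm that guess (just a sketch; it assumes ss from iproute2 is available in the container, or is run from the host inside the container’s namespaces): ss can list listening unix sockets together with their socket inode and bound path, e.g.
# ss -xlp | grep galera
If the galera-*.sock paths show up there against those same socket inode numbers, the worker really is still listening on them.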
Here are the socket files and their inode numbers:
# ls -ial /run/magento/
total 4
2488 drwxr-xr-x 2 root root 140 Apr 5 19:32 .
20 drwxr-xr-x 3 root root 4096 Apr 3 20:20 ..
2730 srwxrwxrwx 1 root root 0 Apr 5 12:32 galera-read.sock
2729 srwxrwxrwx 1 root root 0 Apr 5 12:32 galera-write.sock
Running this command from the host on the same directory shows the same inode numbers.
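(To be clear about what running it “from the host” means: one way to look at the same directory is through the container’s mount namespace, for example something like
# nsenter -t 26295 -m ls -ial /run/magento/
using the host-side PID; however it’s done, the inode numbers come out the same.)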
I can’t use strace from within the container without restarting it to add --cap-add SYS_PTRACE (I didn’t know this was needed until I tried it and got an ‘Operation not permitted’ error). However, I can still run it from the host. The master process looks like this:
# strace -p26294
strace: Process 26294 attached
wait4(-1,
And nothing additional shows up there when I try to connect. The child process looks like this, repeated over and over (there are three nodes to run health checks against):
# strace -p26295
strace: Process 26295 attached
epoll_pwait(0, [], 200, 2, NULL, 8) = 0
epoll_pwait(0, [], 200, 2, NULL, 8) = 0
epoll_pwait(0, [], 200, 0, NULL, 8) = 0
socket(PF_INET, SOCK_STREAM, IPPROTO_TCP) = 1
fcntl(1, F_SETFL, O_RDONLY|O_NONBLOCK) = 0
setsockopt(1, SOL_TCP, TCP_NODELAY, [1], 4) = 0
setsockopt(1, SOL_SOCKET, SO_KEEPALIVE, [1], 4) = 0
connect(1, {sa_family=AF_INET, sin_port=htons(8080), sin_addr=inet_addr("10.81.128.49")}, 16) = -1 EINPROGRESS (Operation now in progress)
epoll_pwait(0, [], 200, 0, NULL, 8) = 0
recvfrom(1, 0x55b2d6f6d074, 16384, 0, NULL, NULL) = -1 EAGAIN (Resource temporarily unavailable)
getsockopt(1, SOL_SOCKET, SO_ERROR, [0], [4]) = 0
sendto(1, "GET / HTTP/1.0\r\n\r\n", 18, MSG_DONTWAIT|MSG_NOSIGNAL, NULL, 0) = 18
epoll_ctl(0, EPOLL_CTL_ADD, 1, {EPOLLIN|EPOLLRDHUP, {u32=1, u64=1}}) = 0
epoll_pwait(0, [{EPOLLIN, {u32=1, u64=1}}], 200, 270, NULL, 8) = 1
recvfrom(1, "HTTP/1.0 200 OK\r\nDate: Wed, 05 A"..., 16384, 0, NULL, NULL) = 151
setsockopt(1, SOL_SOCKET, SO_LINGER, {onoff=1, linger=0}, 8) = 0
close(1) = 0
epoll_pwait(0, [], 200, 267, NULL, 8) = 0
epoll_pwait(0, [], 200, 3, NULL, 8) = 0
epoll_pwait(0, [], 200, 0, NULL, 8) = 0
socket(PF_INET, SOCK_STREAM, IPPROTO_TCP) = 1
fcntl(1, F_SETFL, O_RDONLY|O_NONBLOCK) = 0
setsockopt(1, SOL_TCP, TCP_NODELAY, [1], 4) = 0
setsockopt(1, SOL_SOCKET, SO_KEEPALIVE, [1], 4) = 0
connect(1, {sa_family=AF_INET, sin_port=htons(8080), sin_addr=inet_addr("10.81.128.79")}, 16) = -1 EINPROGRESS (Operation now in progress)
epoll_pwait(0, [], 200, 0, NULL, 8) = 0
recvfrom(1, 0x55b2d6f75cb4, 16384, 0, NULL, NULL) = -1 EAGAIN (Resource temporarily unavailable)
getsockopt(1, SOL_SOCKET, SO_ERROR, [0], [4]) = 0
sendto(1, "GET / HTTP/1.0\r\n\r\n", 18, MSG_DONTWAIT|MSG_NOSIGNAL, NULL, 0) = 18
epoll_ctl(0, EPOLL_CTL_ADD, 1, {EPOLLIN|EPOLLRDHUP, {u32=1, u64=1}}) = 0
epoll_pwait(0, [{EPOLLIN, {u32=1, u64=1}}], 200, 33, NULL, 8) = 1
recvfrom(1, "HTTP/1.0 200 OK\r\nDate: Wed, 05 A"..., 16384, 0, NULL, NULL) = 151
setsockopt(1, SOL_SOCKET, SO_LINGER, {onoff=1, linger=0}, 8) = 0
close(1) = 0
epoll_pwait(0, [], 200, 30, NULL, 8) = 0
epoll_pwait(0, [], 200, 3, NULL, 8) = 0
epoll_pwait(0, [], 200, 0, NULL, 8) = 0
socket(PF_INET, SOCK_STREAM, IPPROTO_TCP) = 1
fcntl(1, F_SETFL, O_RDONLY|O_NONBLOCK) = 0
setsockopt(1, SOL_TCP, TCP_NODELAY, [1], 4) = 0
setsockopt(1, SOL_SOCKET, SO_KEEPALIVE, [1], 4) = 0
connect(1, {sa_family=AF_INET, sin_port=htons(8080), sin_addr=inet_addr("10.81.128.107")}, 16) = -1 EINPROGRESS (Operation now in progress)
epoll_pwait(0, [], 200, 0, NULL, 8) = 0
recvfrom(1, 0x55b2d6f64434, 16384, 0, NULL, NULL) = -1 EAGAIN (Resource temporarily unavailable)
getsockopt(1, SOL_SOCKET, SO_ERROR, [0], [4]) = 0
sendto(1, "GET / HTTP/1.0\r\n\r\n", 18, MSG_DONTWAIT|MSG_NOSIGNAL, NULL, 0) = 18
epoll_ctl(0, EPOLL_CTL_ADD, 1, {EPOLLIN|EPOLLRDHUP, {u32=1, u64=1}}) = 0
epoll_pwait(0, [{EPOLLIN, {u32=1, u64=1}}], 200, 690, NULL, 8) = 1
recvfrom(1, "HTTP/1.0 200 OK\r\nDate: Wed, 05 A"..., 16384, 0, NULL, NULL) = 151
setsockopt(1, SOL_SOCKET, SO_LINGER, {onoff=1, linger=0}, 8) = 0
close(1) = 0
There is a lot of this output, so instead of running the socat command just once I ran it about 20 times within a couple of seconds, and doing so didn’t appear to add any output to the strace stream. I also searched the stream for the socket filename and did not find it. In contrast, when I ran strace on a working HAProxy instance while using socat to test the connection (successfully), here is the result:
epoll_pwait(0, [{EPOLLIN, {u32=9, u64=9}}], 200, 1000, NULL, 8) = 1
accept4(9, {sa_family=AF_LOCAL, NULL}, [2], SOCK_NONBLOCK) = 1
getsockname(1, {sa_family=AF_LOCAL, sun_path="/run/magento/redis-session.sock.59.tmp"}, [41]) = 0
getpid() = 63
accept4(9, 0x7ffcc950a700, 0x7ffcc950a6fc, SOCK_NONBLOCK) = -1 EAGAIN (Resource temporarily unavailable)
recvfrom(1, "", 15360, 0, NULL, NULL) = 0
shutdown(1, SHUT_WR) = 0
close(1) = 0
I noticed that the broken instance had the ‘-sf’ flag, which is expected since the config is reloaded dynamically as instances are added/removed, but when I intentionally triggered a reload on the working instances they continued to work, so I still don’t know how to reproduce the issue.
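(For completeness, “intentionally triggered” just means forcing a reload, i.e. getting a SIGHUP to the wrapper, roughly along the lines of
# docker kill --signal=HUP <container-name>
with the container name as a placeholder; the wrapper then re-executes the haproxy binary with -sf, as the log below shows.)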
Here is what the HAProxy logs look like:
docker-entrypoint.sh: Reloading config...
<5>haproxy-systemd-wrapper: re-executing on SIGHUP.
<7>haproxy-systemd-wrapper: executing /usr/local/sbin/haproxy -p /run/haproxy.pid -f /etc/haproxy.cfg -Ds -sf 24885
Proxy stats started.
Proxy galera-write-unix started.
Proxy galera-write-tcp started.
Proxy galera-read-unix started.
Proxy galera-read-tcp started.
Proxy galera-write started.
Proxy galera-read started.
Stopping proxy stats in 0 ms.
Stopping frontend galera-write-unix in 0 ms.
Stopping frontend galera-write-tcp in 0 ms.
Stopping frontend galera-read-unix in 0 ms.
Stopping frontend galera-read-tcp in 0 ms.
Stopping backend galera-write in 0 ms.
Stopping backend galera-read in 0 ms.
Proxy stats stopped (FE: 0 conns, BE: 0 conns).
Proxy galera-write-unix stopped (FE: 0 conns, BE: 0 conns).
Proxy galera-write-tcp stopped (FE: 59 conns, BE: 0 conns).
Proxy galera-read-unix stopped (FE: 0 conns, BE: 0 conns).
Proxy galera-read-tcp stopped (FE: 0 conns, BE: 0 conns).
Proxy galera-write stopped (FE: 0 conns, BE: 59 conns).
Proxy galera-read stopped (FE: 0 conns, BE: 0 conns).
Server galera-write/node14 is DOWN, reason: Layer4 connection problem, info: "Connection refused", check duration: 0ms. 2 active and 0 backup servers left. 0 sessions active, 0 requeued, 0 remaining in queue.
Server galera-read/node14 is DOWN via galera-write/node14. 2 active and 0 backup servers left. 0 sessions active, 0 requeued, 0 remaining in queue.
Server galera-write/node22 is DOWN, reason: Layer4 connection problem, info: "Connection refused", check duration: 0ms. 1 active and 0 backup servers left. 0 sessions active, 0 requeued, 0 remaining in queue.
Server galera-read/node22 is DOWN via galera-write/node22. 1 active and 0 backup servers left. 0 sessions active, 0 requeued, 0 remaining in queue.
Server galera-write/node60 is DOWN, reason: Layer7 wrong status, code: 503, info: "Service Unavailable", check duration: 2ms. 0 active and 0 backup servers left. 0 sessions active, 0 requeued, 0 remaining in queue.
backend galera-write has no server available!
Server galera-read/node60 is DOWN via galera-write/node60. 0 active and 0 backup servers left. 0 sessions active, 0 requeued, 0 remaining in queue.
backend galera-read has no server available!
Server galera-write/node60 is UP, reason: Layer7 check passed, code: 200, info: "OK", check duration: 3ms. 1 active and 0 backup servers online. 0 sessions requeued, 0 total in queue.
Server galera-read/node60 is UP via galera-write/node60. 1 active and 0 backup servers online. 0 sessions requeued, 0 total in queue.
So in summary:
- Proved using socat that the connection-refused issue seen from the host is also an issue from within the container, so I don’t see any reason to believe this is a Docker filesystem issue. The inode numbers also match between the container and the host.
- HAProxy still hasn’t logged any errors, and the stats page shows that HAProxy thinks everything is working just fine.
- Still no idea how to reproduce the issue other than waiting a long time… Interestingly, the other three HAProxy instances with the exact same health checks are still running without the unix socket issue, and on the instance that does have it, both frontends are affected rather than just one of the two.
That’s as much information as I know how to give, but let me know if you can think of anything else to check, and thanks again for the help.
Colin