I configured syslog forwarding to two servers in round-robin, and now I’m trying to work out why HAProxy stops using a server after the receiving service on it is restarted. The logs don’t show any up/down activity for a mode log backend.
A firewall/appliance forwards its logs to HAProxy, which has two TCP servers configured; on each of them, Elastic Agent parses the data and ships it to Elasticsearch. Missing data in Elastic != good…
HAProxy config:
global
    log /dev/log local0
    log /dev/log local1 notice

defaults
    log global
    mode http
    retries 3
    timeout queue 1m
    timeout connect 5s
    timeout client 50s
    timeout server 50s
    timeout check 10s
    maxconn 3000

log-forward syslog
    bind 10.2.2.5:1514
    log backend@panoslog user

backend panoslog
    mode log
    balance roundrobin
    default-server inter 1s fastinter 500ms fall 3 rise 2 on-error mark-down log-bufsize 8388608
    server log0 10.2.2.51:9001 log-proto octet-count check
    server log1 10.2.2.52:9001 log-proto octet-count check
This same HAProxy instance has stats enabled and load-balances MySQL without missing a beat.
Nothing appears in the logs about the backend servers. I added log-bufsize and on-error to see if they would help mitigate the missing data.
The HAProxy version is 3.0.12-1ppa1~jammy.
Restarting HAProxy fixes it until the receiving server is reloaded again. It appears that HAProxy keeps writing data to the ring buffer, but fails to forward it to the server and then drops it?! I can’t prove this, as my only evidence is seeing fewer events in Elastic when a server is affected.
Using tcpdump I can see that the server answers HAProxy’s SYN packets with a TCP RESET until HAProxy is restarted. If I add fastinter 500ms observe layer4 error-limit 2, the server is detected as down, but it quickly comes back up even though it is still sending TCP RESETs in response to HAProxy’s SYN packets:
Oct 08 18:17:17 ha.domain.com haproxy[285424]: [WARNING] (285424) : Server panoslog/log1 is DOWN, reason: Layer4 connection problem, info: "Connection refused", check duration: 0ms. 1 active and 0 backup servers left. 1 sessions active, 0 requeued, 0 remaining in queue.
Oct 08 18:17:17 ha.domain.com haproxy[285424]: Server panoslog/log1 is DOWN, reason: Layer4 connection problem, info: "Connection refused", check duration: 0ms. 1 active and 0 backup servers left. 1 sessions active, 0 requeued, 0 remaining in queue.
Oct 08 18:17:17 ha.domain.com haproxy[285424]: Server panoslog/log1 is DOWN, reason: Layer4 connection problem, info: "Connection refused", check duration: 0ms. 1 active and 0 backup servers left. 1 sessions active, 0 requeued, 0 remaining in queue.
Oct 08 18:17:20 ha.domain.com haproxy[285424]: [WARNING] (285424) : Server panoslog/log1 is UP, reason: Layer4 check passed, check duration: 0ms. 2 active and 0 backup servers online. 0 sessions requeued, 0 total in queue.
Oct 08 18:17:20 ha.domain.com haproxy[285424]: Server panoslog/log1 is UP, reason: Layer4 check passed, check duration: 0ms. 2 active and 0 backup servers online. 0 sessions requeued, 0 total in queue.
Oct 08 18:17:20 ha.domain.com haproxy[285424]: Server panoslog/log1 is UP, reason: Layer4 check passed, check duration: 0ms. 2 active and 0 backup servers online. 0 sessions requeued, 0 total in queue.
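For reference, the variant of the default-server line that produced the DOWN/UP flapping above would look roughly like the sketch below. Only the added keywords (observe layer4 error-limit 2) come from the description; their position on the line is an assumption, and fastinter 500ms was already part of the posted configuration.

```haproxy
backend panoslog
    mode log
    balance roundrobin
    # Assumed placement: "observe layer4 error-limit 2" added to the existing
    # line, so that on-error mark-down can act on observed connection errors.
    default-server inter 1s fastinter 500ms fall 3 rise 2 observe layer4 error-limit 2 on-error mark-down log-bufsize 8388608
    server log0 10.2.2.51:9001 log-proto octet-count check
    server log1 10.2.2.52:9001 log-proto octet-count check
```

With this variant the server goes DOWN at 18:17:17 and the Layer4 check reports it UP again at 18:17:20, three seconds later.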
If this is an issue with Elastic Agent, then why can it be fixed by restarting HAProxy? If it’s HAProxy, then why isn’t the server marked down and kept down while it keeps answering with a TCP RESET?