I am in the process of writing a Lua script that checks the replication lag between two different MySQL instances and fails over to the secondary instance when the primary is unavailable or its replication lag is too high. HAProxy is running as a Kubernetes deployment. I am seeing random container restarts once or twice a day due to the following error:
Connection: MySQL connection (0x55a65e962118)
Checking lag for tidb-primary conn: MySQL connection (0x55a65e962118)
primary lag is:99141
Connection: MySQL connection (0x55a65e96a968)
Checking lag for tidb-secondary conn: MySQL connection (0x55a65e96a968)
Thread 4 is about to kill the process.
>Thread 1 : id=0x7f4d3c214200 act=1 glob=0 wq=1 rq=0 tl=0 tlsz=0 rqsz=0
stuck=1 prof=1 harmless=0 wantrdv=0
cpu_ns: poll=13486920498 now=15196499898 diff=1709579400
curr_task=0x55a65e920a00 (task) calls=1 last=1712927637 ns ago
fct=0x55a65e440590(process_stream) ctx=0x55a65e9fc330
strm=0x55a65e9fc330 src=10.1.136.232 fe=tidb be=tidb dst=unknown
rqf=908021 rqa=0 rpf=80000000 rpa=0 sif=EST,80020 sib=INI,30
af=(nil),0 csf=0x55a65e9fc240,9000
ab=(nil),0 csb=(nil),0
cof=0x55a65e935920,80041300:PASS(0x55a65e9df2b0)/RAW((nil))/tcpv4(28)
cob=(nil),0:NONE((nil))/NONE((nil))/NONE(0)
Current executing Lua from a stream analyser --
call trace(17):
| 0x55a65e4b427b [89 c5 85 c0 0f 85 db 01]: ha_thread_dump+0x24b/0x448
| 0x55a65e4b44e6 [48 8b 05 c3 6c 22 00 48]: debug_handler+0x66/0x10e
| 0x7f4d3c9af730 [48 c7 c0 0f 00 00 00 0f]: libpthread:+0x12730
| 0x55a65e3bb062 [e9 79 ff ff ff 66 0f 1f]: hlua_ctx_init+0xb2/0x186
| 0x55a65e3bd871 [85 c0 0f 84 3f 01 00 00]: main+0x32211
| 0x55a65e456eed [85 c0 0f 84 8b 00 00 00]: sample_process+0x4d/0x127
| 0x55a65e457a0c [48 85 c0 74 3f 64 48 63]: sample_fetch_as_type+0x3c/0x9b
| 0x55a65e485baa [48 85 c0 0f 84 e4 15 00]: sess_build_logline+0x204a/0x3b39
| 0x55a65e43949b [85 c0 75 20 48 8b 3d 62]: main+0xade3b
| 0x55a65e4424c9 [85 c0 0f 85 b3 f2 ff ff]: process_stream+0x1f39/0x4f76
>Thread 2 : id=0x7f4d3c1f0700 act=1 glob=0 wq=1 rq=0 tl=0 tlsz=0 rqsz=0
stuck=1 prof=1 harmless=0 wantrdv=0
cpu_ns: poll=12105320018 now=14481791728 diff=2376471710
curr_task=0x7f4d34017330 (task) calls=1 last=2380083314 ns ago
fct=0x55a65e440590(process_stream) ctx=0x7f4d34017560
strm=0x7f4d34017560 src=10.1.136.232 fe=tidb be=tidb dst=unknown
rqf=908021 rqa=0 rpf=80000000 rpa=0 sif=EST,80020 sib=INI,30
af=(nil),0 csf=0x7f4d34024790,9000
ab=(nil),0 csb=(nil),0
cof=0x7f4d3402b200,80041300:PASS(0x7f4d340247e0)/RAW((nil))/tcpv4(27)
cob=(nil),0:NONE((nil))/NONE((nil))/NONE(0)
Current executing Lua from a stream analyser --
call trace(16):
| 0x55a65e4b427b [89 c5 85 c0 0f 85 db 01]: ha_thread_dump+0x24b/0x448
| 0x55a65e4b44e6 [48 8b 05 c3 6c 22 00 48]: debug_handler+0x66/0x10e
| 0x7f4d3c9af730 [48 c7 c0 0f 00 00 00 0f]: libpthread:+0x12730
| 0x55a65e3bb062 [e9 79 ff ff ff 66 0f 1f]: hlua_ctx_init+0xb2/0x186
| 0x55a65e3bd871 [85 c0 0f 84 3f 01 00 00]: main+0x32211
| 0x55a65e456eed [85 c0 0f 84 8b 00 00 00]: sample_process+0x4d/0x127
| 0x55a65e457a0c [48 85 c0 74 3f 64 48 63]: sample_fetch_as_type+0x3c/0x9b
| 0x55a65e485baa [48 85 c0 0f 84 e4 15 00]: sess_build_logline+0x204a/0x3b39
| 0x55a65e43949b [85 c0 75 20 48 8b 3d 62]: main+0xade3b
| 0x55a65e4424c9 [85 c0 0f 85 b3 f2 ff ff]: process_stream+0x1f39/0x4f76
Thread 3 : id=0x7f4d3b0af700 act=1 glob=0 wq=0 rq=0 tl=0 tlsz=0 rqsz=0
stuck=0 prof=1 harmless=0 wantrdv=0
cpu_ns: poll=12751601188 now=12754527148 diff=2925960
curr_task=0x7f4d2c0170a0 (task) calls=1 last=2507502616 ns ago
fct=0x55a65e440590(process_stream) ctx=0x7f4d2c016c50
strm=0x7f4d2c016c50 src=10.1.136.232 fe=tidb be=tidb dst=unknown
rqf=908021 rqa=0 rpf=80000000 rpa=0 sif=EST,80020 sib=INI,30
af=(nil),0 csf=0x7f4d2c0318a0,9000
ab=(nil),0 csb=(nil),0
cof=0x7f4d2c0326e0,80041300:PASS(0x7f4d2c0403a0)/RAW((nil))/tcpv4(24)
cob=(nil),0:NONE((nil))/NONE((nil))/NONE(0)
Current executing Lua from a stream analyser -- stack traceback:
[C]: in method 'execute'
/usr/local/etc/haproxy/lagcheck/lagcheck.lua:91: in function </usr/local/etc/haproxy/lagcheck/lagcheck.lua:90>
[C]: in global 'pcall'
/usr/local/etc/haproxy/lagcheck/lagcheck.lua:98: in global 'check_lag'
/usr/local/etc/haproxy/lagcheck/lagcheck.lua:160: in function </usr/local/etc/haproxy/lagcheck/lagcheck.lua:124>
*>Thread 4 : id=0x7f4d3a8ae700 act=1 glob=0 wq=0 rq=0 tl=0 tlsz=0 rqsz=0
stuck=1 prof=1 harmless=0 wantrdv=0
cpu_ns: poll=11591051168 now=14044129048 diff=2453077880
curr_task=0x7f4d300170a0 (task) calls=1 last=2461987384 ns ago
fct=0x55a65e440590(process_stream) ctx=0x7f4d30034b90
strm=0x7f4d30034b90 src=10.1.135.141 fe=tidb be=tidb dst=unknown
rqf=908021 rqa=0 rpf=80000000 rpa=0 sif=EST,80020 sib=INI,30
af=(nil),0 csf=0x7f4d3002d350,9000
ab=(nil),0 csb=(nil),0
cof=0x7f4d30016a70,80041300:PASS(0x7f4d30025ab0)/RAW((nil))/tcpv4(26)
cob=(nil),0:NONE((nil))/NONE((nil))/NONE(0)
Current executing Lua from a stream analyser --
call trace(20):
| 0x55a65e51d607 [48 83 c4 10 5b 5d 41 5c]: wdt_handler+0x107/0x114
| 0x7f4d3c9af730 [48 c7 c0 0f 00 00 00 0f]: libpthread:+0x12730
| 0x55a65e3bb062 [e9 79 ff ff ff 66 0f 1f]: hlua_ctx_init+0xb2/0x186
| 0x55a65e3bd871 [85 c0 0f 84 3f 01 00 00]: main+0x32211
| 0x55a65e456eed [85 c0 0f 84 8b 00 00 00]: sample_process+0x4d/0x127
| 0x55a65e457a0c [48 85 c0 74 3f 64 48 63]: sample_fetch_as_type+0x3c/0x9b
| 0x55a65e485baa [48 85 c0 0f 84 e4 15 00]: sess_build_logline+0x204a/0x3b39
| 0x55a65e43949b [85 c0 75 20 48 8b 3d 62]: main+0xade3b
| 0x55a65e4424c9 [85 c0 0f 85 b3 f2 ff ff]: process_stream+0x1f39/0x4f76
[NOTICE] 119/123741 (1) : haproxy version is 2.2.13-5f3eb59
[NOTICE] 119/123741 (1) : path to executable is /usr/local/sbin/haproxy
[ALERT] 119/123741 (1) : Current worker #1 (7) exited with code 134 (Aborted)
[ALERT] 119/123741 (1) : exit-on-failure: killing every processes with SIGTERM
[WARNING] 119/123741 (1) : All workers exited. Exiting... (134)
I am fairly new to Lua. After a few changes, my script looks like this:
-- this script requires two backends named tidb-primary and tidb-secondary
-- to be present in the haproxy backend configuration.
db_driver = require "luasql.mysql"
env = db_driver.mysql()
CONNECTION_RETRY_LIMIT = 5
DEFAULT_BACKEND = 'tidb-primary'
LAG_THRESHOLD = 5000
SECONDS_TIME_BETWEEN_CHECKS = 3
selected_backend = DEFAULT_BACKEND
start_time = os.time()
function print_connection_error()
core.Debug("Failed to connect to tidb")
end
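-- luasql's env:connect(sourcename, username, password, hostname, port) returns
-- a connection object on success, or nil plus an error message on failure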
function primary_connection()
return env:connect('api', 'haproxy', 'password', 'mysql-master-url', 4000)
end
function secondary_connection()
return env:connect('api', 'haproxy', 'password', 'mysql-secondary-master-url', 4000)
end
-- Todo: pick primary server-addr from backends table (see the sketch after the script)
core.Debug('creating a connection to primary tidb server..')
primary_conn_retry = 0
function create_primary_connection()
-- pcall returns a status flag first; when it fails, the error message lands
-- in the second return value, otherwise env:connect returns conn or nil, err
conn_status, primary_conn, err = pcall(primary_connection)
if not conn_status or primary_conn == nil then
core.Debug("ERROR: in create_primary_connection: " .. tostring(err or primary_conn))
primary_conn_retry = primary_conn_retry + 1
if primary_conn_retry > CONNECTION_RETRY_LIMIT then
primary_conn_retry = 0
core.Debug("ERROR: Hit retry limit! Could not create a primary connection")
return nil
end
-- the recursive retry result must be returned, otherwise a successful
-- retry is silently dropped
return create_primary_connection()
end
primary_conn_retry = 0
return primary_conn
end
core.Debug('creating a connection to secondary tidb server..')
secondary_conn_retry = 0
function create_secondary_connection()
conn_status, secondary_conn, err = pcall(secondary_connection)
if not conn_status or secondary_conn == nil then
core.Debug("ERROR: in create_secondary_connection: " .. tostring(err or secondary_conn))
secondary_conn_retry = secondary_conn_retry + 1
if secondary_conn_retry > CONNECTION_RETRY_LIMIT then
secondary_conn_retry = 0
core.Debug("ERROR: Hit retry limit! Could not create a secondary connection")
return nil
end
return create_secondary_connection()
end
secondary_conn_retry = 0
return secondary_conn
end
primary_conn = create_primary_connection()
secondary_conn = create_secondary_connection()
primary_lag, secondary_lag = 0, 0
-- arg: HAProxy server object
-- returns: String: DOWN or UP
function server_status(server)
return server:get_stats()['status']
end
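-- e.g. something like server_status(core.backends['tidb-primary'].servers['mysql-primary'])
-- should yield the health check state as "UP" or "DOWN"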
-- arg: Mysql connection object
-- returns: boolean
function is_connected(conn)
core.Debug('Connection: ' .. tostring(conn))
return (string.match(tostring(conn),"MySQL connection") ~= nil and string.match(tostring(conn),"closed") == nil)
end
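-- a live handle stringifies to something like "MySQL connection (0x...)",
-- while the second match assumes a closed handle includes "closed" in its
-- string form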
-- arg: string
-- return: Mysql connection object
function create_connection(backend_name)
if backend_name == 'tidb-primary' then
return create_primary_connection()
elseif backend_name == 'tidb-secondary' then
return create_secondary_connection()
end
end
function query_to_check_lag(conn)
return conn:execute('SELECT replication_lag_dummy from replication_lag_test;')
end
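-- luasql's conn:execute returns a cursor object for a SELECT, or nil plus an
-- error message on failure; check_lag below wraps the call in pcall because
-- execute can also raise a Lua error, e.g. when the handle has gone away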
-- arg: Mysql connection object, String: tidb-primary or tidb-secondary
-- returns: lag
function check_lag(conn, server_name)
core.Debug('Checking lag for ' .. server_name .. ' conn: ' .. tostring(conn))
conn_status, cur, err = pcall(query_to_check_lag, conn)
if not conn_status then
-- pcall itself failed: cur holds the error message, not a cursor
core.Debug("ERROR: check_lag: " .. tostring(cur))
return nil
end
if err ~= nil then
core.Debug("ERROR: check_lag: " .. tostring(err))
end
if cur == nil then
core.Debug("Nil cursor. Couldn't fetch replication lag for " .. server_name)
return nil
end
-- fetch() returns the first column of the first row as a string
local lag = tonumber(cur:fetch())
cur:close()
return lag
end
function time_between_checks_not_passed()
now_time = os.time()
if (now_time - start_time <= SECONDS_TIME_BETWEEN_CHECKS) then
return true
else
start_time = now_time
return false
end
end
function pick_backends(txn)
-- return the previously picked backend if the time between checks has not passed
if time_between_checks_not_passed() then
return selected_backend
end
-- execute primary lag query
primary_lag = nil
if is_connected(primary_conn) then
primary_lag = check_lag(primary_conn,'tidb-primary')
else
core.Debug('Connection lost! Reconnecting to primary server:')
primary_conn = create_connection('tidb-primary')
if is_connected(primary_conn) then
primary_lag = check_lag(primary_conn,'tidb-primary')
end
end
-- fall back to the secondary if the primary lag could not be fetched or is too high
if primary_lag == nil then
core.Debug("could not check for replication lag for primary..")
else
core.Debug('primary lag is:' .. tostring(primary_lag))
if primary_lag < LAG_THRESHOLD then
core.Debug('Returning primary server.')
-- should match the backend name in haproxy.cfg
selected_backend = 'tidb-primary'
return 'tidb-primary'
end
end
-- pick secondary from backends table
secondary_lag = nil
if is_connected(secondary_conn) then
secondary_lag = check_lag(secondary_conn,'tidb-secondary')
else
core.Debug("Connection lost! Reconnecting to secondary server..")
secondary_conn = create_connection('tidb-secondary')
if is_connected(secondary_conn) then
secondary_lag = check_lag(secondary_conn,'tidb-secondary')
end
end
-- return default backend if secondary lag is not fetched
if secondary_lag == nil then
core.Debug("could not check for replication lag for secondary, returning default backend")
selected_backend = DEFAULT_BACKEND
return DEFAULT_BACKEND
end
-- print secondary lag if not nil
core.Debug('secondary lag is:' .. tostring(secondary_lag))
if secondary_lag < LAG_THRESHOLD then
core.Debug('Returning secondary server.')
-- should match the secondary server in haproxy.cfg
selected_backend = 'tidb-secondary'
return selected_backend
end
core.Debug("High replication lag, returning server with less lag")
-- if both are high, return server with less lag
if secondary_lag < primary_lag then
selected_backend = 'tidb-secondary'
return selected_backend
else
selected_backend = 'tidb-primary'
return selected_backend
end
end
core.register_fetches('pick_backend', pick_backends)
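For the TODO near the top of the script, this is roughly what I have in mind for resolving the server address from HAProxy's own backends table instead of hardcoding the hostnames. It is only an untested sketch, assuming core.backends, Proxy.servers and Server:get_addr() behave as the Lua API documentation describes, and it would probably need to run from core.register_init so the backends table is populated first:
-- untested sketch: look up the address of the first server in a named backend
function first_server_addr(backend_name)
local backend = core.backends[backend_name]
if backend == nil then
return nil
end
for _, server in pairs(backend.servers) do
-- get_addr() is documented to return the server's address as a string
return server:get_addr()
end
return nil
end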
Here’s the HAProxy configuration:
global
log stdout format raw local0 debug
lua-load /usr/local/etc/haproxy/lagcheck/lagcheck.lua
maxconn 100
user haproxy
group haproxy
daemon
defaults
log global
retries 2
timeout connect 7s
timeout server 30000s
timeout client 30000s
frontend tidb
bind *:4001
mode tcp
use_backend %[lua.pick_backend]
option tcplog
backend tidb-primary
mode tcp
server mysql-primary mysql-primary-master:4000 check inter 5000 rise 2 fall 2 maxconn 300
backend tidb-secondary
mode tcp
server mysql-secondary mysql-secondary-master:4000 check inter 5000 rise 2 fall 2 maxconn 300
frontend metrics
mode http
bind *:8080
option http-use-htx
http-request use-service prometheus-exporter if { path /metrics }
stats enable
stats uri /stats
stats refresh 2s
listen stats
bind *:1932
mode http
log global
maxconn 500
stats enable
stats hide-version
stats refresh 30s
stats show-node
stats auth admin:password
stats uri /haproxy?stats
listen health_check_http_url
bind *:8888
mode http
maxconn 5000
monitor-uri /healthz
option dontlognull
I understand the backend addresses used in the script are not being picked up from haproxy.cfg; that was changed on purpose to test a few things. There is also plenty of room for refactoring, but all of it seems unrelated to the issue at hand.