HAProxy terminates unexpectedly while executing a MySQL replication lag check

I am writing a Lua script that checks the replication lag between two MySQL instances and fails over to the secondary instance when the primary is unavailable or its replication lag is too high. HAProxy runs as a Kubernetes deployment. I am seeing random container restarts, once or twice a day, with the following error:

Connection: MySQL connection (0x55a65e962118)
Checking lag for tidb-primary conn: MySQL connection (0x55a65e962118)
primary lag is:99141
Connection: MySQL connection (0x55a65e96a968)
Checking lag for tidb-secondary conn: MySQL connection (0x55a65e96a968)
Thread 4 is about to kill the process.
 >Thread 1 : id=0x7f4d3c214200 act=1 glob=0 wq=1 rq=0 tl=0 tlsz=0 rqsz=0
             stuck=1 prof=1 harmless=0 wantrdv=0
             cpu_ns: poll=13486920498 now=15196499898 diff=1709579400
             curr_task=0x55a65e920a00 (task) calls=1 last=1712927637 ns ago
               fct=0x55a65e440590(process_stream) ctx=0x55a65e9fc330
             strm=0x55a65e9fc330 src=10.1.136.232 fe=tidb be=tidb dst=unknown
             rqf=908021 rqa=0 rpf=80000000 rpa=0 sif=EST,80020 sib=INI,30
             af=(nil),0 csf=0x55a65e9fc240,9000
             ab=(nil),0 csb=(nil),0
             cof=0x55a65e935920,80041300:PASS(0x55a65e9df2b0)/RAW((nil))/tcpv4(28)
             cob=(nil),0:NONE((nil))/NONE((nil))/NONE(0)
             Current executing Lua from a stream analyser --
             call trace(17):
             | 0x55a65e4b427b [89 c5 85 c0 0f 85 db 01]: ha_thread_dump+0x24b/0x448
             | 0x55a65e4b44e6 [48 8b 05 c3 6c 22 00 48]: debug_handler+0x66/0x10e
             | 0x7f4d3c9af730 [48 c7 c0 0f 00 00 00 0f]: libpthread:+0x12730
             | 0x55a65e3bb062 [e9 79 ff ff ff 66 0f 1f]: hlua_ctx_init+0xb2/0x186
             | 0x55a65e3bd871 [85 c0 0f 84 3f 01 00 00]: main+0x32211
             | 0x55a65e456eed [85 c0 0f 84 8b 00 00 00]: sample_process+0x4d/0x127
             | 0x55a65e457a0c [48 85 c0 74 3f 64 48 63]: sample_fetch_as_type+0x3c/0x9b
             | 0x55a65e485baa [48 85 c0 0f 84 e4 15 00]: sess_build_logline+0x204a/0x3b39
             | 0x55a65e43949b [85 c0 75 20 48 8b 3d 62]: main+0xade3b
             | 0x55a65e4424c9 [85 c0 0f 85 b3 f2 ff ff]: process_stream+0x1f39/0x4f76
 >Thread 2 : id=0x7f4d3c1f0700 act=1 glob=0 wq=1 rq=0 tl=0 tlsz=0 rqsz=0
             stuck=1 prof=1 harmless=0 wantrdv=0
             cpu_ns: poll=12105320018 now=14481791728 diff=2376471710
             curr_task=0x7f4d34017330 (task) calls=1 last=2380083314 ns ago
               fct=0x55a65e440590(process_stream) ctx=0x7f4d34017560
             strm=0x7f4d34017560 src=10.1.136.232 fe=tidb be=tidb dst=unknown
             rqf=908021 rqa=0 rpf=80000000 rpa=0 sif=EST,80020 sib=INI,30
             af=(nil),0 csf=0x7f4d34024790,9000
             ab=(nil),0 csb=(nil),0
             cof=0x7f4d3402b200,80041300:PASS(0x7f4d340247e0)/RAW((nil))/tcpv4(27)
             cob=(nil),0:NONE((nil))/NONE((nil))/NONE(0)
             Current executing Lua from a stream analyser --
             call trace(16):
             | 0x55a65e4b427b [89 c5 85 c0 0f 85 db 01]: ha_thread_dump+0x24b/0x448
             | 0x55a65e4b44e6 [48 8b 05 c3 6c 22 00 48]: debug_handler+0x66/0x10e
             | 0x7f4d3c9af730 [48 c7 c0 0f 00 00 00 0f]: libpthread:+0x12730
             | 0x55a65e3bb062 [e9 79 ff ff ff 66 0f 1f]: hlua_ctx_init+0xb2/0x186
             | 0x55a65e3bd871 [85 c0 0f 84 3f 01 00 00]: main+0x32211
             | 0x55a65e456eed [85 c0 0f 84 8b 00 00 00]: sample_process+0x4d/0x127
             | 0x55a65e457a0c [48 85 c0 74 3f 64 48 63]: sample_fetch_as_type+0x3c/0x9b
             | 0x55a65e485baa [48 85 c0 0f 84 e4 15 00]: sess_build_logline+0x204a/0x3b39
             | 0x55a65e43949b [85 c0 75 20 48 8b 3d 62]: main+0xade3b
             | 0x55a65e4424c9 [85 c0 0f 85 b3 f2 ff ff]: process_stream+0x1f39/0x4f76
  Thread 3 : id=0x7f4d3b0af700 act=1 glob=0 wq=0 rq=0 tl=0 tlsz=0 rqsz=0
             stuck=0 prof=1 harmless=0 wantrdv=0
             cpu_ns: poll=12751601188 now=12754527148 diff=2925960
             curr_task=0x7f4d2c0170a0 (task) calls=1 last=2507502616 ns ago
               fct=0x55a65e440590(process_stream) ctx=0x7f4d2c016c50
             strm=0x7f4d2c016c50 src=10.1.136.232 fe=tidb be=tidb dst=unknown
             rqf=908021 rqa=0 rpf=80000000 rpa=0 sif=EST,80020 sib=INI,30
             af=(nil),0 csf=0x7f4d2c0318a0,9000
             ab=(nil),0 csb=(nil),0
             cof=0x7f4d2c0326e0,80041300:PASS(0x7f4d2c0403a0)/RAW((nil))/tcpv4(24)
             cob=(nil),0:NONE((nil))/NONE((nil))/NONE(0)
             Current executing Lua from a stream analyser -- stack traceback:
                 [C]: in method 'execute'
                 /usr/local/etc/haproxy/lagcheck/lagcheck.lua:91: in function line 90
                 [C]: in global 'pcall'
                 /usr/local/etc/haproxy/lagcheck/lagcheck.lua:98: in global 'check_lag'
                 /usr/local/etc/haproxy/lagcheck/lagcheck.lua:160: in function line 124
*>Thread 4 : id=0x7f4d3a8ae700 act=1 glob=0 wq=0 rq=0 tl=0 tlsz=0 rqsz=0
             stuck=1 prof=1 harmless=0 wantrdv=0
             cpu_ns: poll=11591051168 now=14044129048 diff=2453077880
             curr_task=0x7f4d300170a0 (task) calls=1 last=2461987384 ns ago
               fct=0x55a65e440590(process_stream) ctx=0x7f4d30034b90
             strm=0x7f4d30034b90 src=10.1.135.141 fe=tidb be=tidb dst=unknown
             rqf=908021 rqa=0 rpf=80000000 rpa=0 sif=EST,80020 sib=INI,30
             af=(nil),0 csf=0x7f4d3002d350,9000
             ab=(nil),0 csb=(nil),0
             cof=0x7f4d30016a70,80041300:PASS(0x7f4d30025ab0)/RAW((nil))/tcpv4(26)
             cob=(nil),0:NONE((nil))/NONE((nil))/NONE(0)
             Current executing Lua from a stream analyser --
             call trace(20):
             | 0x55a65e51d607 [48 83 c4 10 5b 5d 41 5c]: wdt_handler+0x107/0x114
             | 0x7f4d3c9af730 [48 c7 c0 0f 00 00 00 0f]: libpthread:+0x12730
             | 0x55a65e3bb062 [e9 79 ff ff ff 66 0f 1f]: hlua_ctx_init+0xb2/0x186
             | 0x55a65e3bd871 [85 c0 0f 84 3f 01 00 00]: main+0x32211
             | 0x55a65e456eed [85 c0 0f 84 8b 00 00 00]: sample_process+0x4d/0x127
             | 0x55a65e457a0c [48 85 c0 74 3f 64 48 63]: sample_fetch_as_type+0x3c/0x9b
             | 0x55a65e485baa [48 85 c0 0f 84 e4 15 00]: sess_build_logline+0x204a/0x3b39
             | 0x55a65e43949b [85 c0 75 20 48 8b 3d 62]: main+0xade3b
             | 0x55a65e4424c9 [85 c0 0f 85 b3 f2 ff ff]: process_stream+0x1f39/0x4f76
[NOTICE] 119/123741 (1) : haproxy version is 2.2.13-5f3eb59
[NOTICE] 119/123741 (1) : path to executable is /usr/local/sbin/haproxy
[ALERT] 119/123741 (1) : Current worker #1 (7) exited with code 134 (Aborted)
[ALERT] 119/123741 (1) : exit-on-failure: killing every processes with SIGTERM
[WARNING] 119/123741 (1) : All workers exited. Exiting... (134)

I am fairly new to Lua. After a few changes, my script looks like this:

-- This script requires two backends named tidb-primary and tidb-secondary
-- in the HAProxy backend configuration.

db_driver = require "luasql.mysql"
env = db_driver.mysql()
CONNECTION_RETRY_LIMIT = 5
DEFAULT_BACKEND = 'tidb-primary'
LAG_THRESHOLD = 5000
SECONDS_TIME_BETWEEN_CHECKS = 3
selected_backend = DEFAULT_BACKEND
start_time = os.time()


function print_connection_error()
  core.Debug("Failed to connect to tidb")
end

function primary_connection()
  return env:connect('api', 'haproxy', 'password', 'mysql-master-url', 4000)
end

function secondary_connection()
  return env:connect('api', 'haproxy', 'password', 'mysql-secondary-master-url', 4000)
end

-- Todo: pick primary server-addr from backends table
core.Debug('creating a connection to primary tidb server..')
primary_conn_retry = 0
function create_primary_connection()
    conn_status, primary_conn, err = pcall(primary_connection)
    if err ~= nil then
      core.Debug("ERROR: in create_primary_connection: " .. tostring(err))
      primary_conn_retry = primary_conn_retry + 1
      if primary_conn_retry > CONNECTION_RETRY_LIMIT then
        primary_conn_retry = 0
        core.Debug("ERROR: Hit retry limit! Could not create a primary connection")
        return nil
      end
      return create_primary_connection()
    else
      return primary_conn
    end
end

core.Debug('creating a connection to secondary tidb server..')
secondary_conn_retry = 0
function create_secondary_connection()
    conn_status, secondary_conn, err = pcall(secondary_connection)
    if err ~= nil then
      core.Debug("ERROR: in create_secondary_connection: " .. tostring(err))
      secondary_conn_retry = secondary_conn_retry + 1
      if secondary_conn_retry > CONNECTION_RETRY_LIMIT then
        secondary_conn_retry = 0
        core.Debug("ERROR: Hit retry limit! Could not create a secondary connection")
        return nil
      end
      return create_secondary_connection()
    else
      return secondary_conn
    end
end

primary_conn = create_primary_connection()
secondary_conn = create_secondary_connection()
primary_lag, secondary_lag = 0, 0

-- arg: HAProxy server object
-- returns: String: DOWN or UP
function server_status(server)
  return server:get_stats()['status']
end

-- arg: Mysql connection object
-- returns: boolean
function is_connected(conn)
  core.Debug('Connection: ' .. tostring(conn))
  return (string.match(tostring(conn),"MySQL connection") ~= nil and string.match(tostring(conn),"closed") == nil)
end

-- arg: string
-- return: Mysql connection object
function create_connection(backend_name)
  if backend_name == 'tidb-primary' then
    return create_primary_connection()
  elseif backend_name == 'tidb-secondary' then
    return create_secondary_connection()
  end
end

function query_to_check_lag(conn)
  return conn:execute('SELECT replication_lag_dummy from replication_lag_test;')
end

-- arg: Mysql connection object, String: tidb-primary or tidb-secondary
-- returns: lag
function check_lag(conn, server_name)
    core.Debug('Checking lag for ' .. server_name .. ' conn: ' .. tostring(conn))
    conn_status, cur, err = pcall(query_to_check_lag, conn)

    if err ~= nil then
        core.Debug("ERROR: check_lag: " .. tostring(err))
    end

    if cur == nil then
        core.Debug("Nil cursor. Couldn't fetch replication lag for " .. server_name)
        return nil
    end

    lag = tonumber(cur:fetch())
    cur:close()
    return lag
end

function time_between_checks_not_passed()
  now_time = os.time()
  if (now_time - start_time <= SECONDS_TIME_BETWEEN_CHECKS) then
    return true
  else
    start_time = now_time
    return false
  end
end

function pick_backends(txn)

-- return the previously picked backend if the time between checks has not passed
  if time_between_checks_not_passed() then
    return selected_backend
  end

-- execute primary lag query
  primary_lag = nil
  if is_connected(primary_conn) then 
    primary_lag = check_lag(primary_conn,'tidb-primary')
  else
    core.Debug('Connection lost! Reconnecting to primary server:')
    primary_conn = create_connection('tidb-primary')
    if is_connected(primary_conn) then
        primary_lag = check_lag(primary_conn,'tidb-primary')
    end
  end

-- return primary if its lag was fetched and is below the threshold
  if primary_lag == nil then
    core.Debug("could not check for replication lag for primary..")
  else
    core.Debug('primary lag is:' .. tostring(primary_lag))
    if primary_lag < LAG_THRESHOLD then
      core.Debug('Returning primary server.')
      -- should match the backend name in haproxy.cfg
      selected_backend = 'tidb-primary'
      return 'tidb-primary'
    end
  end

-- pick secondary from backends table
  secondary_lag = nil
  if is_connected(secondary_conn) then 
    secondary_lag = check_lag(secondary_conn,'tidb-secondary')
  else
    core.Debug("Connection lost! Reconnecting to secondary server..")
    secondary_conn = create_connection('tidb-secondary')
    if is_connected(secondary_conn) then
      secondary_lag = check_lag(secondary_conn,'tidb-secondary')
    end
  end

-- return default backend if secondary lag is not fetched
  if secondary_lag == nil then
    core.Debug("could not check for replication lag for secondary, returning default backend")
    selected_backend = DEFAULT_BACKEND
    return DEFAULT_BACKEND
  end

-- print secondary lag if not nil
  core.Debug('secondary lag is:' .. tostring(secondary_lag))

  if secondary_lag < LAG_THRESHOLD then
    core.Debug('Returning secondary server.')
    -- should match the secondary server in haproxy.cfg
    selected_backend = 'tidb-secondary'
    return selected_backend
  end

  core.Debug("High replication lag, returning server with less lag")
-- if both are high, return server with less lag
  if secondary_lag < primary_lag then
    selected_backend = 'tidb-secondary'
    return selected_backend
  else
    selected_backend = 'tidb-primary'
    return selected_backend
  end
end
core.register_fetches('pick_backend', pick_backends)
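For reference, the three-variable `pcall` pattern above is based on my understanding of how `pcall` interacts with luasql's `(conn, err)` return convention. Here is a standalone sketch of that understanding (`fake_connect` is just a stand-in, no luasql needed):

```lua
-- Sketch: how pcall's ok flag combines with a luasql-style
-- (value, err) return convention. fake_connect is a stand-in.
local function fake_connect(should_fail)
  if should_fail then
    return nil, "connection refused"  -- luasql-style soft failure
  end
  return { dummy = true }             -- stand-in for a connection object
end

-- Success: ok=true, conn is the connection object, err=nil
local ok, conn, err = pcall(fake_connect, false)
print(ok, conn, err)

-- Soft failure: ok is still true (no Lua error was raised),
-- conn=nil and err carries the message
ok, conn, err = pcall(fake_connect, true)
print(ok, conn, err)

-- Hard failure: ok=false and the *second* value is the error message,
-- so err stays nil in this case
ok, conn, err = pcall(function() error("driver crashed") end)
print(ok, conn, err)
```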

Here’s my HAProxy configuration:

global
    log stdout format raw local0 debug
    lua-load /usr/local/etc/haproxy/lagcheck/lagcheck.lua
    maxconn 100
    user haproxy
    group haproxy
    daemon

defaults
    log global
    retries 2
    timeout connect 7s
    timeout server 30000s
    timeout client 30000s

frontend tidb
    bind *:4001
    mode tcp
    use_backend %[lua.pick_backend]
    option tcplog

backend tidb-primary
    mode tcp
    server mysql-primary mysql-primary-master:4000 check inter 5000 rise 2 fall 2 maxconn 300

backend tidb-secondary
    mode tcp
    server mysql-secondary mysql-secondary-master:4000 check inter 5000 rise 2 fall 2 maxconn 300

frontend metrics
    mode http
    bind *:8080
    option http-use-htx
    http-request use-service prometheus-exporter if { path /metrics }
    stats enable
    stats uri /stats
    stats refresh 2s

listen  stats
    bind            *:1932
    mode            http
    log             global

    maxconn 500

    stats enable
    stats hide-version
    stats refresh 30s
    stats show-node
    stats auth admin:password
    stats uri  /haproxy?stats

listen health_check_http_url
    bind *:8888
    mode http
    maxconn 5000
    monitor-uri /healthz
    option      dontlognull

I understand that the backend servers in the script are not picked up from haproxy.cfg; that was changed deliberately to test a few things. There is plenty of room for refactoring, but none of it seems related to the issue at hand.