Can HAProxy affect internal latency on the service server?

I set up an HAProxy server in front of 10 web servers and tested its performance, because my web service is very sensitive to latency. I measured many performance-related metrics.
I compared two cases:
[1] the request is sent to the web server directly.
[2] the request is sent to the web server through the HAProxy server.

The first thing I noticed is the internal latency, which is the elapsed time on the web server from receiving a request to sending the response.
Compared to [1], the internal latency is smaller in [2].
This looks really strange, because the web server logic is exactly the same in both cases.
I have been trying to debug this since last month but could not find the reason.
I tried to locate where the internal latency difference comes from, and it turned out to be the code that creates goroutines.
That code creates 4~5 goroutines, and it takes less time to execute in [2] than in [1].
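A minimal sketch of how the goroutine-creating section could be timed in two parts: the cost of the `go` statements themselves, and the time until all goroutines have finished. The helper name `spawnAndWait` and the sleep used as a stand-in for real work are assumptions for illustration, not the actual service code.

```go
package main

import (
	"fmt"
	"sync"
	"time"
)

// spawnAndWait starts n goroutines that each run work. It returns how long
// the go statements took (spawn) and how long it took until every goroutine
// had finished (total). Hypothetical helper, not part of the real service.
func spawnAndWait(n int, work func()) (spawn, total time.Duration) {
	var wg sync.WaitGroup
	start := time.Now()
	for i := 0; i < n; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			work()
		}()
	}
	spawn = time.Since(start) // only the cost of creating the goroutines
	wg.Wait()
	total = time.Since(start) // creation + scheduling + the work itself
	return spawn, total
}

func main() {
	// 4 goroutines, as in the handler; the sleep stands in for the real work.
	spawn, total := spawnAndWait(4, func() { time.Sleep(time.Millisecond) })
	fmt.Printf("spawn=%v total=%v\n", spawn, total)
}
```

If the difference between [1] and [2] shows up in `total` but not in `spawn`, that would suggest the goroutines wait longer to be scheduled or to finish their work, rather than being slower to create.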

But I really don't know why this happens. It should not be related to the network, because it is measured inside the server.
My guess was that HAProxy saves resources by reusing sockets or connections, which makes the system more efficient and the internal latency smaller.
So I measured the internal latency again with no option http-server-close, no option http-keep-alive, and similar settings.
But the result was the same as before: the internal latency of [2] is always smaller than that of [1].
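One way to check the connection-reuse guess directly is to count accepted connections against handled requests on the Go side. The sketch below is an assumption-laden illustration: the type `connStats` and the function `countConns` are hypothetical names, and it uses a local test server instead of the real service. In the real setup nginx sits in front of the Go server, so such a counter would show how nginx connects to Go, not how HAProxy connects to nginx.

```go
package main

import (
	"fmt"
	"io"
	"net"
	"net/http"
	"net/http/httptest"
	"sync/atomic"
)

// connStats counts accepted TCP connections and handled requests.
type connStats struct {
	newConns int64
	requests int64
}

// track is installed as http.Server.ConnState and counts new connections.
func (s *connStats) track(_ net.Conn, state http.ConnState) {
	if state == http.StateNew {
		atomic.AddInt64(&s.newConns, 1)
	}
}

// wrap counts every request that passes through next.
func (s *connStats) wrap(next http.Handler) http.Handler {
	return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		atomic.AddInt64(&s.requests, 1)
		next.ServeHTTP(w, r)
	})
}

// countConns sends n sequential requests to a local test server, with or
// without client keep-alive, and reports connections and requests seen.
func countConns(n int, keepAlive bool) (conns, reqs int64) {
	stats := &connStats{}
	ok := http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		fmt.Fprint(w, "ok")
	})
	srv := httptest.NewUnstartedServer(stats.wrap(ok))
	srv.Config.ConnState = stats.track
	srv.Start()
	defer srv.Close()

	client := &http.Client{Transport: &http.Transport{DisableKeepAlives: !keepAlive}}
	for i := 0; i < n; i++ {
		resp, err := client.Get(srv.URL)
		if err != nil {
			panic(err)
		}
		io.Copy(io.Discard, resp.Body) // drain so the connection can be reused
		resp.Body.Close()
	}
	return atomic.LoadInt64(&stats.newConns), atomic.LoadInt64(&stats.requests)
}

func main() {
	c, r := countConns(5, true)
	fmt.Printf("keep-alive:    conns=%d requests=%d\n", c, r)
	c, r = countConns(5, false)
	fmt.Printf("no keep-alive: conns=%d requests=%d\n", c, r)
}
```

If the connections-per-request ratio differs between [1] and [2], the server is doing a different amount of accept and teardown work in the two cases even though the handler logic is unchanged.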

Could you give me some advice on why this happens?
My HAProxy version is HA-Proxy version 1.5.18 2016/05/10
and the configuration is below.

global
    log local2
    chroot      /var/lib/haproxy
    pidfile     /var/run/
    maxconn     40000
    user        haproxy
    group       haproxy
    # turn on stats unix socket
    stats socket /var/lib/haproxy/stats level admin
    nbproc 4
    cpu-map 1 0
    cpu-map 2 1
    cpu-map 3 2
    cpu-map 4 3
    stats bind-process 4

defaults
    mode                    http
    log                     global
    option                  httplog
    option                  dontlognull
    option http-server-close
#    option http-keep-alive
#    no option http-keep-alive
#    option http-no-delay
    option forwardfor       except
    option                  redispatch
    retries                 3
    timeout http-request    100s
    timeout queue           10m
    timeout connect         100s
    timeout client          100m
    timeout server          100m
    timeout http-keep-alive 100s
    timeout check           100s
    maxconn                 30000
frontend  main *:8080
    mode http
    bind-process 1 2 3
    default_backend             app
frontend stats *:9000
    stats enable
    stats uri /haproxy_stats

backend app
    mode http
    balance     roundrobin
    option httpchk GET /nginx_status
    option httplog
    server  app1 check
    server  app2 check
    server  app3 check
    server  app4 check
    server  app5 check
    server  app6 check
    server  app7 check
    server  app8 check
    server  app9 check
    server  app10 check

You’d have to explain at the very least:

  • what benchmark tool you are using
  • the *exact and complete* benchmarking configuration
  • the exact and complete benchmarking results
I appreciate your help :+1: :bowing_man:

  • The benchmark tool I used is nGrinder.

    • It runs stress tests from a user-written script.
    • I wrote a script that sends HTTP requests to each server at a set TPS.
    • It reports the test results, including network latency, total errors, and so on.
  • The benchmark configuration is below.

    • Experiment topology
      • The 1st topology is [HWLB(L3DSR mode)] - [10 Servers]
      • The 2nd topology is [HWLB(L3DSR mode)] - [HAProxy Server] - [10 Servers]
    • In both topologies, I send HTTP requests at about 1000 TPS to the HWLB with nGrinder, so each server receives about 100 TPS from the HWLB or from HAProxy.
    • Each server runs nginx and our web server written in Go.
    • The HAProxy server and the 10 servers have the same specification: Xeon Silver 4210 (2.2GHz/10core)*2, 64GB RAM, 25G NIC
    • Measured metrics
      • CPU and memory usage: with Linux tools such as htop
      • Internal latency: measured with Go's time package and collected by Prometheus
        • start stamp: right after receiving the request, when our service logic starts
        • end stamp: right before sending the response, when our service logic ends
  • The benchmark results are below. I get these results repeatedly.

    • The request rate graph shows that each server receives 100 TPS on average.
    • The p99 latency graph to the left of the request rate graph shows that the HWLB-only case has slightly higher latency. Some spikes come from API delays when our service server calls external servers.
    • The p99 latency graph below it is measured around the code that creates 3~4 goroutines. The HWLB-only case has a slightly higher elapsed time in this code.
    • CPU and memory are not the bottleneck, because their usage is below 20%. The same is true for NIC usage.
    • Because the internal latency difference comes from the goroutine-creating code, I checked the Go GC duration count, the Go GC duration, and the number of goroutines in each test case, but they were the same.
    • However, some other Go-related metrics were different, such as Go heap/stack memory usage, the number of fds, and so on.
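The Go-side metrics mentioned in the list above (goroutine count, heap and stack usage, GC count, open fds) can be captured in one snapshot and compared between the two runs. This is only a sketch: the `snapshot` type and `takeSnapshot` function are hypothetical names, and the fd count assumes a Linux host where /proc/self/fd is readable.

```go
package main

import (
	"fmt"
	"os"
	"runtime"
)

// snapshot holds a few runtime values that can be compared between runs.
type snapshot struct {
	Goroutines int
	HeapInuse  uint64 // bytes in in-use heap spans
	StackInuse uint64 // bytes in in-use stack spans
	NumGC      uint32 // completed GC cycles
	OpenFDs    int    // -1 when /proc/self/fd cannot be read (non-Linux)
}

// takeSnapshot reads the current runtime statistics of this process.
func takeSnapshot() snapshot {
	var m runtime.MemStats
	runtime.ReadMemStats(&m)

	fds := -1
	if entries, err := os.ReadDir("/proc/self/fd"); err == nil {
		fds = len(entries)
	}

	return snapshot{
		Goroutines: runtime.NumGoroutine(),
		HeapInuse:  m.HeapInuse,
		StackInuse: m.StackInuse,
		NumGC:      m.NumGC,
		OpenFDs:    fds,
	}
}

func main() {
	fmt.Printf("%+v\n", takeSnapshot())
}
```

Logging such a snapshot at a fixed interval during both tests would show whether the fd and stack differences line up in time with the latency difference.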

I think this is really strange, because the internal logic of the service server never changed. The only change is the use of HAProxy. I don't understand my test results, but I am still trying to find the answer.
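For comparing the two runs beyond a single p99 line, the start/end stamp measurement can be reproduced in a small harness that keeps raw samples and computes any percentile. This is a sketch under assumptions: `recorder` is an in-memory stand-in for the Prometheus collection, `percentile` uses the nearest-rank method, and the sleep stands in for the service logic.

```go
package main

import (
	"fmt"
	"math"
	"sort"
	"sync"
	"time"
)

// recorder keeps raw latency samples in memory (stand-in for Prometheus).
type recorder struct {
	mu      sync.Mutex
	samples []time.Duration
}

// observe takes the start stamp, runs fn, takes the end stamp, and stores
// the elapsed time.
func (r *recorder) observe(fn func()) {
	start := time.Now()
	fn()
	elapsed := time.Since(start)
	r.mu.Lock()
	r.samples = append(r.samples, elapsed)
	r.mu.Unlock()
}

// millis returns a copy of the samples converted to milliseconds.
func (r *recorder) millis() []float64 {
	r.mu.Lock()
	defer r.mu.Unlock()
	out := make([]float64, len(r.samples))
	for i, d := range r.samples {
		out[i] = float64(d) / float64(time.Millisecond)
	}
	return out
}

// percentile returns the nearest-rank p-th percentile (0 < p <= 100) of
// values, or 0 for an empty slice. The input slice is not modified.
func percentile(values []float64, p float64) float64 {
	if len(values) == 0 {
		return 0
	}
	sorted := append([]float64(nil), values...)
	sort.Float64s(sorted)
	rank := int(math.Ceil(p * float64(len(sorted)) / 100))
	if rank < 1 {
		rank = 1
	}
	return sorted[rank-1]
}

func main() {
	r := &recorder{}
	for i := 0; i < 200; i++ {
		r.observe(func() { time.Sleep(100 * time.Microsecond) })
	}
	ms := r.millis()
	fmt.Printf("p50=%.3fms p99=%.3fms\n", percentile(ms, 50), percentile(ms, 99))
}
```

Looking at p50 next to p99 for both cases would show whether the whole distribution shifts or only the tail does, which points to different kinds of causes.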