Hi All,
I am seeing an issue in our HAProxy where the backend HTTP responses total stats are not being updated for a few backends. It's random behavior. We run HAProxy in pods, and the backends are periodically refreshed using confd. We use the Prometheus haproxy exporter to retrieve the stats.
Is there some sort of sampling in the backend HTTP response totals metric? Is there any special case in which it would not update the stats?
Thanks,
Swapnil
Are you perhaps using HAProxy with multiple processes (i.e. by setting nbproc)?
I’m asking because apparently the stats page is individual for each process, and if you don’t use the stats bind-process option, you’ll get the stats from a random process on each request.
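Something like this is what I mean, as a rough sketch (nbproc and bind-process only exist in the pre-2.5 multi-process model; the port and process numbers are just examples):

```
global
    nbproc 4

# Pin the stats frontend to a single process, so every scrape of the
# exporter sees the same process's counters instead of a random one's.
listen stats
    bind :8404
    bind-process 1
    stats enable
    stats uri /stats
```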
No, we don’t have multiple processes. I think what is happening here is that when confd reloads the config for the HAProxy process, it resets the stats. The randomness in the behavior is because confd only reloads the HAProxy config when the newly generated config differs from the existing one.
Is there a way to preserve metrics across config reloads?
Yes, the statistics are reset whenever the HAProxy process is restarted. (Perhaps even when “reloaded”?)
I’m not sure how to solve this, but you could take a look at server-state-file and load-server-state-from-file; perhaps those are helpful in preserving statistics.
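As a rough sketch (paths are placeholders; before each reload you would dump the state yourself, e.g. with `echo "show servers state" | socat stdio /var/run/haproxy.sock > /var/lib/haproxy/server-state`):

```
global
    stats socket /var/run/haproxy.sock mode 600 level admin
    # file that the next process reads its server state from
    server-state-file /var/lib/haproxy/server-state

defaults
    # load the dumped state when the new process starts
    load-server-state-from-file global
```

As far as I can tell this preserves server state (up/down status, weights) rather than the traffic counters, so it may not cover your response totals, but it’s the closest built-in mechanism I’m aware of.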
We just discovered this problem too.
Yes, it seems haproxy resets Prometheus stats on reload.
It’s true that it shouldn’t be a problem for counters in Prometheus: a counter can never legitimately decrease, so a reset can be detected and handled like an overflow.
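For example, with a query like this (metric name as exported by the haproxy_exporter; the backend label value is a placeholder), `rate()` detects the reset and compensates just like a counter wrap:

```
# A decrease in a counter is treated as a reset/overflow, so the
# per-second rate stays correct across a HAProxy reload.
rate(haproxy_backend_http_responses_total{backend="my-backend"}[5m])
```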
HOWEVER … HAProxy also exports a lot of GAUGES … and those have to reflect the actual value.
Are you saying there are incorrect gauges exported to prometheus on reload? If yes, which ones exactly?
I only realized after posting that the last comment was not from Feb 19th 2022, but from 2019.
So after that I downloaded HAProxy 2.5.4 and tested. The problem doesn’t seem to appear in my test setup, but I also didn’t run it in exactly the same way as our Ubuntu Focal / HAProxy 2.0.x setup, so I’ll have to do some further tests to see whether the problem still exists today. I haven’t found any sign in the HAProxy changelog that it has been addressed, though.
But on the default Ubuntu Focal HAProxy 2.0.13 it’s clear that Prometheus metrics are reset on a process reload; as you say, that should not be a problem for counters, but it surely is for gauges.
I’ll report back on whether the problem is still present in 2.5.x.
I don’t understand what “resetting gauges” even means. They either show updated and correct values or not.
And right after the reload, things like the “current connection count” will certainly be zero, because there are indeed zero connections connected to the new process initially.
As I said… I have to check if it’s still an issue.
But what we see on 2.0.13 is that a reload causes the gauge to go to 0, and from then on only INC and DEC affect it. It thus no longer shows the 1000+ connections that still exist.
The result is a server where netstat shows that haproxy has several thousand TCP connections on front/back … but the exported Prometheus gauges only reflect the ones established after the reload.
There are no existing connections after reload. This is my point. When you reload, a new process starts with 0 connections.
Connections are NOT migrated from the old to the new process. Instead, the old process keeps handling the old connections, until all of them are closed.
The new process starts with 0 connections and this number will only increase when actual new connections come in.
So this is completely expected behavior.
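You can actually observe this in master-worker mode: the master CLI lists the old workers that are still draining connections after a reload. Roughly (assuming haproxy was started with -W -S /var/run/haproxy-master.sock; the socket path is just an example):

```
# lists the current worker plus any old workers that are still
# finishing connections accepted before the reload
echo "show proc" | socat stdio /var/run/haproxy-master.sock
```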
Ah… I think I understand now…
It makes some sense. And it also explains why my small tests with HAProxy 2.5.4 didn’t show the problem, since I just ran it in the foreground with 1 process.
However… it still makes the gauges not worth much in a setup with very long-running connections. For a long time, all the old connections still connected to the old process will be invisible to Prometheus.
Exactly, the exported connection count only covers the current process; as long as you have older processes in the background, the aggregated connection count will be higher.
Whether this specific gauge is useful depends on the situation.
Monitoring the sockets in various states (established and the various closing states) is probably a good idea to understand the bigger picture.
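For example, a rough sketch with ss, which counts sockets across all haproxy processes, old workers included:

```
# established TCP connections held by any haproxy process
ss -tnp state established | grep -c '"haproxy"'

# per-TCP-state breakdown (ESTAB, FIN-WAIT-1, CLOSE-WAIT, ...)
ss -tanp | grep '"haproxy"' | awk '{print $1}' | sort | uniq -c
```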
Yes, we actually did monitor the sockets before enabling the HAProxy Prometheus metrics. Maybe we should go back to that.
Even if the reason is a replacement of the process, the semantics are the same: the gauges are reset and no longer reflect the state of the service (as seen from the outside).