ACL for Bad User-Agents with SubString Match Fails

georgeg · June 11, 2024, 10:12am

Hello,

We created an ACL to block bad user-agents/bots, but it returns a false-positive 403 in specific cases, which I explain below.

Here is the config:
acl is-blockedagent-http hdr_sub(user-agent) -f /etc/haproxy/agentblock.lst
http-request deny if is-blockedagent-http

The agentblock.lst list contains partial user-agent names rather than full user-agent strings. Sample:
Disco
Discobot
Discoverybot
ZumBot
ZyBorg

So the ACL above will normally return a 403 when it finds an exact match of “ZyBorg” anywhere within a user-agent, and will normally return a 200 for “ZyBor”, which is missing the final “g”.
But when similar words exist in the list, like “Disco”, “Discobot” and “Discoverybot”,
the ACL falsely returns a 403 when a user-agent includes “Discov”, “Discove”, “Discover”, “Discovery”, “Discoveryb”, “Discob” or “Discobo”.
So it triggers a 403 for anything starting with “Disco” followed by any of the letters of “bot” or “very”, which appear in “Discobot” and “Discoverybot”.
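
One way to read the behaviour described above: a substring match only asks whether any list entry occurs somewhere in the header, so the short entry “Disco” on its own is enough to match “Discov”, “Discob” and so on, and the longer entries are not involved. A minimal Python sketch of that logic, assuming a plain case-sensitive substring test (the function name and the sample user-agents are hypothetical, and this is a model of the matching, not HAProxy code):

```python
# Simplified model of a substring ACL: a header matches when any pattern
# from the list occurs anywhere inside it (case-sensitive).
PATTERNS = ["Disco", "Discobot", "Discoverybot", "ZumBot", "ZyBorg"]

def matching_patterns(user_agent, patterns=PATTERNS):
    """Return every pattern that occurs as a substring of user_agent."""
    return [p for p in patterns if p in user_agent]

# "Discov" contains the short entry "Disco", so that entry alone matches.
print(matching_patterns("Discov/1.0"))
# "ZyBor" contains no entry from the list, so nothing matches.
print(matching_patterns("ZyBor/1.0"))
```

Under this model, removing the bare “Disco” entry from the list would stop the partial names from matching, at the cost of no longer blocking agents that only contain “Disco”.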

We also tried adding a second ACL and list to fix the above, but it doesn’t work:
acl is-blockedagent-http hdr_sub(user-agent) -f /etc/haproxy/agentblock.lst
acl is-goodagent-http hdr_sub(user-agent) -f /etc/haproxy/goodagents.lst
http-request deny if is-blockedagent-http !is-goodagent-http

goodagent.lst:
Discob
Discobo
Discov
Discove
Discover
Discovery
Discoveryb
Discoverybo

Any ideas on how to resolve the above which only occurs when there are user-agents with similar names in the list?
The only alternative working solution we can think of is using req.fhdr(user-agent) but we would need a list of user-agents in their full form which is hard to find.

Thank you

adarragon · June 11, 2024, 12:12pm

Hi,

I’m assuming the user agent names always start with the word you want to match them against, right?

"Disco *blabla*"
"Zyborg *blabla*"
...

If so, then you could leverage the field() converter to only extract the first “word” of the user agent and compare it against the pattern file using the strict string (-m str) matching method:

# use space, slash, underscore and dash as word delimiter using field():
acl is-blockedagent-http "hdr(user-agent),field(1,' /-_')" -m str -f /etc/haproxy/agentblock.lst
http-request deny if is-blockedagent-http
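
To illustrate what this approach compares, here is a rough Python sketch of “take the text before the first delimiter, then require an exact match against the list”. It assumes field(1, ...) yields everything up to the first delimiter character; the helper names and sample user-agents are hypothetical, and this is a model of the idea rather than HAProxy code:

```python
BLOCKED = {"Disco", "Discobot", "Discoverybot", "ZumBot", "ZyBorg"}

def first_field(value, delimiters=" /-_"):
    """Return the text before the first delimiter character (field index 1)."""
    for i, ch in enumerate(value):
        if ch in delimiters:
            return value[:i]
    return value  # no delimiter found: the whole value is the first field

def is_blocked(user_agent):
    """Exact (strict string) match of the first field against the list."""
    return first_field(user_agent) in BLOCKED

for ua in ["Discobot/2.0 crawler", "Discov/1.0", "Mozilla/5.0 (compatible; Discobot/1.0)"]:
    print(ua, "->", repr(first_field(ua)), is_blocked(ua))
```

With an exact match, “Discov” no longer matches “Disco”, but a user-agent that starts with another token such as “Mozilla” is never blocked, which is why the approach depends on the assumption stated above.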

georgeg · June 11, 2024, 1:54pm

Hello,

Thank you for your response.
In most cases the user-agents don’t start with the word we would like to match. Examples of complete user-agent strings:

(1) Mozilla/4.0 compatible ZyBorg/1.0 (wn-14.zyborg@looksmart.net; http://www.WISEnutbot.com)
(2) Mozilla/5.0 (compatible; discobot/1.0; +http://discoveryengine.com/discobot.html)

Thank you

lukastribus · June 11, 2024, 2:57pm

Add more context to each pattern, for example a leading space (which needs to be escaped) and a trailing slash:

\ ZyBorg/
\ discobot/

Otherwise there is no choice but to fall back to regex, though it will be more expensive:

-m reg -f /etc/haproxy/agentblock.regex

Matching:

\WZyBorg\W
\Wdiscobot\W

georgeg · June 13, 2024, 11:54am

Tried the below with the following config

\ ZyBorg/
\ discobot/

hdr_sub(user-agent) -m reg -f filename ← Gives 403 only when the user-agent includes the slash and backslash.
req.fhdr(user-agent) -m reg -f filename ← Same as above.
hdr_sub(user-agent) -f filename ← Same as above.

I also tried

\WZyBorg\W
\Wdiscobot\W

With
hdr(user-agent) -m reg -f filename
and
hdr_reg(user-agent) -f filename

Thank you for the suggestions.

adarragon · June 13, 2024, 1:16pm

It doesn’t make sense to combine an hdr_<match-method> alias with -m <match-method>.

Indeed, hdr_sub(name) is already an alias for hdr(name) -m sub

So if you do this:

hdr_sub(user-agent) -m reg -f filename

It ends up being the same as if you used:

hdr(user-agent) -m sub -m reg -f filename

And since HAProxy only considers the last -m, it won’t work as expected.

Now, to complete Lukas’s answer:

For his first suggestion, escaping the space with a backslash only works for patterns provided directly in the config file, not in dedicated pattern/ACL files loaded using -f. Character escaping is handled at the HAProxy config parser level, while -f relies on the ACL/pattern file parser, which is pretty basic and doesn’t offer advanced features such as escaping. And since leading spaces are ignored when loading patterns from a file using -f (as per the documentation), this cannot work for you.
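
The effect can be sketched with a simplified Python model of a pattern-file loader. This is an assumption-laden illustration, not HAProxy’s actual parser: it only models skipping comment lines, stripping leading spaces and tabs, ignoring empty lines, and doing no escape processing (load_patterns is a hypothetical name):

```python
def load_patterns(lines):
    """Model of a basic pattern-file loader: no escaping, leading blanks dropped."""
    patterns = []
    for line in lines:
        if line.startswith("#"):
            continue  # comment line
        # Leading spaces/tabs are stripped; the backslash is kept literally.
        pattern = line.lstrip(" \t").rstrip("\r\n")
        if pattern:
            patterns.append(pattern)
    return patterns

# First entry: the leading spaces are lost, so the pattern no longer starts with a space.
# Second entry: the line starts with a backslash, so nothing is stripped and the
# backslash itself becomes part of the pattern.
print(load_patterns(["  ZyBorg/\n", "\\ ZyBorg/\n"]))
```

In this model the second pattern only matches a header that literally contains a backslash followed by a space, which is consistent with the 403 only appearing when the user-agent included the backslash.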

However, with the second suggestion (involving regex), it should work for your use case:

hdr(user-agent) -m reg -i -f filename

filename:

\WZyBorg\W
\Wdiscobot\W

I tested the regex example provided by Lukas against the full user-agent examples from your previous post, and it seems to do the job. If not, you may want to adjust the regular expressions to better fit your needs, for instance using an online regex tester.
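
A quick way to try the two expressions outside HAProxy is a small Python check. This is only a sketch: Python’s re module stands in for whichever regex library HAProxy was built with, re.IGNORECASE plays the role of -i, and the last two sample user-agents are hypothetical additions:

```python
import re

PATTERNS = [r"\WZyBorg\W", r"\Wdiscobot\W"]
COMPILED = [re.compile(p, re.IGNORECASE) for p in PATTERNS]

def is_blocked(user_agent):
    """True when any pattern is found anywhere in the user-agent."""
    return any(rx.search(user_agent) for rx in COMPILED)

SAMPLES = [
    "Mozilla/4.0 compatible ZyBorg/1.0 (wn-14.zyborg@looksmart.net; http://www.WISEnutbot.com)",
    "Mozilla/5.0 (compatible; discobot/1.0; +http://discoveryengine.com/discobot.html)",
    "Mozilla/5.0 (compatible; Discob/1.0)",  # hypothetical partial name
    "ZyBorg/1.0",                            # hypothetical: bot name at the very start
]

for ua in SAMPLES:
    print(is_blocked(ua), ua)
```

One caveat this makes visible: \W has to consume a character on each side, so a bot name at the very start or very end of the header is not matched. If that case matters, something like (^|\W)ZyBorg(\W|$) may be worth testing, assuming the regex library in use supports it.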

georgeg · June 17, 2024, 12:32pm

Hello!

Indeed this seems to resolve the issue!
According to our initial tests, we don’t get false 403s with user-agents including “Discob”, “Discobo”, etc., and it works properly with the user-agents in their complete form.

Thank you all for your help!
