How to block bad bots, crawlers & scrapers using a list file


#1

Hi,

I want to block bad bots and crawlers from hitting any backend servers. An example bot, taken from the Apache log, is as follows:

HTTP/1.1" 403 539 "-" "Mozilla/5.0 (Windows; U; Windows NT 6.0; en-GB; rv:1.0; trendictionbot0.5.0; trendiction search; http://www.trendiction.de/bot; please let us know of any problems; web at trendiction.com) Gecko/20071127 Firefox/3.0.0.11"

I have this in my haproxy config:
acl badbots hdr_reg(User-Agent) -i -f /etc/haproxy/badbots.lst
tcp-request content reject if badbots

but it doesn't seem to be working, as I still see the requests in the Apache log, unless the "403" means they are in fact getting blocked. But then they shouldn't show up there at all if they are blocked on the HAProxy side. The badbots.lst file contains:
rubrikkgroup\ .com
Baiduspider
Sosospider
Sogou
ZumBot
Yandex
trendictionbot0\ .5\ .0
trendiction\ .com
trendiction
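
(For reference: hdr_reg treats each line of the file as a regular expression, so entries with an escaped space such as "trendiction\ .com" would only match a User-Agent containing a literal space before the dot. A variant of the setup above that avoids that pitfall, assuming an HTTP-mode frontend, might look like this; the file entries shown are illustrative:)

  # each line of badbots.lst is a regular expression here,
  # so dots should be escaped as "\." with no space in between,
  # e.g. "trendiction\.com" rather than "trendiction\ .com"
  acl badbots hdr_reg(User-Agent) -i -f /etc/haproxy/badbots.lst
  http-request deny if badbots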

I would really appreciate some help if someone knows how to block these 'invading' URLs.
Regards


#2

My definition in the frontend looks like this:

  acl blockedagent hdr_sub(user-agent) -i -f /etc/haproxy/blacklist.agent
  http-request deny if blockedagent

That blocks agents in the given file successfully, where the blacklist file contains one line for each blocked agent in clear text, without escaping special characters.
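
For illustration, using the agent names already mentioned in this thread, such a file might simply contain plain substrings, one per line:

  Baiduspider
  Sogou
  Yandex
  trendiction

Since hdr_sub does substring matching, no regex escaping is needed, and "trendiction" alone already covers both the bot token and the domain.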

In your case, is the Apache host the same as the HaProxy host? If they are different, is it possible that the bots are sending their requests to the Apache directly, bypassing HaProxy?


#3

Hi Jurgenhaas,
Thank you for your response.
The backend consists of three separate apache2 servers, all serving the same content. There is no way into the environment other than via HAProxy first. I have tried your change and it seems to be working now. Thank you very much.

I think the difference is that I had "hdr_reg(User-Agent)" where you have "hdr_sub(user-agent)". Either the fetch method was the problem or 'user-agent' is case-sensitive, although when I checked my config with "haproxy -f /etc/haproxy/haproxy.cfg -c" it didn't show any errors. So I would assume that "hdr_sub" is the one that helped.

Do you know of any way I could prevent new bots from attempting to crawl our site? I find it adds load to the 'overall picture', if you know what I mean.

Many thanks


#4

Great it helped. I haven't investigated the reason for the difference either, but maybe we don't need that right now.

Regarding new bots, I guess we have to keep our blacklists up to date manually. Sometimes I even have to remove an agent from the list, because customers occasionally have issues when one of their applications sends a user-agent that matches a line in my blacklist. There is probably a recommended list maintained somewhere on the net? Maybe worth a search.
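
Beyond maintaining the blacklist, one way to at least limit the load caused by new, unknown bots is request rate limiting with a stick table. A minimal sketch, assuming an HTTP-mode frontend (the frontend name and thresholds here are illustrative assumptions, not from this thread):

  frontend web
      # track per-source-IP HTTP request rate over a 10s window
      stick-table type ip size 100k expire 30s store http_req_rate(10s)
      http-request track-sc0 src
      # deny clients exceeding 100 requests per 10 seconds
      http-request deny if { sc_http_req_rate(0) gt 100 }

This catches aggressive crawlers regardless of their user-agent string, while the user-agent blacklist still handles the known, well-identified bots.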


#5

Awesome, thank you.
Bot lists:

http://www.botreports.com/
http://www.user-agents.org/