How to block bad bots, crawlers & scrapers using a list file

Hi,

I want to block bad bots and crawlers from hitting any of the backend servers. An example bot, taken from the Apache log, is as follows:

HTTP/1.1" 403 539 “-” “Mozilla/5.0 (Windows; U; Windows NT 6.0; en-GB; rv:1.0; trendictionbot0.5.0; trendiction search; http://www.trendiction.de/bot; please let us know of any problems; web at trendiction.com) Gecko/20071127 Firefox/3.0.0.11”

I have this in my haproxy config:
acl badbots hdr_reg(User-Agent) -i -f /etc/haproxy/badbots.lst
tcp-request content reject if badbots

but it doesn't seem to be working, as I still see the requests coming into the Apache log, unless the "403" means that it is in fact getting blocked. But then the requests shouldn't even be there if they are blocked on the HAProxy side. The badbots.lst file contains:
rubrikkgroup\.com
Baiduspider
Sosospider
Sogou
ZumBot
Yandex
trendictionbot0\.5\.0
trendiction\.com
trendiction

I would really appreciate some help if someone knows how to block these 'invading' URLs.
Regards

My definition in the frontend looks like this:

  acl blockedagent hdr_sub(user-agent) -i -f /etc/haproxy/blacklist.agent
  http-request deny if blockedagent

That blocks agents in the given file successfully, where the blacklist file contains one line for each blocked agent in clear text, without escaping special characters.
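For reference, here is a minimal sketch of what such a blacklist file could look like, using the bot names from this thread (the exact entries are up to you). With hdr_sub, each line is matched as a case-insensitive substring of the User-Agent header:

  # /etc/haproxy/blacklist.agent -- one substring per line,
  # matched case-insensitively against the User-Agent header
  Baiduspider
  Sosospider
  Sogou
  ZumBot
  Yandex
  trendiction

Note that with substring matching, a single entry like trendiction already covers both trendictionbot0.5.0 and trendiction.com, and no escaping is needed.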

In your case, is the Apache host the same as the HAProxy host? If they are different, is it possible that the bots are sending their requests to Apache directly, bypassing HAProxy?

Hi Jurgenhaas,
Thank you for your response.
The backend consists of three separate Apache2 servers, all serving the same content. There is no way into the environment other than through HAProxy first. I have tried your change and it seems to be working now. Thank you very much.

I think the difference here is that I had "hdr_reg(User-Agent)" and you have "hdr_sub(user-agent)".
Either the fetch method is what matters, or 'user-agent' is case-sensitive, although when I checked my config with "haproxy -f /etc/haproxy/haproxy.cfg -c" it didn't show any errors. So I would assume that "hdr_sub" is the one that helped.
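For what it's worth, HTTP header names are case-insensitive, so "User-Agent" vs "user-agent" should make no difference. The more likely causes are the match type (hdr_reg treats each line of the file as a regular expression, while hdr_sub looks for a plain substring) and the switch from "tcp-request content reject" to "http-request deny". A sketch of both variants as I understand them (untested; file paths as above):

  # substring match: each line of the file is a plain substring,
  # compared case-insensitively against the User-Agent header
  acl blockedagent hdr_sub(user-agent) -i -f /etc/haproxy/blacklist.agent
  http-request deny if blockedagent

  # regex match: each line of the file is a regular expression,
  # so literal dots must be escaped (e.g. trendiction\.com)
  acl badbots hdr_reg(user-agent) -i -f /etc/haproxy/badbots.lst
  http-request deny if badbots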

Do you know of any way I could prevent new bots from attempting to crawl our site? I find it places load on the 'overall picture', if you know what I mean.

Many thanks

Great that it helped. I haven't investigated the reason for the difference either, but maybe we don't need that right now.

Regarding new bots, I guess we have to keep our blacklists up to date manually. Sometimes I even have to remove an agent from the list, because customers run into issues from time to time when one of their applications sends a user-agent that matches a line in my blacklist. There is probably a recommended list maintained somewhere on the net? Maybe worth a search.
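Instead of removing an entry entirely, one option is to whitelist the customer's application explicitly before the deny rule, since http-request rules are evaluated in order. A sketch, where "CustomerApp" is a made-up placeholder for the customer's actual user-agent string:

  # allow the customer's application through before the blacklist applies
  acl trustedagent hdr_sub(user-agent) -i CustomerApp
  acl blockedagent hdr_sub(user-agent) -i -f /etc/haproxy/blacklist.agent
  http-request allow if trustedagent
  http-request deny if blockedagent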

Awesome, thank you.
Bot lists:

http://www.botreports.com/
http://www.user-agents.org/