Block all datacenters

Hi there,

I noticed right away that my completely obscure little home-hosted private application was swarmed with scraping requests from data centers all over the place.

Every time I looked up one of those spammy IPs, there was some kind of data center behind it. I was wondering if there is a way to filter these out categorically. robots.txt used to be respected by search engines, but it seems AI operators have no ethical code of conduct.

I was thinking along the lines of using lookup services, such as whatismyipaddress.com. The information there tells me at least if a data center is involved. I would like to use this to feed an IP blacklisting mechanism. I already have one in place using a file.

Is this possible with HAProxy? Or are there other ways to lock data centers out?

I tried to set up a little script that would have used whatismyipaddress.com to detect data centers, but unfortunately they block scripted access.

Is there an alternative approach someone is aware of?

With HAProxy you can block based on static lists or map files, as you already know.

Where you get that information from is a more difficult question to answer.

A good geo IP provider like ipinfo.io can provide this information, but not for free. Even then, AI crawlers are starting to hide behind residential proxies, so static lists of subnets will become less effective by the day, because there is a trillion-dollar industry in dire need of bypassing those protections.
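
To illustrate what matching against a static subnet list amounts to, here is a minimal Python sketch. The ranges are documentation prefixes (RFC 5737) used as placeholders, not real data center ranges, and the name `is_datacenter_ip` is made up for this example. In practice the list would come from whatever provider you trust.

```python
import ipaddress

# Placeholder ranges (RFC 5737 documentation prefixes), NOT real data center subnets.
DATACENTER_RANGES = ["192.0.2.0/24", "198.51.100.0/24"]


def is_datacenter_ip(ip, ranges=DATACENTER_RANGES):
    """Return True if `ip` falls inside any of the given CIDR ranges."""
    addr = ipaddress.ip_address(ip)  # raises ValueError on malformed input
    networks = (ipaddress.ip_network(r) for r in ranges)
    # An IPv4 address is never "in" an IPv6 network (and vice versa),
    # so mixed lists are safe to pass here.
    return any(addr in net for net in networks)


if __name__ == "__main__":
    for candidate in ("192.0.2.17", "203.0.113.5"):
        print(candidate, is_datacenter_ip(candidate))
```

HAProxy does the same kind of matching itself when an ACL is loaded from a file of addresses and subnets, so a script like this is only useful for deciding what goes into that file.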

The most efficient and cheapest method to avoid having your HTTPS (port 443) services automatically discovered by AI bots (which monitor certificate transparency logs) is to use a wildcard certificate and a non-public subdomain, like:

Your domain: example.org
Your internal service name: secretservice01.example.org

You need a paid domain for that, because you need to use the ACME DNS-01 challenge.

If you think this does not suffice, you can use a JavaScript challenge (this is also used by the Debian bugtracker iirc):

But there is no perfect, free, fast and accurate solution to this problem.

I did a dirty thing. I created a Python script that uses the Chromium webdriver to get the information I need. Now I scrape the logs for IP addresses and check each one. If I find it is a data center, I add it to the blacklist. After each pass, I restart HAProxy.
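
For anyone wanting to do something similar, the log-scraping half could look roughly like the sketch below. It is not the actual script. The function names are invented, it only handles IPv4, and it assumes the client address appears in plain text in each log line. Note that it picks up every IPv4 address in a line, so backend addresses would be collected too if your log format includes them.

```python
import ipaddress
import re

# Loose pattern for IPv4-looking tokens; real validation happens below.
IPV4_CANDIDATE = re.compile(r"\b(?:\d{1,3}\.){3}\d{1,3}\b")


def extract_ips(log_lines):
    """Collect the unique, valid IPv4 addresses found in the log lines."""
    found = set()
    for line in log_lines:
        for candidate in IPV4_CANDIDATE.findall(line):
            try:
                found.add(str(ipaddress.ip_address(candidate)))
            except ValueError:
                continue  # e.g. 999.1.1.1 matches the regex but is not an IP
    return found


def new_ips_to_check(log_lines, blacklist_lines):
    """Return addresses seen in the logs that are not yet blacklisted."""
    already_listed = {entry.strip() for entry in blacklist_lines if entry.strip()}
    return sorted(extract_ips(log_lines) - already_listed)
```

Skipping addresses that are already on the list keeps the number of lookups per pass small, which matters when each lookup goes through a browser.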

I intend to put this into a cron job. This is not optimal, though.

You can add new IP addresses to both the list file and the running HAProxy instance by using the admin socket.

That way you avoid having to reload every time, and if you reload or restart for another reason, you still have an up-to-date file.
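
A rough Python sketch of that idea, using the runtime API command `add acl`. It assumes the ACL is loaded in the configuration with `-f` from the same file, so that the file path identifies it on the socket, and that the stats socket is configured with `level admin`. The paths and function names are placeholders, not something from your setup.

```python
import ipaddress
import socket


def build_add_acl_command(acl_file, ip):
    """Build the runtime API command; validates the address first."""
    addr = ipaddress.ip_address(ip)  # raises ValueError on garbage input
    return f"add acl {acl_file} {addr}\n"


def append_to_list_file(acl_file, ip):
    """Persist the entry so it survives the next reload or restart."""
    with open(acl_file, "a", encoding="ascii") as fh:
        fh.write(f"{ipaddress.ip_address(ip)}\n")


def send_to_haproxy(command, socket_path):
    """Send one command to the admin socket and return HAProxy's reply."""
    with socket.socket(socket.AF_UNIX, socket.SOCK_STREAM) as sock:
        sock.connect(socket_path)
        sock.sendall(command.encode("ascii"))
        chunks = []
        while True:
            # In non-interactive mode HAProxy closes the connection after replying.
            data = sock.recv(4096)
            if not data:
                break
            chunks.append(data)
    return b"".join(chunks).decode("ascii", "replace")


def block_ip(ip, acl_file, socket_path):
    """Add the address to the file and to the running instance."""
    append_to_list_file(acl_file, ip)
    return send_to_haproxy(build_add_acl_command(acl_file, ip), socket_path)


if __name__ == "__main__":
    # Hypothetical paths; adjust to your own configuration.
    print(block_ip("203.0.113.7", "/etc/haproxy/blacklist.lst", "/run/haproxy/admin.sock"))
```

If the blacklist is a map rather than an ACL, the corresponding runtime command is `add map` instead.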