I don't really like bad bots, and by that I mean crawlers that ignore robots.txt. The reason is simple: I don't want my data fed into obscure systems. It's also a matter of principle: if we give you rules, follow them.

Credit where it's due: the idea came from Caolan's website.

The idea is simple: make the bad bots follow a link they aren't supposed to, then ban them. To do that, I added a robots.txt at the root of my site that explicitly disallows robots from a specific page (I went with /roboty/, because why not):

User-agent: *
Disallow: /roboty/
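
To make the trap's logic concrete, here is a minimal sketch of the check a well-behaved crawler is expected to run before fetching a URL. The function name is_disallowed is hypothetical, and the parsing is deliberately simplified: it only looks at groups for User-agent: * and does plain prefix matching, with no Allow rules, wildcards, or comment handling.

```php
<?php
// Simplified sketch, not a full robots.txt parser: only "User-agent: *"
// groups and prefix matching on Disallow lines are handled.
function is_disallowed(string $robotsTxt, string $path): bool
{
    $applies = false;   // does the current group apply to every bot?
    $inAgents = false;  // was the previous rule line a User-agent line?

    foreach (preg_split('/\R/', $robotsTxt) as $line) {
        $parts = explode(':', $line, 2);
        if (count($parts) !== 2) {
            continue; // blank or malformed line
        }
        $field = strtolower(trim($parts[0]));
        $value = trim($parts[1]);

        if ($field === 'user-agent') {
            // Consecutive User-agent lines belong to the same group.
            $applies = ($inAgents && $applies) || $value === '*';
            $inAgents = true;
            continue;
        }
        $inAgents = false;

        if ($field === 'disallow' && $applies && $value !== ''
            && strncmp($path, $value, strlen($value)) === 0) {
            return true;
        }
    }
    return false;
}

$robots = "User-agent: *\nDisallow: /roboty/\n";
var_dump(is_disallowed($robots, '/roboty/')); // bool(true)
var_dump(is_disallowed($robots, '/'));        // bool(false)
```

A crawler that runs this kind of check never requests /roboty/, so anything that shows up there has either ignored the file or is a curious human.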

Then I slipped a link to that page somewhere on the root page.

[Image: link]

Since I don't want curious humans getting instantly banned, the page itself just explains what's going on and links to article.php, the actual dangerous script. I named it that way to get past possible keyword blacklists that skip URLs containing words like ban or ban-ip. ¯\_(ツ)_/¯

Speaking of the script, here it is:

<?php

// Cloudflare credentials (redacted)
$cf_api_token = '...';
$zone_id = '...';
$note = 'Auto banned by dtech/roboty at ' . date("H:i d/m/y");
// Visitor IP. Behind Cloudflare this is the real client IP only if the server
// restores it (e.g. mod_remoteip); otherwise it is a Cloudflare edge address.
$ip = $_SERVER['REMOTE_ADDR'];

// IP Access Rule: block this single address
$payload = json_encode([
    'mode' => 'block',
    'configuration' => [
        'target' => 'ip',
        'value' => $ip,
    ],
    'notes' => $note,
]);

// POST the rule to the zone's access rules endpoint
$ch = curl_init("https://api.cloudflare.com/client/v4/zones/{$zone_id}/firewall/access_rules/rules");
curl_setopt_array($ch, [
    CURLOPT_RETURNTRANSFER => true,
    CURLOPT_POST => true,
    CURLOPT_POSTFIELDS => $payload,
    CURLOPT_IPRESOLVE => CURL_IPRESOLVE_V4,
    CURLOPT_HTTPHEADER => [
        "Authorization: Bearer {$cf_api_token}",
        "Content-Type: application/json",
    ],
]);

// Decoded API response; kept for debugging, not checked here
$response = json_decode(curl_exec($ch), true);
curl_close($ch);

header("Location: /?blehhhhh"); // redirect to '/', should be blocked
echo "Bye ;)";
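
The script trusts the address it receives and never looks at the API's answer. As a hedged sketch of how that could be tightened, the two helpers below validate the IP before building the payload and check the success field that Cloudflare API responses carry. Both function names are hypothetical, and the ip6 target for IPv6 addresses is my reading of the IP Access Rules documentation rather than something the original script uses.

```php
<?php
// Hypothetical helper: build the JSON body for an IP Access Rule,
// or return null if $ip is not a valid IPv4/IPv6 address.
function build_block_payload(string $ip, string $note): ?string
{
    if (filter_var($ip, FILTER_VALIDATE_IP, FILTER_FLAG_IPV4) !== false) {
        $target = 'ip';
    } elseif (filter_var($ip, FILTER_VALIDATE_IP, FILTER_FLAG_IPV6) !== false) {
        $target = 'ip6'; // assumption: target name for IPv6 addresses
    } else {
        return null; // refuse to send garbage to the API
    }

    return json_encode([
        'mode' => 'block',
        'configuration' => [
            'target' => $target,
            'value' => $ip,
        ],
        'notes' => $note,
    ]);
}

// Hypothetical helper: true only if the decoded API response reports success.
function ban_succeeded($response): bool
{
    return is_array($response) && ($response['success'] ?? false) === true;
}
```

With these in place, the script could skip the API call when build_block_payload() returns null and log the response whenever ban_succeeded() is false.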

Right now it only bans the bot's IP on douxx.tech (proxied through Cloudflare), but I plan to eventually integrate it into an internal API that blocks the IP across every domain I own, and maybe throw in some iptables rules too.
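
As a rough sketch of what that extension might look like, the helpers below build one access-rules endpoint per zone and an iptables command for a given address. Everything here is hypothetical (the function names, the list of zone IDs, the choice of the INPUT chain), and the functions only build strings; nothing is sent or executed.

```php
<?php
// Hypothetical helper: one access-rules endpoint per Cloudflare zone ID,
// using the same URL pattern as the script above.
function access_rule_endpoints(array $zoneIds): array
{
    $urls = [];
    foreach ($zoneIds as $zoneId) {
        $urls[] = "https://api.cloudflare.com/client/v4/zones/{$zoneId}/firewall/access_rules/rules";
    }
    return $urls;
}

// Hypothetical helper: shell command that drops traffic from $ip,
// or null if $ip is not a valid address. Assumes a Linux host.
function iptables_drop_command(string $ip): ?string
{
    if (filter_var($ip, FILTER_VALIDATE_IP, FILTER_FLAG_IPV4) !== false) {
        $binary = 'iptables';
    } elseif (filter_var($ip, FILTER_VALIDATE_IP, FILTER_FLAG_IPV6) !== false) {
        $binary = 'ip6tables';
    } else {
        return null;
    }
    // escapeshellarg is belt and braces; the address is already validated.
    return $binary . ' -I INPUT -s ' . escapeshellarg($ip) . ' -j DROP';
}
```

One caveat: for sites proxied through Cloudflare, packets reach the server from Cloudflare's addresses, so an iptables rule on the bot's IP would only matter for traffic that hits the server directly.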

So yeah, I'll keep it running for a bit and see how many IPs we get.

For the record, the first one to get banned was an IP from a Tencent datacenter 🤡

[Image: tencent ipban]

[Image: ip info screenshot]

URL: https://dev.to/douxxtech/an-attempt-to-ban-bad-bots-crawling-my-sites-2lhg

Title: An Attempt to Ban Bad Bots Crawling My Sites - DEV Community