Cheapskate's Guide: Nuking web-scraping bots

klu9@lemmy.ca · 4 days ago

Jason2357@lemmy.ca · 3 days ago

This is signal detection theory combined with an arms race that keeps the problem hard. You cannot block scrapers without blocking people, and you cannot inconvenience bots without also inconveniencing readers. You might figure something clever out temporarily, but eventually this truism will resurface. Excuse me while I solve a few more captchas.

Tobberone@lemm.ee · 3 days ago

The internet as we know it is dead; we just need a few more years to realise it. And I’m afraid that telecommunications will go the same way, when no one can trust that anyone is who they say they are anymore.

irmadlad@lemmy.world · 3 days ago

> Excuse me while I solve a few more captchas.

Buster for captcha.

野麦さん@lemmy.dbzer0.com · 3 days ago

Time to start hosting Trojans on your website

Stubb@lemmy.sdf.org · 3 days ago

I’ve found that many of these solutions/hacks block legitimate users on the Tor Browser, as well as the Internet Archive’s scrapers. That may be a dealbreaker for some, but it’s probably acceptable for most users and website owners.

F04118F@feddit.nl · 4 days ago

Interesting approach, but it looks like this ultimately ends up:

- being a lot of babysitting / manual work
- blocking a lot of humans
- not being robust against scrapers

Anubis seems like a much better option, for those wanting to block bots without relying on Cloudflare:

https://anubis.techaro.lol/

drkt@lemmy.dbzer0.com · 4 days ago

I have plenty of spare bandwidth and babysitting-resources, so my approach is largely to waste their time. If they poke my honeypot they get poked back and have to escape a tarpit specifically designed to waste their bandwidth above all. It costs me nothing because of my circumstances, but I know it costs them because their connections are metered. I also know it works because they largely stop crawling the domains I employ this on. I am essentially making my domains appear hostile.
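For anyone wondering what a bandwidth-wasting tarpit might look like, here is a minimal sketch. It is not drkt's actual setup, which isn't described in the thread. The function name `tarpit_stream` and the chunk size are made up for illustration, and it assumes you hand the chunks to whatever web server streams the response. Random bytes are used because they don't compress, so the client pays for every byte.

```python
import os
from typing import Iterator


def tarpit_stream(total_bytes: int, chunk_size: int = 64 * 1024) -> Iterator[bytes]:
    """Yield incompressible junk until total_bytes have been produced.

    Hypothetical helper: a real tarpit would stream these chunks as the
    HTTP response body for any client that touched the honeypot URL.
    """
    sent = 0
    while sent < total_bytes:
        size = min(chunk_size, total_bytes - sent)
        yield os.urandom(size)  # random data defeats gzip/brotli savings
        sent += size


if __name__ == "__main__":
    wasted = sum(len(chunk) for chunk in tarpit_stream(1_000_000))
    print(f"served {wasted} junk bytes")
```

Note that this costs the host exactly as much bandwidth as it costs the bot, so it only makes sense in circumstances like the ones described above, where your own bandwidth is effectively free.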

It does mean that my residential IP ends up on various blocklists but I’m just at a point in my life where I don’t give an unwiped asshole about it. I can’t access your site? I’m not going to your site, then. Fuck you. I’m not even gonna email you about the false-positive.

It is also fun to keep a log of which IPs that have poked the honeypot have open ports, and to automate a process of siphoning information out of those ports. I’ve been finding a lot of hacked NVRs recently that I think are part of some IoT botnet used to scrape the internet.

melroy@kbin.melroy.org · 4 days ago

I found a very large botnet, mainly in Brazil but also in several other countries. And abuseipdb.com is not marking those IPs as a threat. We need a better solution.

I think a honeypot is a good way. Another way is to use proof of work on the client side. Or we need a better place to share all the stupid web-scraping bot IPs.
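As a rough illustration of the proof-of-work idea, here is a hashcash-style sketch. It is not how any particular tool implements it. The function names, the `challenge:nonce` format and the difficulty value are assumptions for the example. The server hands out a random challenge, the client burns CPU finding a nonce, and the server checks the answer with a single hash.

```python
import hashlib
import itertools


def leading_zero_bits(digest: bytes) -> int:
    """Count how many zero bits the digest starts with."""
    bits = 0
    for byte in digest:
        if byte == 0:
            bits += 8
            continue
        bits += 8 - byte.bit_length()
        break
    return bits


def verify(challenge: str, nonce: int, difficulty: int) -> bool:
    """Server side: one cheap hash to check the client's work."""
    digest = hashlib.sha256(f"{challenge}:{nonce}".encode()).digest()
    return leading_zero_bits(digest) >= difficulty


def solve(challenge: str, difficulty: int) -> int:
    """Client side: brute-force nonces until one passes verification."""
    for nonce in itertools.count():
        if verify(challenge, nonce, difficulty):
            return nonce


if __name__ == "__main__":
    nonce = solve("example-challenge", 16)  # roughly 2**16 hashes on average
    print(nonce, verify("example-challenge", nonce, 16))
```

The asymmetry is the point: each extra bit of difficulty doubles the client's expected work, while verification stays at one hash. A single reader barely notices, but a scraper fetching millions of pages pays for every one.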

drkt@lemmy.dbzer0.com · 4 days ago

I love the idea of abuseipdb and I even contributed to it briefly. Unfortunately, even as a contributor, I don’t get enough API resources to actually use it for my own purposes without having to pay. I think the problem is simply that if you created a good enough database of abusive IPs, you’d be overwhelmed by traffic from people trying to pull that data out.

melroy@kbin.melroy.org · 4 days ago

Not really… We do have this wonderful list(s): https://github.com/firehol/blocklist-ipsets

My firewall uses, for example, the Spamhaus DROP list from that repo: https://raw.githubusercontent.com/firehol/blocklist-ipsets/refs/heads/master/spamhaus_drop.netset

So I know it’s possible. Hosting in a git repo like that scales well, since tons of people are already using it that way.
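If you want to use such a list outside a firewall, a small sketch of checking an address against a netset could look like this. It assumes the file has one IP or CIDR range per line with `#` comments. The function names are made up, and the sample ranges below are documentation addresses, not real list entries.

```python
import ipaddress


def parse_netset(text: str) -> list:
    """Parse netset text into network objects, skipping comments and blanks."""
    networks = []
    for line in text.splitlines():
        entry = line.split("#", 1)[0].strip()
        if entry:
            networks.append(ipaddress.ip_network(entry, strict=False))
    return networks


def is_blocked(ip: str, networks: list) -> bool:
    """Return True if the address falls inside any listed network."""
    addr = ipaddress.ip_address(ip)
    return any(addr in net for net in networks)


if __name__ == "__main__":
    sample = """
    # sample data for illustration only
    192.0.2.0/24
    198.51.100.0/25
    """
    nets = parse_netset(sample)
    print(is_blocked("192.0.2.55", nets))
    print(is_blocked("203.0.113.9", nets))
```

For big lists a linear scan per request gets slow. A firewall ipset, as in the setup above, does the lookup far more efficiently.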

paraphrand@lemmy.world · 4 days ago

That last bit looks like something you should send off to a place like 404 media.

drkt@lemmy.dbzer0.com · 4 days ago

I wouldn’t even know where to begin, but I also don’t think that what I’m doing is anything special. These NVR IPs are hurling abuse at the whole internet. Anyone listening will have seen them, and anyone paying attention would’ve seen the pattern.

The NVRs I get the most traffic from have been known hacked IoT devices for a decade, and there’s even a GitHub page explaining how to bypass their authentication and pull out arbitrary files like passwd.

oyzmo@lemmy.world · 4 days ago

Thanks, great site! 😊

klu9@lemmy.ca · 2 days ago

You’re welcome.

I believe I found it originally via the “distribuverse”… specifically, ZeroNet.

Nuking the Corporate Web's Web-Scraping Robots