A tiny mouse, a hacker.

See here for an introduction, and my link tree for socials.

  • 0 Posts
  • 7 Comments
Joined 1 year ago
Cake day: December 24th, 2023


  • you disallow access to your website

    I do. Any legit visitor is free to roam around. I keep the baddies away, much like a firewall would. You do use a firewall, right?

    when the user agent is a little unusual

    Nope. I disallow them when the user agent is very obviously fake. No one in 2025 is going to browse the web with “Firefox 3.8pre5”, or “Mozilla/4.0”, or a decade-old Opera, or Microsoft Internet Explorer 5.0. None of those would be able to connect anyway, because they do not support the modern TLS ciphers my sites require. The rest are similarly unrealistic.

    nepenthes. make them regret it

    What do you think happens when a bad agent is caught by my rules? They end up in an infinite maze of garbage, much like the one generated by nepenthes. I use my own generator (iocaine), for reasons, but it is very similar to nepenthes. But… I’m puzzled now. Just a few lines above, you argued that I am disallowing access to my website, and now you’re telling me to serve them an infinite maze of garbage instead?

    That is precisely what I am doing.

    By the way, nepenthes/iocaine/etc. alone does not do jack shit against these sketchy agents. I can guide them into the maze, but as long as they can access content outside of it, they’ll keep bombarding my backend, and they’ll keep training on my work. There are two ways to stop them: passive identification, like my sketchy-agent ruleset (a rough sketch of the idea is below), or proof-of-work solutions like Anubis. Anubis has the huge downside that it is very disruptive to legit visitors. So I’m choosing the lesser evil.
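    To illustrate the passive-identification idea, here is a minimal Python sketch — not my actual ruleset, just made-up patterns and endpoint names:

    ```python
    import re

    # Made-up patterns, for illustration only: user agents no real person
    # sends in 2025 -- ancient Firefox pre-releases, a bare "Mozilla/4.0",
    # decade-old Opera, Internet Explorer 5/6 era strings, and so on.
    OBVIOUSLY_FAKE = [
        re.compile(r"Firefox[ /]3\.\d"),  # e.g. "Firefox 3.8pre5"
        re.compile(r"^Mozilla/4\.0$"),    # bare, ancient Mozilla token
        re.compile(r"Opera/9\."),         # decade-old Opera
        re.compile(r"MSIE [1-6]\."),      # IE 5/6 era
    ]

    def is_sketchy(user_agent: str) -> bool:
        """Passive identification: flag user agents no human uses anymore."""
        return any(p.search(user_agent) for p in OBVIOUSLY_FAKE)

    def route(user_agent: str) -> str:
        # Sketchy agents are sent into the garbage maze; everyone else gets
        # the real backend. The paths here are purely hypothetical.
        return "/maze/" if is_sketchy(user_agent) else "/real-content/"

    if __name__ == "__main__":
        ff139 = "Mozilla/5.0 (X11; Linux x86_64; rv:139.0) Gecko/20100101 Firefox/139.0"
        print(route(ff139))          # /real-content/
        print(route("Mozilla/4.0"))  # /maze/
    ```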


  • This feature will fetch the page and summarize it locally. It’s not being used for training LLMs.

    And what do you think the local model is trained on?

    It’s practically like the user opened your website manually and skimmed the content

    It is not. A human visitor will skim through and pick out the parts they’re interested in. A human visitor has intelligence. An AI model does not. An AI model has absolutely no clue what the user is looking for, and it is entirely possible (and frequent) that it discards the important bits and dreams up some bullshit. Yes, even local ones. Yes, I tried, on my own sites. It was bad.

    It has value to a lot of people including me so it’s not garbage.

    If it does, please don’t come anywhere near my stuff. I don’t share my work only for an AI to throw away half of it and summarize it badly.

    But if you make it garbage intentionally then everyone will just believe your website is garbage and not click the link after reading the summary.

    If people who prefer AI summaries stop visiting, I’ll consider that a win. I write for humans, not for bots. If someone doesn’t like my style, or finds me too verbose, then my content is not for them, simple as that. And that’s ok, too! I have no intention of appealing to everyone.


  • Pray tell, how am I making anyone’s browsing experience worse? I disallow LLM scrapers and AI agents. Human visitors are welcome. You can visit any of my sites with Firefox, even 139 Nightly, and it will Just Work Fine™. It will show garbage if you try to use an AI summary, but AI summaries are garbage anyway, so nothing of value is lost there.

    I’m all for a free and open internet, as long as my visitors act respectfully and don’t DDoS me from a thousand IP addresses while trying to train on my work without respecting its license. The LLM scrapers and AI agents respect neither my work nor its license, so they get a nice dose of garbage (a toy sketch of such a generator is below). Coincidentally, this greatly reduces the load on my backend, so legit visitors can actually access what they seek. In other words, banning LLM scrapers & AI bots improves the experience of my legit visitors, because my backend doesn’t crumble under the load.
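    For the curious, here is a toy sketch of what such a garbage maze does — not iocaine itself, just the concept: every URL under the maze deterministically yields gibberish plus links that lead only deeper in, so there is nothing real to scrape and nothing real to train on.

    ```python
    import hashlib
    import random

    # Toy nepenthes/iocaine-style tarpit page generator, for illustration only.
    WORDS = "lorem ipsum dolor sit amet consectetur adipiscing elit sed do".split()

    def maze_page(path: str, links: int = 5, words: int = 200) -> str:
        # Seed from the path so the same URL always returns the same garbage,
        # which makes the maze look like a large, static site to a crawler.
        rng = random.Random(hashlib.sha256(path.encode()).digest())
        body = " ".join(rng.choice(WORDS) for _ in range(words))
        anchors = " ".join(
            f'<a href="{path.rstrip("/")}/{rng.randrange(10**6)}">more</a>'
            for _ in range(links)
        )
        return f"<html><body><p>{body}</p><p>{anchors}</p></body></html>"

    if __name__ == "__main__":
        print(maze_page("/maze/42")[:120])
    ```

    Real generators are far more elaborate than this, but the principle is the same: a bot that wanders in never finds its way back to real content.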


  • Overboard? Because I disallow AI summaries?

    Or are you referring to my “try to detect sketchy user agents” ruleset? Because that had two false positives in the past two months, yet those rules stop about 2.5 million requests per day, none of which were from a human (I’d know; human visitors have very different access patterns, even when they visit the maze).

    If the bots behaved correctly and respected my robots.txt, I wouldn’t need to fight them. But when they’re DDoSing my sites from literally thousands of IPs, generating millions of requests a day, I will go to extreme lengths to make them go away.
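    For reference, this is the kind of robots.txt a well-behaved crawler is supposed to honor; the user-agent tokens below are examples of documented AI crawler names, and the abusive scrapers ignore the file entirely, which is exactly the problem:

    ```
    # Illustrative robots.txt: asks AI crawlers that honor it to stay away.
    User-agent: GPTBot
    Disallow: /

    User-agent: ClaudeBot
    Disallow: /

    User-agent: CCBot
    Disallow: /

    User-agent: Google-Extended
    Disallow: /
    ```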