• jagged_circle@feddit.nl · 3 months ago

    This is fine. I support archiving the Internet.

    It kinda drives me crazy how normalized anti-scraping rhetoric is. There is nothing wrong with (rate limited) scraping

    The only bots we need to worry about are the ones that POST, not the ones that GET
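    For illustration, this is roughly what a "polite" rate-limited GET bot looks like — a minimal sketch, where the user-agent string, contact address, and one-request-per-second delay are all made-up examples, not any real crawler's config:

```python
import time
import urllib.request

# Illustrative values only -- not a real bot.
USER_AGENT = "example-archive-bot/1.0 (contact: admin@example.org)"
DELAY_SECONDS = 1.0  # at most one request per second

def build_request(url: str) -> urllib.request.Request:
    # Identify the bot honestly so admins can rate limit or contact us.
    return urllib.request.Request(url, headers={"User-Agent": USER_AGENT})

def polite_get(url: str) -> bytes:
    with urllib.request.urlopen(build_request(url)) as resp:
        body = resp.read()
    time.sleep(DELAY_SECONDS)  # back off before the next request
    return body
```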

    • zod000@lemmy.ml · 3 months ago

      Bullshit. This bot doesn’t identify itself as a bot and doesn’t rate limit itself to anything close to an appropriate amount. We were seeing more traffic from this thing than from all other crawlers combined.

      • jagged_circle@feddit.nl · 3 months ago

        Not rate limiting is bad. Hate them because of that, not because they’re a bot.

        Some bots are nice

        • Zangoose@lemmy.world · 3 months ago

          Even if they were rate limiting, they’re still just using the bot to train an AI. If it’s from a company, there’s a 99% chance the bot is bad. I’m leaving 1% for whatever the Internet Archive (are they even a company tho?) is doing.

        • zod000@lemmy.ml · 2 months ago

          I don’t hate all bots, I hate this bot specifically because:

          • they intentionally hide that they are a bot to evade our, and everyone else’s, methods of restricting which bots we allow and how much activity we allow.
          • they do not respect robots.txt
          • the already mentioned lack of rate limiting
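          Respecting robots.txt is nearly free in most languages. A sketch with Python’s stdlib parser — the rules and bot name here are invented, and a real crawler would fetch the site’s actual /robots.txt:

```python
from urllib import robotparser

# Hypothetical robots.txt rules; a real crawler would fetch /robots.txt.
rp = robotparser.RobotFileParser()
rp.parse([
    "User-agent: *",
    "Disallow: /search",
])

# Anyone honoring the file skips the disallowed endpoint:
assert rp.can_fetch("somebot", "https://example.org/posts/1")
assert not rp.can_fetch("somebot", "https://example.org/search?q=x")
```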
    • WhyJiffie@sh.itjust.works · 2 months ago

      This is neither archiving nor rate limited, if the AI-training purpose and scraping 25 times faster than a large company didn’t already make that obvious.

      • tempest@lemmy.ca · 2 months ago

        The type of request is not relevant; it’s the cost of the request that’s the issue. We long ago stopped serving only static HTML documents that can be cached. Tons of requests trigger complex searches or computations that are expensive server-side. This kind of behavior ruins the internet and pushes everything into walled gardens and behind logins.

        • Olgratin_Magmatoe@lemmy.world · 2 months ago

          It has nothing to do with a sysadmin. It’s impossible for a given request to require zero processing power. Therefore there will always be an upper limit to how many GET requests can be handled, even if each request only needs a small amount of processing power.

          For a business it’s probably not a big deal, but if it’s a self hosted site it quickly can become a problem.

          • jagged_circle@feddit.nl · 2 months ago

            Caches can be configured locally to use near-zero processing power, or moved to the last mile so that your hardware does zero processing.
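            “Moving the cache to the last mile” usually means HTTP caching headers, so browsers and CDNs re-serve static responses without touching the origin at all. A minimal sketch — the max-age value and function name are arbitrary examples:

```python
# Sketch: response headers that let clients/CDNs cache a static GET.
def static_cache_headers(max_age: int = 86400) -> dict[str, str]:
    return {
        "Cache-Control": f"public, max-age={max_age}",  # cacheable downstream
        "Vary": "Accept-Encoding",  # cache compressed variants separately
    }

assert static_cache_headers()["Cache-Control"] == "public, max-age=86400"
```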

              • jagged_circle@feddit.nl · 2 months ago

                Right, that’s why I said you should fire your sysadmin if they aren’t caching, or can’t get the cache down to zero load, for static content served over simple GET requests.