• interdimensionalmeme@lemmy.ml · +8/-3 · edited · 5 days ago

    Just provide a full dump.zip plus incremental daily dumps and they won’t have to scrape?
    Isn’t that an obvious solution? I mean, it’s public data, it’s out there, do you want it public or not?
    Do you want it only on OpenAI and Google but nowhere else? If so, then good luck with the piranhas.
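
    For what it’s worth, publishing a full dump plus daily incrementals is cheap to implement. Here’s a rough sketch in Python (hypothetical dump directory, and a toy in-memory data source standing in for the real database):

    ```python
    # Sketch: write a daily incremental dump containing only the records
    # modified since the previous dump run; serve the zips as static files.
    import json, time, zipfile
    from pathlib import Path

    DUMP_DIR = Path("public/dumps")        # hypothetical: exposed as static files
    STATE = DUMP_DIR / "last_dump_ts"      # timestamp of the previous run

    # Toy data source standing in for the real database (assumption).
    ARTICLES = {
        "article-1": {"text": "hello", "modified": 1700000000.0},
        "article-2": {"text": "world", "modified": 1800000000.0},
    }

    def records_modified_since(ts):
        for rec_id, rec in ARTICLES.items():
            if rec["modified"] > ts:
                yield rec_id, rec

    def write_incremental():
        DUMP_DIR.mkdir(parents=True, exist_ok=True)
        since = float(STATE.read_text()) if STATE.exists() else 0.0
        now = time.time()
        name = DUMP_DIR / time.strftime("incremental-%Y%m%d.zip", time.gmtime(now))
        with zipfile.ZipFile(name, "w", zipfile.ZIP_DEFLATED) as zf:
            for rec_id, rec in records_modified_since(since):
                zf.writestr(f"{rec_id}.json", json.dumps(rec))
        STATE.write_text(str(now))

    if __name__ == "__main__":
        write_incremental()   # run from a daily cron job
    ```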

    • dwzap@lemmy.world · +21 · 4 days ago

      The Wikimedia Foundation does just that, and still, their infrastructure is under stress because of AI scrapers.

      Dumps or no dumps, these AI companies don’t care. They feel entitled to take or steal whatever they want.

      • interdimensionalmeme@lemmy.ml · +6 · edited · 4 days ago

        That’s crazy; it makes no sense. It takes as much bandwidth and processing power on the scraper’s side to process and use the data as it takes to serve it.

        They also have an open API that makes scraping entirely unnecessary.
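
        For example, here’s a minimal sketch of pulling clean article text straight from the MediaWiki Action API instead of scraping rendered HTML (the endpoint and parameters are the public ones; exact response fields may vary, and the bot name is made up):

        ```python
        # Sketch: fetch a plain-text extract from the MediaWiki Action API.
        import json
        import urllib.parse
        import urllib.request

        API = "https://en.wikipedia.org/w/api.php"

        def fetch_extract(title):
            params = urllib.parse.urlencode({
                "action": "query",
                "prop": "extracts",
                "explaintext": 1,       # plain text instead of HTML
                "titles": title,
                "format": "json",
            })
            req = urllib.request.Request(
                f"{API}?{params}",
                headers={"User-Agent": "polite-example-bot/0.1 (ops@example.org)"},
            )
            with urllib.request.urlopen(req) as resp:
                data = json.load(resp)
            pages = data["query"]["pages"]
            return next(iter(pages.values())).get("extract", "")

        print(fetch_extract("Web scraping")[:200])
        ```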

        Here are the relevant quotes from the article you posted:

        “Scraping has become so prominent that our outgoing bandwidth has increased by 50% in 2024.”

        “At least 65% of our most expensive requests (the ones that we can’t serve from our caching servers and which are served from the main databases instead) are performed by bots.”

        “Over the past year, we saw a significant increase in the amount of scraper traffic, and also of related site-stability incidents: Site Reliability Engineers have had to enforce on a case-by-case basis rate limiting or banning of crawlers repeatedly to protect our infrastructure.”

        And it’s Wikipedia! The entire dataset is already trained into the models; it’s not like encyclopedic facts change that often to begin with!

        The only thing I can imagine is that it’s part of a larger ecosystem issue: dumps and API access are so rare, and so untrustworthy, across the web that scrapers just scrape everything rather than taking the time to save bandwidth by relying on dumps.

        Maybe it’s a consequence of the 2023 API wars, where it became clear that data repositories would leverage their position as pools of knowledge to extract rent from search and AI, and places like Wikipedia and other wikis and forums are getting hammered as a result of that war.

        If the internet weren’t becoming a warzone, there really wouldn’t be a need for more than one scraper to scrape a site. Even a hostile site like Facebook would only need to be scraped once, and then the data could be shared efficiently over a torrent swarm.

    • 0x0@lemmy.zip · +6 · 4 days ago

      “they won’t have to scrape?”

      They don’t have to scrape, especially if robots.txt tells them not to.
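
      Honouring robots.txt is a handful of lines in any language; here’s a minimal sketch with Python’s standard library (hypothetical bot name and URLs):

      ```python
      # Sketch: a well-behaved crawler checks robots.txt before fetching.
      from urllib.robotparser import RobotFileParser

      rp = RobotFileParser("https://example.org/robots.txt")
      rp.read()  # fetch and parse the site's robots.txt

      url = "https://example.org/some/page"
      if rp.can_fetch("ExampleBot", url):
          print("allowed to fetch", url)
      else:
          print("robots.txt disallows", url)   # a polite bot stops here
      ```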

      “it’s public data, it’s out there, do you want it public or not?”

      Hey, she was wearing a miniskirt, she wanted it, right?

      • interdimensionalmeme@lemmy.ml · +2/-4 · 4 days ago

        No no no, you don’t get to invoke grape imagery to defend copyright.

        I know, it hurts when human shields like Wikipedia and the OpenWrt forums get hit, especially when they hand over the goods in dumps. But behind those human shields stand Facebook, Xitter, Amazon, Reddit and the rest of the big tech garbage, and I want tanks to run through them.

        So go back to the drawing board and find a solution where the tech platform monopolists are made to relinquish our data back to us and the human shields also survive.

        My own mother is a prisoner in the Zuckerberg data hive, and the only way she can get out is brute zucking force into Facebook’s poop chute.

        • 0x0@lemmy.zip · +3 · 4 days ago

          “find a solution where the tech platform monopolists are made to relinquish our data”

          Luigi them.
          Can’t use laws against them anyway…

    • qaz@lemmy.world · +2 · 4 days ago

      I think the issue is that the scrapers are fully automatically collecting text, jumping from link to link like a search engine indexer.
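
      Something like this deliberately naive sketch of the “follow every link” behaviour (standard library only, hypothetical start URL):

      ```python
      # Sketch: an indexer-style crawler that fetches a page, extracts every
      # link, and keeps jumping from link to link until a page budget runs out.
      import re
      import urllib.parse
      import urllib.request
      from collections import deque

      def crawl(start_url, max_pages=10):
          seen, queue = set(), deque([start_url])
          while queue and len(seen) < max_pages:
              url = queue.popleft()
              if url in seen:
                  continue
              seen.add(url)
              try:
                  with urllib.request.urlopen(url, timeout=10) as resp:
                      html = resp.read().decode("utf-8", errors="replace")
              except OSError:
                  continue
              # crude href extraction; a real indexer would parse the HTML properly
              for href in re.findall(r'href="([^"#]+)"', html):
                  link = urllib.parse.urljoin(url, href)
                  if link.startswith(("http://", "https://")):
                      queue.append(link)
          return seen

      print(crawl("https://example.org/"))
      ```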