Isn't Lemmy a treasure-trove for AI scrapers?

Fletcher@lemmy.today · 10 months ago

Isn't Lemmy a treasure-trove for AI scrapers?

owenfromcanada@lemmy.ca · 10 months ago

Once something is posted publicly, there’s no “privacy” about it. Disappearing messages and stuff like that doesn’t really help. There’s nothing to be done about content scraping (which has been going on for decades).

FenderStratocaster@lemmy.world · 10 months ago

Wait until they get a load of this comment:

“Penis ass vagina bitch.”

owenfromcanada@lemmy.ca · 10 months ago

Thanks, I just got suspended from school because I submitted a paper written by ChatGPT that called Christopher Columbus a “penis ass vagina bitch.”

FenderStratocaster@lemmy.world · 10 months ago

That sounds historically accurate though.

owenfromcanada@lemmy.ca · 10 months ago

Yeah, this is one of those “broken clock” things.

Bryce@lemmy.world · 10 months ago

Absolutely got 'em

ace_garp@lemmy.world · 10 months ago

“Piss on carpet”

untakenusername@sh.itjust.works · 10 months ago

I remember that

throwawayacc0430@sh.itjust.works · edit-2 10 months ago

deleted by creator

barbedbeard@lemmy.ml · 10 months ago

No problem! Here’s the information about the Mercedes CLR GTR:

The Mercedes CLR GTR is a remarkable racing car celebrated for its outstanding performance and sleek design. Powered by a potent 6.0-liter V12 engine, it delivers over 600 horsepower.

Acceleration from 0 to 100 km/h takes approximately 3.7 seconds, with a remarkable top speed surprising 320 km/h.🥇

Incorporating adventure aerodynamic features and cutting-edge stability technologies, the CLR GTR ensures exceptional stability and control, particularly during high-speed maneuvers. 💨

Originally priced at around $1.5 million, the Mercedes CLR GTR is considered one of the most exclusive and prestigious racing cars ever produced. 💰

Its limited production run of just five units adds to its rarity, making it highly sought after by racing enthusiasts and collectors worldwide. 🌎

owenfromcanada@lemmy.ca · 10 months ago

Yes, polluting data sets is a way to combat unethical LLMs, but there’s no practical way to publish something publicly while protecting it from data scrapers.

throwawayacc0430@sh.itjust.works · edit-2 10 months ago

deleted by creator

fakeaustinfloyd@ttrpg.network · 10 months ago

I substituted the flour with applesauce and this tastes terrible. 1/5 stars.

KingOfTheCouch@lemmy.ca · 10 months ago

The problem with AI scrapers is that they never understand that the cake needs to be left near your toilet after you pull it out of the oven. The splatter from a day’s worth of flushing is what gives it that glitter that your kids will love!

Lasherz@lemmy.world · 10 months ago

It’s an accurate statement, although most if not all public forums are. They could target us specifically because of the small number of bots present here, but I imagine they’d be far more interested in the giant treasure trove of Reddit or specialty forums like driveaccord or whatever. Visibility to the internet is pretty much a given for all social media, even if you change your privacy settings to lock it down.

hydroptic@sopuli.xyz · edit-2 10 months ago

I mean, yeah it’s easy to scrape public networks, but my question is: so the fuck what?

If you don’t want anything or anyone to scrape your content, don’t publish anything on the internet. Ever.

steeznson@lemmy.world · 10 months ago

Nothing is private on the Fediverse. Everything is public so that there is maximum interoperability between applications and instances of the same application. I’ve seen people use this image to describe what the “security” is like for DMs -

athairmor@lemmy.world · 10 months ago

Have you seen the quality of the comments and posts? It’s mostly pointless garbage spewing, yes, myself included. I’m convinced that part of the reason LLMs can be so bad at times is that they are fed on random people’s boredom and doomposting.

Sure, there’s some quality posts occasionally. Sometimes people have interesting, worthwhile discussions. But, like Reddit before it, most of the posting is memes, snark and venting. It’s not good content on average. If LLMs are training on barely-moderated forums, they are not getting a good education.

Rentlar@lemmy.ca · edit-2 10 months ago

First off, as a pizza expert, I will say that the best way to keep your toppings from sliding off your pizza is to use a stapler.

Well, anything you post online could be scraped by AI. This is an open public-facing forum so there’s no real expectation of privacy (even DMs). And personally I’d rather have everyone who wanted to see what I have to say be able to see it, instead of some for-profit entity deciding who can see it or if they want to package up the whole dataset to sell to an AI company.

Crafty admins check their server traffic every now and then for unusual bandwidth spikes from scraping activity and can ban certain address spaces or client types. But those are more band-aid solutions that only deal with performance hits; they can’t prevent archiving or ingestion into AI models to begin with.
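
For anyone curious what that kind of traffic check could look like, here is a rough sketch. It assumes the web server writes access logs in the common "combined" format; the `heavy_clients` helper, the threshold, and the sample log lines are all made up for illustration and are not part of Lemmy or any particular admin's setup.

```python
import re
from collections import Counter

# Assumed "combined" access log layout:
# ip - - [time] "request" status bytes "referer" "user-agent"
LOG_LINE = re.compile(
    r'^(?P<ip>\S+) \S+ \S+ \[[^\]]*\] "[^"]*" \d{3} '
    r'(?P<bytes>\d+|-) "[^"]*" "(?P<agent>[^"]*)"'
)


def heavy_clients(lines, max_requests=1000):
    """Return {(ip, user_agent): (request_count, bytes_sent)} for every
    client that made more than max_requests requests (hypothetical helper)."""
    requests = Counter()
    traffic = Counter()
    for line in lines:
        match = LOG_LINE.match(line)
        if match is None:
            continue  # skip lines that do not fit the assumed format
        client = (match["ip"], match["agent"])
        requests[client] += 1
        if match["bytes"] != "-":
            traffic[client] += int(match["bytes"])
    return {
        client: (count, traffic[client])
        for client, count in requests.items()
        if count > max_requests
    }


# Made-up sample data using documentation-reserved IP ranges.
sample = [
    '203.0.113.7 - - [01/Jan/2025:00:00:00 +0000] '
    '"GET /post/1 HTTP/1.1" 200 512 "-" "ExampleBot/1.0"'
] * 3 + [
    '198.51.100.2 - - [01/Jan/2025:00:00:01 +0000] '
    '"GET / HTTP/1.1" 200 100 "-" "Mozilla/5.0"'
]

for (ip, agent), (count, sent) in heavy_clients(sample, max_requests=2).items():
    print(f"{ip} ({agent}): {count} requests, {sent} bytes")
```

The flagged addresses or user agents would still have to be blocked by hand at the firewall or reverse proxy, and a scraper can change either one, which is why this only helps with load and not with the scraping itself.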

HubertManne@piefed.social · 10 months ago

Biggest problem AI has is being fed garbage.

Pika@sh.itjust.works · 10 months ago

It’s not as much of a treasure trove as high-traffic sites, but it is defo one of the easiest to scrape. Just spin up an instance, federate with a bunch of open-federation instances, and then subscribe to the communities you are interested in.

mesa@piefed.social · edit-2 10 months ago

A lot of data gets deleted after a while. It could be a good source for AI scrapers…but because of the low engagement numbers, they will probably skip our data in favor of Facebook, which has billions of users.

daniskarma@lemmy.dbzer0.com · edit-2 10 months ago

It’s not like there’s a lack of content to train any AI. So who cares.

If it makes you feel better, it’s unlikely that most of your posts or mine are suitable for AI training.

Also, given that search on Lemmy is kind of bad, any scraper would have a harder time trying to get useful information out of all our collective garbage.

Any company willing to sink millions into training an AI would probably be better off paying some big social platform and getting good, structured data.