OpenAI strikes Reddit deal to train its AI on your posts

return2ozma@lemmy.world · 2 years ago

myliltoehurts@lemm.ee · 2 years ago

So they filled reddit with bot generated content, and now they’re selling back the same stuff likely to the company who generated most of it.

At what point can we call an AI inbred?

restingboredface@sh.itjust.works · 2 years ago

I wonder if OpenAI or any of the other firms have thought to put in any kind of stipulations about monitoring and moderating reddit content to reduce AI-generated posts and reduce the risk of model collapse.

Anybody who’s looked at reddit in the past 2 years especially has seen the impact of AI pretty clearly. If I was running OpenAI I wouldn’t want that crap contaminating my models.

Blackmist@feddit.uk · 2 years ago

They always were.

Only now they’ve agreed to pay Reddit for it. This is what their third party lockdown was really all about.

They’re helping themselves to your Lemmy comments for free, as that’s just how it’s designed. If you post anything publicly anywhere, it’s getting slurped up by a bot somewhere.

Chadus_Maximus@lemm.ee · edit-2 2 years ago

What if I say the word gasp fuck?

Blackmist@feddit.uk · 2 years ago

Well they’ve probably got filters that remove all that before it teaches their AI to swear. So you need to be more subtle for 𝑓uck’s sake.

14th_cylon@lemm.ee · 2 years ago

These fuckers see it as well. Fuckity fuckity fuck.

jordanlund@lemmy.world · 2 years ago

BRB - changing my entire 15 year reddit comment history to “Fuck Spez”. LOL.

return2ozma@lemmy.world · 2 years ago

Know any bots or ways to perma delete all Reddit comments?

thejml@lemm.ee · 2 years ago

Reddit has backups, permanently isn’t an option.

metaStatic@kbin.social · 2 years ago

yep they fuckin got us

but it’s not like our posts are safe here either. This is the world we live in now.

the_doktor@lemmy.zip · 2 years ago

We have to either make AI illegal or make it accountable by giving references to where it gets its data so it can properly cite its sources.

db2@lemmy.world · 2 years ago

They’re not multiple though, edit it and then delete it and it’s gone. They disabled all the tools to do it though so it’s manually or nothing now.

Coasting0942@reddthat.com · 2 years ago

Damn. You outsmarted them well paid data jockeys. And assuming your edits change the actual comment and don’t simply hide the original.

I could be an idiot too though. Reddit might have been running this whole shit show on the original version of the database system and be upselling to buyers.

SchmidtGenetics@lemmy.world · 2 years ago

They just reload a previous cached comment, doesn’t matter how many times you edit or delete, it’s all logged and backed up.

Imgonnatrythis@sh.itjust.works · 2 years ago

Will be interesting to see if they stoop so low as to allow this. Probably wouldn’t be a super wise move as most deleted posts are likely material that would not be great to train on anyway. My first thought when I read this was, “well, not on MY posts” I’m clean off of reddit.

mox@lemmy.sdf.org · 2 years ago

There have already been reports of people being banned and finding their posts restored in response to their attempts to delete them.

FaceDeer@fedia.io · 2 years ago

There are torrents of complete Reddit comment archives available for any random person who wants them, I’m sure Reddit themselves has a comprehensive edit history of everything.

bobs_monkey@lemm.ee · edit-2 2 years ago

I used redact.dev to mass edit all my comments, worked pretty well. Problem is that if you mass delete, they’ll restore them pretty quick, but so far they haven’t reverted my edits.

catloaf@lemm.ee · 2 years ago

https://github.com/j0be/PowerDeleteSuite

jabathekek@sopuli.xyz · 2 years ago

This is what I used a while ago to delete/edit all my comments multiple times.

Rolando@lemmy.world · 2 years ago

Back when I deleted all my comments, I was told I could claim to be in Europe and make a request citing the European law that Reddit has to follow. I think Reddit had a page where you could make the request, but of course it was hard to find.

micka190@lemmy.world · 2 years ago

Realistically, when you’re operating at Reddit’s scale, you’re probably keeping a history of each comment for analytics purposes.

RecluseRamble@lemmy.dbzer0.com · 2 years ago

That was really my thought - future iterations of ChatGPT won’t like spez very much.

Everythingispenguins@lemmy.world · 2 years ago

Some day historians will be able to look back at this moment and be able to determine it was what caused ChatGPT to become horny and weird.

assassin_aragorn@lemmy.world · 2 years ago

Only an idiot would decide to mindlessly trawl Reddit to train an LLM. They’ll be confused when their model suddenly is confidently wrong about everything and have no clue.

Everythingispenguins@lemmy.world · 2 years ago

You are a hundred percent right, but how many idiots are there out there?

assassin_aragorn@lemmy.world · 2 years ago

Uncountably many

Everythingispenguins@lemmy.world · 2 years ago

Sadly looks like we have an answer

https://lemmy.world/post/15712886

frickineh@lemmy.world · 2 years ago

My comment history was like 50% shitposting about the beauty industry and 50% hating on Christian fundamentalists. There’s honestly no way it won’t make AI at least a little bit worse, and I’m not mad about it.

Flying Squid@lemmy.world · 2 years ago

That AI is going to be super anti-Christian fundamentalist (or possibly just anti-Christian), so maybe there is an upside.

AlexWIWA@lemmy.ml · 2 years ago

LLMs have been training on Reddit posts since at least 2012. Nothing really new here.

YIj54yALOJxEsY20eU@lemm.ee · 2 years ago

Now they get to train on all the “deleted” comments/posts as well.

SparrowRanjitScaur@lemmy.world · edit-2 2 years ago

Probably not, I’m sure they’re training on Reddit’s internal data set which likely includes all deleted posts.

YIj54yALOJxEsY20eU@lemm.ee · 2 years ago

Did you just say probably not then agree with me?

SparrowRanjitScaur@lemmy.world · edit-2 2 years ago

Ya, lol. Sorry, I’m not sure if I replied to the wrong comment or just misread your comment earlier. I agree with you.

YIj54yALOJxEsY20eU@lemm.ee · 2 years ago

Lol no worries

UnderpantsWeevil@lemmy.world · 2 years ago

It’s ground zero for Bots training on other Bots

filister@lemmy.world · edit-2 2 years ago

What makes you think that they are not scraping Lemmy too? The only reason they might not be is probably how niche Lemmy and the fediverse are, but I am sure there have been people already doing it.

Dr. Moose@lemmy.world · 2 years ago

Fediverse is designed to do exactly that. It’s free flow of information which is a good thing. Don’t let corporations hijack this beautiful concept. We all want information to be free.

olympicyes@lemmy.world · 2 years ago

I’m not mad about the scraping. The linkedin scraping case pretty much cemented that there was nothing that could be done to stop it. I’m just mad that I can no longer use the app of my choice. No such problem with Lemmy.

AlexWIWA@lemmy.ml · 2 years ago

Lemmy is even easier to scrape. Just set up your own instance, then read the database after activity pub pushes everything to you.

kia@lemmy.ca · 2 years ago

I’m sure they are, but Reddit probably provides these companies with lots of personalized metadata they collect just for them which they may not get from Lemmy.

Possibly linux@lemmy.zip · edit-2 2 years ago

They now are paying Reddit? I thought they could just scrape for free.

Also, you can not delete anything on the internet. Once something is public there will always be a copy somewhere.

Fetus@lemmy.world · 2 years ago

Scraping through a website at the scale they are talking about isn’t really viable. You need access to the API so that you can have very targeted requests.

This is why reddit changed their API pricing and screwed over everyone using third party apps. They can make more money selling access to LLM trainers than they could from having millions of people using apps that rely on the API.

Dr. Moose@lemmy.world · edit-2 2 years ago

Scraping at scale is actually cheaper than buying API access. It’s a massive rising market. Try googling “web scraping service” and there are hundreds of services that provide an API to scrape any public web page, bypass the blocks for you and render all of the javascript.

BatrickPateman@lemmy.world · 2 years ago

Scraping is nice for static content, no doubt. But I wonder at what point it is easier to request changes to a developing thread via API than to request the whole page with all nested content over and over to find the new answers in there.

Dr. Moose@lemmy.world · 2 years ago

Following a developing thread is a very tiny use case I’d imagine and even then you can just scrape the backend API that is used on the public page for the same results as private API.

micka190@lemmy.world · edit-2 2 years ago

There’s actually legal precedent against scraping a website through unofficial channels, even if the information is public. But basically, if you scrape a website and hinder their ability to operate, it falls under “virtual trespassing”.

I’m assuming it would be even worse now that everyone is using the cloud and that scraping their site would cause a noticeable increase in resource cost (and thus, directly cost them more money because of cloud usage fees).

It’s why APIs are such a big deal. They provide you with an official, controlled, entry point to a platform’s data.

Dr. Moose@lemmy.world · edit-2 2 years ago

It’s the opposite! There’s legal precedent that scraping public data is 100% legal in the US.

There are a few countries where scraping is illegal though, like Japan and China. European countries often also have things called “database protection” laws that forbid replicating public databases through scraping or any other means, but that has to be a big chunk of the overall database. Also there are personally identifiable info (PII) protection laws that protect against storing people’s data without their consent (like GDPR).

Source: I work with anti bot tech and we have to explain this to almost every customer who wants to “sue the web scrapers” that lol if Linkedin couldn’t do it, you’re not suing anyone.

General_Effort@lemmy.world · 2 years ago

Refreshing to see a post on this topic that has its facts straight.

EU copyright allows a machine-readable opt-out from AI training (unless it’s for scientific purposes). I guess that’s behind these deals. It means they will have to pay off Reddit and the other platforms for access to the EU market. Or more accurately, EU customers will have to pay Reddit and the other platforms for access to AIs.

nondescripthandle@lemmy.dbzer0.com · 2 years ago

My guess is reddit was cheap enough that it made sense to pay them as a sort of insurance that they don’t get sued in the future.

Dark_Dragon@lemmy.dbzer0.com · edit-2 2 years ago

Reddit banned me through IP address or something. Whatever new account I create will be banned within 24hrs even if I don’t upvote a single post or comment. I tried with 10 new accounts, all with new email addresses, and all were banned. So I gave up and randomly changed all my good comments. Shifted permanently to lemmy. Missing some of the most niche communities. But not so much as to return to reddit.

Edit: I didn’t even commit any rule violation. Took too long to change from a modded reddit app. I only logged in once. That doesn’t amount to blocking me from ever using reddit.

dumblederp@lemmy.world · 2 years ago

If you use a vpn and a disposable email you can get about a week out of an account if you need to comment, it’ll get quietly shadowbanned though.

Dr. Moose@lemmy.world · 2 years ago

This form of propaganda is my pet peeve. It’s not “your posts” as soon as you put something to public you don’t get to eat your cake. It’s out there, you shared it. Don’t share it if you don’t want humanity to ingest and use it.

Dataprolet@lemmy.dbzer0.com · 2 years ago

You’re technically right, but nobody anticipated and therefore agreed on their posts being used for training LLMs.

SparrowRanjitScaur@lemmy.world · 2 years ago

Public information is public information.

Dataprolet@lemmy.dbzer0.com · 2 years ago

Oh boy have I bad news for you. You ever heard of copyright?

SparrowRanjitScaur@lemmy.world · 2 years ago

Have you ever heard of fair use?

Azzu@lemm.ee · edit-2 2 years ago

It’s not about it being used to train AI. It’s about the AI either not being open source/I don’t get access to it (i.e. not benefitting me) or reddit being paid for my comments (i.e. also not benefitting me).

If this AI training would get me or the public access to the AI, or I would be paid for my comments instead of Reddit, I’d be fine with it.

Dr. Moose@lemmy.world · edit-2 2 years ago

yeah but you don’t get to choose that. You give away that right as soon as you participate in public discourse. It’s a zero sum game - either it’s public for everyone or for no one.

Don’t get me wrong, Reddit is a bitch but I think people want to cut their noses off to spite their faces here. It’s much more important to have free information flow than to fuck reddit.

My fear is that people will vote in some really dumb rules to spite AI and restrict free information flow accidentally.

Azzu@lemm.ee · edit-2 2 years ago

That’s how it is currently and maybe also your opinion. But that doesn’t mean it has to be like that in a society. It’s your opinion that everything public can go private at any time (training proprietary private AI), but we can decide as a society that’s not how we want to do things. We can require stuff that used public data to be public as well.

And yeah I kinda get to choose that. As democratic society, anything that the public (i.e. including me) decides, goes. Of course, if there are people like you that don’t want stuff trained on public data to be required to be public, democracy will also work in the sense that we don’t get that, as it is currently.

boatsnhos931@lemmy.world · 2 years ago

No wonder AI is crazy AF.

macrocephalic@lemmy.world · 2 years ago

All future AI will have autocorrect errors and will look like no one read it before hitting enter. You’re welcome.

boatsnhos931@lemmy.world · 2 years ago

No one says thank you, we already have that. WAIT JUST A GOT DAMN MINUTE!! YOU ARE ONE OF THEMS!!

Mastengwe@lemm.ee · 2 years ago

Isn’t this news like every month?

noorbeast@lemmy.zip · 2 years ago

Finally found a use for MS Edge, loaded up Nuke Reddit History and removed all comments and posts: https://microsoftedge.microsoft.com/addons/detail/nuke-reddit-history/bklbcgohenjegdibgmppligaapohkgip

gravitas_deficiency@sh.itjust.works · 2 years ago

Hate to break it to you, but the time to do that was over a year ago, and even then it wasn’t ever really a sure thing - we don’t really know what their backup policies are around that stuff.

This is what the former power user community that made an exodus from Reddit roughly a year ago has been trying to communicate, but a ton of people here seem to enjoy keeping their toes in the water over there, with rather predictable consequences (literally, the post we’re commenting on).

All that said: I am very much looking forward to the absolutely titanic lawsuit around GDPR I’m sure is in the works over this.

AlexWIWA@lemmy.ml · 2 years ago

Not even a year ago. Reddit has been used for training data for well over a decade. We used it in 2012 in an AI class.

gravitas_deficiency@sh.itjust.works · 2 years ago

My point is that there was not a revenue-generating b2b contract allowing another company to exploit it at scale, while compensating Reddit directly.

AlexWIWA@lemmy.ml · 2 years ago

My apologies. I missed it

humorlessrepost@lemmy.world · 2 years ago

Worth doing, but I suspect they’re sending OpenAI snapshots of the database from before you did that.

snownyte@kbin.social · 2 years ago

Wish I had known this beforehand in like several accounts I’ve had with that shit-ass place.

Then again, it’s likely that Reddit has shit archived because Spez is one of them data-farmers like Mark is. Nothing is truly deleted from their sites. It’s just archived.

There’s been lots of evidence that proves this, because people have dug up old comments, even down to who posted it originally. Then, even if your account is deleted, your comment body is still there, I know because I’ve deleted an account and checked back where I was before.

jeanofthedead@sh.itjust.works · 2 years ago

Does this mean I can stop prefacing my AI requests with “According to Reddit…”?

RizzRustbolt@lemmy.world · 2 years ago

Those poor silicon atoms…

Ex Nummis@lemmy.world · 2 years ago

I didn’t delete my comments before nuking my account, but I’m pretty sure the grand majority were shitposts containing ample amounts of smut, gore and other ridiculous over the top shit. So I consider this a win.

Reddit’s deal with OpenAI will plug its posts into “ChatGPT and new products”