AI’s Unpaid Debt: How LLM Scrapers Destroy the Social Contract of Open Source

yoasif@fedia.io · 2 days ago

Do you understand how free software works? Did you read the post? I’d love to clarify, but I’m not going to rewrite the article.

atzanteol@sh.itjust.works · 2 days ago

Yes. And this is kinda hand-wavy bullshit.

“By incorporating copyleft data into their models, the LLMs do share the work - but not alike. Instead, the AI strips the work of its provenance and transforms it to be copyright free.”

That’s not how it works. Your code is not “incorporated” into the model in any recognizable form. It trains a model of vectors. There isn’t a file with your for loop in there though.

I can read your code, learn from it, and create my own code with the knowledge gained from your code without violating an OSS license. So can an LLM.

yoasif@fedia.io · 1 day ago

“I can read your code, learn from it, and create my own code with the knowledge gained from your code without violating an OSS license.”

Why is Clean-room design a thing then?

atzanteol@sh.itjust.works · 1 day ago

“create my own code with the knowledge gained from your code”

Not copy your code. Use it to learn what algorithms it uses and ideas on how to implement it.

VoterFrog@lemmy.world · 1 day ago

“I can read your code, learn from it, and create my own code with the knowledge gained from your code without violating an OSS license. So can an LLM.”

Not even just an OSS license. No license backed by law is any stronger than copyright. And you are allowed to learn from or statistically analyze even fully copyrighted work.

Copyright is just a lot more permissive than I think many people realize. And there’s a lot of good that comes from that. It’s enabled things like API emulation and reverse engineering and being able to leave our programming job to go work somewhere else without getting sued.

atzanteol@sh.itjust.works · 2 days ago

Also - this conclusion is ridiculous:

“By incorporating copyleft data into their models, the LLMs do share the work - but not alike. Instead, the AI strips the work of its provenance and transforms it to be copyright free.”

That is absolutely not true. It doesn’t remove the copyright from the original work and no court has ruled as such.

If I wrote a “random code generator” that just happened to create the source code for Microsoft Windows in its entirety, it wouldn’t strip Microsoft of its copyright.