AI’s Unpaid Debt: How LLM Scrapers Destroy the Social Contract of Open Source

yoasif@fedia.io · 2 days ago

AI’s Unpaid Debt: How LLM Scrapers Destroy the Social Contract of Open Source

neukenindekeuken@sh.itjust.works · edit-2 2 days ago

There is absolutely no way you’re using an LLM to rewrite the Linux kernel in any way. That’s not what they do, and whatever it produced wouldn’t be even a fraction as effective as the current kernel.

They’re text prediction machines. That’s it. Markov generators on steroids.

I’d also be curious about where that 15-20% productivity increase comes from in aggregate. That’s an extremely misleading statistic. The truth is there are no consensus data on any productivity improvements with LLMs today in aggregate. Anything anyone has is made up. It’s also not taking into account the additional bugs and issues caused by LLMs, which are significant, and also not a thing you want to have happening on every PR with kernel code, I promise.

Regardless of all of that, the companies with these LLMs are using free software to train their models to make money without making their models free and open source or providing a way for people to use them for free/open source projects, so this is a clear violation of every single FOSS license model I’m familiar with (most commonly used is the Apache one).

TL;DR: they are stealing code meant to be free and public with any derivative works, profiting off it, and then refusing to honor the license model of the code/project they stole.

This is illegal. The only reason we’re not seeing a lot about it is that these FOSS projects generally have no money and are not going to sue them and potentially lose a substantial sum of their negligible funds in court. That’s it. Otherwise, what they are doing is very illegal. It’s the sort of thing the legal team at any professional software development company warns you about the second you start using an OSS project in your for-profit business application codebase.

LLMs get away with it because $$$$$$$$$$$$$$$$$. That’s it.

Edit: added link to security article with LLMs

melfie@lemy.lol · edit-2 2 days ago

> I’d also be curious about where that 15-20% productivity increase comes from in aggregate.

This is from a Stanford study that is summarized here:

https://www.linkedin.com/pulse/does-ai-actually-boost-developer-productivity-striking-çelebi-tcp8f

There are other studies with different conclusions, but this one aligns with my own experience. To your point about how AI won’t reproduce the Linux kernel, this study also points out that AI is significantly less effective, even going into the negative, with complex codebases, which is in agreement with what you said, since the Linux kernel certainly qualifies as a complex codebase.

> they are stealing code meant to be free and public with any derivative works, profiting off it, and then refusing to honor the license model of the code/project they stole.

I agree big tech is using open source unethically, but how much different is this situation from the other ways big tech profits from open source without contributing back? Training proprietary LLMs on open source code is shitty, rent-seeking behavior, but not really a unique development, and certainly not something that undermines the core value of open source.

yoasif@fedia.io · 1 day ago

> Training proprietary LLMs on open source code is shitty, rent-seeking behavior, but not really a unique development, and certainly not something that undermines the core value of open source.

Destroying “share alike” doesn’t undermine the core value of open source? What IS the core value?

melfie@lemy.lol · edit-2 1 day ago

The LLMs are not distributing the GPL code; their weights are being trained on it. You can’t just have Copilot pump out something that works like the Linux kernel or Blender, except with different code that isn’t subject to the GPL license. At best, the AI can learn from it and assist humans with developing a proprietary alternative. In that case, it’s not really that much better than having humans study a GPL codebase and make a proprietary alternative without AI. It’s still going to cost a lot of money to replicate the thing no matter what, so why not just save money and use the GPL code and contribute back? Also, it’s going to be hard to sell your proprietary alternative, because why wouldn’t people just use the FOSS version?

yoasif@fedia.io · 1 day ago

You can’t “train” on code you haven’t copied. That is kind of obvious, right? So did they have the right to copy and then reproduce the work without attribution?

melfie@lemy.lol · 1 day ago

Yeah, I guess this is a bit of a gray area. With GPL, you only have rights to code if it was distributed to you. If GPL code has only been distributed to select people, none of whom distributed it to the general public, and GitHub still trained their models on the private repo, then that would technically be in violation of the license. This would be a more niche scenario, though, since the intent normally is public distribution.