OpenAI Pleads That It Can’t Make Money Without Using Copyrighted Materials for Free

flop_leash_973@lemmy.world · 2 months ago


mm_maybe@sh.itjust.works · 2 months ago

Scaling laws are disputed, but if an effort has in fact already been undertaken to train a general-purpose LLM using only permissively licensed data, great! Can you send me the checkpoint on Hugging Face, a GitHub page hosting the relevant code, or even a paper or blog post about it? I’ve been looking and haven’t found anything like that yet.

General_Effort@lemmy.world · 2 months ago

“Scaling laws are disputed”

Not in general.

There is not enough permissively licensed text to train models of any meaningful size, and what there is lacks diversity: Wikipedia, government documents, Stack Overflow, century-old material, … An LLM trained on that is unlikely to be called “general purpose”, because scaling laws tie model quality to the amount of training data, among other factors. Such small models are sometimes trained for research purposes, but I don’t have a link ready. They are not something you’d actually use. Perhaps you could look at Microsoft’s Phi series of models. They are trained on synthetic data, though that’s probably not what you are looking for.

mm_maybe@sh.itjust.works · 2 months ago

Yes, I’ve written extensively about Phi and other related issues in a blog post, which I’ll share here: https://medium.com/@matthewmaybe/data-dignity-is-difficult-64ba41ee9150