I assume they all crib from the same training sets, but surely one of the billion-dollar companies behind them can make their own?
It’s not easy. LLMs take so much training data that at this point, their training data is basically every publicly available book, all the blogs on the internet, pretty much all of Tumblr, Reddit, Stack Overflow, and every forum you can think of. Even then, some LLMs need still more data. So companies have started outright stealing it - pirating books, downloading from Anna’s Archive, etc.
So no, no billion-dollar company can make its own training data. Even if you plug in every email ever sent on Gmail, Google still won’t have enough data to train a good LLM. So they go with the cheaper option - training data that has already been collected, sorted, cleaned, and labeled.
In one sense, they’re again stealing others’ hard work - rather than cleaning their own data, they use public data sets. In another sense, even that’s not enough.
So is it like planting the same seeds into different soils, and expecting to get different fruits?
That’s an extreme simplification, but yes, that’s the gist.
This statement brought with it the terrifying thought that there’s a dystopian alternate timeline where companies do make their own training data, by commissioning untold numbers of scientists, engineers, artists, researchers, and other specialists to undertake work that no one else has. But rather than furthering the sum of human knowledge, or even directly commercializing the fruits of that research, it’s all just fodder to throw into the LLM training set. A world where knowledge is not only gatekept, Elsevier-style, but isn’t even accessible to humans: only the LLM gets to read it and digest it for human consumption.
Written by humans, read by AI, spoonfed to humans. My god, what an awful world that would be.
We’re already living in it. Professional voice actors now have the choice between vying for the dwindling number of voice acting gigs or selling their voice (via commissioned recordings) to LLM companies as training data.
Well, technically, the "AI"s… can generate their own additional training data…
But then trying to train another AI on said AI-generated data… well, that’s how you get model collapse: the new model gets dumber and less coherent, and develops weirder, stronger ‘quirks’ with each generation.
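(If you want to see the mechanism in miniature, here’s a toy sketch in Python - my own illustration, not anything an actual lab runs: “train” a model by fitting a Gaussian to some data, “generate” synthetic data by sampling from the fit, then train the next generation only on that output. On most runs the estimates wander and the spread decays, which is the statistical core of model collapse.)

import random
import statistics

random.seed(0)

def fit(samples):
    # "Training": estimate the distribution's mean and standard deviation.
    return statistics.mean(samples), statistics.stdev(samples)

def generate(mean, std, n):
    # "Generation": sample synthetic data from the fitted model.
    return [random.gauss(mean, std) for _ in range(n)]

# Generation 0 trains on real data: 20 samples from a true N(0, 1).
data = generate(0.0, 1.0, 20)

for gen in range(30):
    mean, std = fit(data)
    print(f"gen {gen:2d}: mean={mean:+.3f}  std={std:.3f}")
    # Every later generation trains ONLY on the previous model's output.
    data = generate(mean, std, 20)

Each generation inherits the previous one’s sampling noise, and whatever the fit loses - the tails, mostly - is gone for good. That’s the “more stupid, stronger quirks” spiral in its simplest form.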
But yeah, as far as I see it, there’s basically zero chance an LLM advances beyond ‘very fancy autocomplete’ toward AGI or actual metacognition - the capacity to think about its own thinking and then modify it.
Sorry, but you’re not gonna get a superintelligence if it isn’t capable of actually assessing and correcting itself.