fiat_lux

Relocated from: @fiat_lux@lemmy.world (04-2026)

  • 0 Posts
  • 5 Comments
Joined 4 days ago
Cake day: April 24th, 2026

  • We can see that it’s solved by the fact that AI models continue to get better despite an increasing amount of AI-generated data being present in the world that training data is being drawn from.

    Even if it logically followed that model improvement means model collapse is a solved problem (which it absolutely doesn’t), the premise that models are still improving to a significant degree is itself up for debate.

    [Line graph: Massive Multitask Language Understanding (MMLU) Pro benchmark scores over time, 07-2023 to 01-2026, showing plateauing values]

    A lot of people really want to believe that AI is going to just “go away” somehow, and this notion of model collapse is a convenient way to support that belief

    For some people, model collapse may be an argument used to support a hope that AI will go away, but whether or not that hope is realistic does not alter the validity of the model collapse problem.

    You can tell it’s not a solved problem because researchers are still trying to quantify the risk and severity of collapse - as you can see even just from the abstracts in the links I provided.

    Some choice excerpts from the abstracts, for those who don’t want to click the links:

    Our results show that even the smallest fraction of synthetic data (e.g., as little as 1% of the total training dataset) can still lead to model collapse

    …we establish … that collapse can be avoided even as the fraction of real data vanishes. On the other hand, we prove that some assumptions … are indeed necessary: Without them, model collapse can occur arbitrarily quickly, even when the original data is still present in the training set.


  • It can’t only be from data from previous generations, even if the initial demonstration used that, because that would mean a single piece of human-generated text is sufficient to avoid collapse.

    The loss of data from generation to generation is one way model collapse can occur, but it’s only one way. The underlying issues that cause collapse are replication of errors and increasing data homogeneity (a toy sketch of both effects is at the end of this comment). In a world where an unknown proportion of new data is AI-generated, it is not possible to ensure that only a limited amount of it ends up in future training data.

    Additionally, as new human-generated content is itself based on information provided by AI, even when AI output isn’t used directly in the construction of the text, the error-replication and data-diversity issues cross over from being only an AI-generated-content problem to an all-content problem. You can see examples of this happening now in the media, where a journalist relies on AI output for fact-checking and the resulting article, error included, gets republished by other outlets.

    Real-world AI training methods may stave off some model collapse, if we ignore existing issues around the cultural homogeneity of training data across all time periods, or assume the models are sufficiently weighted to mitigate those issues, but it’s by no means settled that collapse is a non-problem.

    You’ve mentioned using data mixing to prevent collapse, but some of the research suggests that even iterative mixing isn’t sufficient, depending on the ratio of real to synthetic data. Strong Model Collapse (Dohmatob, Feng, Subramonian, Kempe, 2024) goes into that, and since then there’s been When Models Don’t Collapse: On the Consistency of Iterative MLE (Barzilai, Shamir, 2025), which presents one theoretical case where collapse won’t occur provided some assumptions hold, though the math is beyond me. They also note multiple situations where near-instant collapse can occur.

    How much data poisoning might affect any of that is not at all clear; it would need to be present in sufficient quantity for a given model to have an effect, but it certainly wouldn’t help. The recent Bixonimania scandal suggests it’s feasible.
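
    To make the error-replication and homogeneity points above a bit more concrete, here’s a deliberately crude toy sketch in Python (my own illustration with made-up numbers, not the setup from any of the papers above): a 1-D Gaussian is refit and resampled generation after generation, once on purely synthetic output and once with a fixed slice of the original “real” data mixed back in each round.

      # A toy, purely illustrative sketch (my own, not the setup of any paper
      # cited above): fit a 1-D Gaussian to a dataset, sample a "synthetic"
      # dataset from the fit, and retrain on that, generation after generation.
      # The drifting mean stands in for replicated errors; the shrinking standard
      # deviation stands in for increasing homogeneity. All numbers are made up.
      import numpy as np

      rng = np.random.default_rng(0)
      N = 100                          # samples per generation (small, so the effect shows quickly)
      real = rng.normal(0.0, 1.0, N)   # the original "human" data: mean 0, std 1

      def next_generation(data, n_real):
          """Fit a Gaussian by MLE, draw synthetic samples from the fit,
          then mix back in n_real points of the original real data."""
          mu, sigma = data.mean(), data.std()
          synth = rng.normal(mu, sigma, N - n_real)
          return np.concatenate([real[:n_real], synth])

      def run(generations, real_fraction):
          data, history = real, []
          for g in range(generations):
              data = next_generation(data, int(real_fraction * N))
              if g % 100 == 0:
                  history.append((g, round(float(data.mean()), 3), round(float(data.std()), 3)))
          return history

      print("pure synthetic retraining   :", run(500, 0.0))
      print("20% real data mixed back in :", run(500, 0.2))

    With these toy settings the pure-synthetic run drifts and narrows while the mixed run stays anchored near the original distribution, which is the intuition behind the mixing argument; the papers above are precisely about when that intuition does and doesn’t hold for actual models, so don’t read anything quantitative into it.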


  • “model collapse” was demonstrated by repeatedly training generation after generation of models on the output of previous generations

    the best models these days are trained largely on synthetic data - data that’s been pre-processed by other AIs to turn it into stuff that makes for better training material

    You can prevent model collapse simply by enriching the training data with good data - stuff that is already archived, that can’t be “contaminated”.

    This feels like an odd juxtaposition.

    If model collapse can be avoided by enriching with uncontaminated data, and model collapse comes from using training data generated by previous generations, doesn’t that imply that:

    1. Either the best models are headed towards model collapse, or
    2. Models can’t be updated because modern data isn’t usable?


  • There are two surprising aspects of this to me. The first is that the employees feel confident enough to express concern about Palantir’s actions in official channels. I would have thought that the nature of their work was obvious enough that this would be a cultural taboo and therefore self-censored. I guess some of them have limits to how far they can suspend disbelief about what they had likely internally framed as “work for the benefit of national security” or “job pays too well to care”.

    The second is that not all of this official-channel discussion was immediately wiped by Palantir, though perhaps they too relied on self-censorship to keep these conversations from happening at scale.

    Either way, I’m somewhat relieved there’s someone at Palantir worried about this at all. The more of them who are worried by this, the more leaks we’ll see.