• Wrench@lemmy.world
    English
    arrow-up
    9
    arrow-down
    2
    ·
    2 months ago

    Both can be true.

    Preserved and curated datasets to train AI on, gathered before AI was mainstream. These have the disadvantage of being stuck in time, so to speak.

    New datasets that will inevitably contain AI-generated content, even with careful curation. So to take the other commenter’s analogy, it’s a shit sandwich that has some real ingredients and doodoo smeared throughout.

    • FaceDeer@fedia.io
      arrow-up
      5
      arrow-down
      2
      ·
      2 months ago

      They’re not both true, though. It’s actually perfectly fine for a new dataset to contain AI-generated content, especially when it’s mixed in with non-AI-generated content. It can even be better in some circumstances; that’s what “synthetic data” is all about.

      The various experiments demonstrating model collapse have to go out of their way to make it happen, by deliberately recycling model outputs over and over without using any of the methods that real-world AI trainers use to ensure that it doesn’t happen. As I said, real-world AI trainers are actually quite knowledgeable about this stuff. Model collapse isn’t some surprising new development that they’re helpless in the face of. It’s just another factor to include in the criteria for curating training datasets. It’s already a “solved” problem.
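
To make the contrast concrete, here is a toy sketch (not the setup of any published experiment, and not a claim about what any particular lab does). The "model" is just a Gaussian fitted to 20 numbers, and the names `fit` and `run` are made up for this illustration. With `real_fraction=0.0`, each generation trains only on the previous generation's outputs; with `real_fraction=0.5`, half of each training set is fresh real data.

```python
import random
import statistics


def fit(samples):
    """'Train' the toy model: estimate a Gaussian's mean and std dev."""
    return statistics.fmean(samples), statistics.pstdev(samples)


def run(generations, n, real_fraction, seed=0):
    """Return the fitted std dev after repeatedly retraining the toy model.

    real_fraction is the share of each generation's training set drawn
    from the true distribution N(0, 1); the rest is sampled from the
    previous generation's model (i.e. synthetic data).
    """
    rng = random.Random(seed)

    def real(k):
        return [rng.gauss(0.0, 1.0) for _ in range(k)]

    mu, sigma = fit(real(n))  # generation 0 is trained on real data only
    n_real = int(n * real_fraction)
    for _ in range(generations):
        synthetic = [rng.gauss(mu, sigma) for _ in range(n - n_real)]
        mu, sigma = fit(synthetic + real(n_real))
    return sigma


# Pure recycling: the fitted spread drifts toward zero (the toy "collapse").
print("recycled only:", run(500, 20, real_fraction=0.0))
# Half real data each generation: the spread stays on the order of the true value.
print("mixed 50/50:  ", run(500, 20, real_fraction=0.5))
```

The collapse in the first run comes from estimation error compounding when nothing anchors the model to the real distribution. Mixing in real data is only one simple anchor; it stands in here for the broader curation methods mentioned above.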

      The reason these articles keep coming around is that there are a lot of people who don’t want it to be a solved problem, and who love clicking on headlines that say it isn’t. I guess if it makes them feel better they can go ahead and keep doing that, but supposedly this is a technology community, and I would expect there to be some interest in the underlying truth of the matter.