ChatGPT is full of sensitive private information and spits out verbatim text from CNN, Goodreads, WordPress blogs, fandom wikis, Terms of Service agreements, Stack Overflow source code, Wikipedia pages, news blogs, random internet comments, and much more.

  • NevermindNoMind@lemmy.world
    link
    fedilink
    English
    arrow-up
    38
    arrow-down
    1
    ·
    2 years ago

    This is interesting in terms of copyright law. So far the lawsuits from Sarah Silverman and others haven’t gone anywhere on the theory that the models do not contain a copies of books. Copyright law hinges on whether you have a right to make copies of a work. So the theory has been the models learned from the books but didn’t retain exact copies, like how a human reads a book and learns it’s contents but does not store an exact copy in their head. If the models “memorized” training data, including copyrighten works, OpenAI and others may have a problem (note the researchers said they did this same thing on other models).

    For the silicone valley drama addicts, I find it curious that the researchers apparently didn’t do this test on Bard of Anthropic’s Claude, at least the article didn’t mention them. Curious.

    • BetaDoggo_@lemmy.world
      link
      fedilink
      English
      arrow-up
      13
      ·
      2 years ago

      The paper suggests it was because of cost. The paper mainly focused on open models with public datasets as its basis, then attempted it on gpt3.5. They note that they didn’t generate the full 1B tokens with 3.5 because it would have been too expensive. I assume they didn’t test other proprietary models for the same reason. For Claude’s cheapest model it would be over $5000, and bard api access isn’t widely available yet.