• mm_maybe@sh.itjust.works
    2 months ago

    Scaling laws are disputed, but if someone has in fact already trained a general-purpose LLM using only permissively licensed data, great! Can you point me to a checkpoint on Hugging Face, a GitHub repository hosting the relevant code, or even a paper or blog post about it? I’ve been looking and haven’t found anything like that yet.

    • General_Effort@lemmy.world
      2 months ago

      “Scaling laws are disputed”

      Not in general.

      There is not enough permissively licensed text to train models of any substantial size, and what there is lacks diversity: Wikipedia, government documents, Stack Overflow, century-old material, … An LLM trained on that is unlikely to be called “general purpose”, because of scaling laws. Such small models are sometimes trained for research purposes, but I don’t have a link ready. They are not something you’d actually use. Perhaps you could look at Microsoft’s Phi series of models. They are trained largely on synthetic data, though that’s probably not what you are looking for.