• 2 Posts
  • 22 Comments
Joined 1 year ago
Cake day: June 14th, 2023


  • Now I sail the high seas myself, but I don’t think Paramount Studios would buy anyone’s defence they were only pirating their movies so they can learn the general content so they can produce their own knockoff.

    We don’t know exactly how they source their data (and that is definitely shady), but if I can gain access to a movie in a legal way, I don’t see why I would not be able to gather statistics from said movie, including running a speech-to-text model to caption it, then computing statistics of how many times certain words were used, and which words followed them. This is an oversimplified explanation of what an LLM does, but it’s the fairest I can come up with, and it would be legal to do so. The models are always orders of magnitude smaller than the data they are trained on.
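
    To make the “which word follows which” idea concrete, here is a minimal sketch in C++. It is only the toy simplification described above, not how an LLM is actually trained, and the `count_bigrams` name and `BigramCounts` alias are made up for illustration.

    ```cpp
    #include <map>
    #include <sstream>
    #include <string>
    #include <utility>

    // Hypothetical alias: maps a (word, next word) pair to how often it occurs.
    using BigramCounts = std::map<std::pair<std::string, std::string>, int>;

    // Counts how often each whitespace-separated word is directly followed
    // by another word in the given text.
    BigramCounts count_bigrams(const std::string& text) {
        BigramCounts counts;
        std::istringstream words(text);
        std::string previous, current;
        if (!(words >> previous)) {
            return counts;  // empty input: nothing to count
        }
        while (words >> current) {
            ++counts[{previous, current}];
            previous = current;
        }
        return counts;
    }
    ```

    For the input "the cat sat on the cat", this records the pair (the, cat) twice and every other pair once. The table only holds counts, so it does not keep the original text in order.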

    That said, I don’t imply that I’m happy with the state of high tech companies, the AI hype, the energy consumption, or the impact on the humble people. But I’ve put a lot of thought into this (and learning about machine learning for real), and I think this is not a ML problem, but a problem in the economic, legal and political system. AI hype is just a symptom.


  • But then it does go on to quote materials verbatim, which shows it’s not “just” ‘extracting patterns’.

    It is just extracting patterns. It is making statistical samples of which token (“word”, informally speaking) is likely to follow, given the previous stream.

    It can only reproduce passages of things it has seen many, many times. It cannot reproduce the whole work. Those two quotes can be seen elsewhere on the internet plenty of times. And it’s fair use there, so it would be fair use with a chat bot as well.

    There have been papers published where researchers were able to regenerate an image that was present in the training set of Stable Diffusion. But they were only able to find that image (and others) in particular, because they were present in the training set multiple times, and the caption was the same (it was the portrait picture of some executive at a company).

    when given the book and pages — quote copyrighted works

    Yeah, you are not gonna be able to do that with an LLM. They will be able to quote only some passages, and only of popular books that have been quoted often enough.

    Even if they started to use my service to literally copy entire books?

    You cannot do that with an LLM.

    Why are you defending massive corporations who could just pay up? Isn’t the whole “corporations putting profits over anything” thing a bit… seen already?

    I hate that some corporations are burning money, resources and energy on this, but the solution is not to restrict fair use even further. Machine Learning is complex, but if I had to summarize it in some way, it is “just” gathering statistics of which word comes next (in the case of a text model). This is no different than getting a large corpus of text, and sampling it for word frequency, letter frequency, N-gram frequency, etc. It is well known that this is fair use. You only store the copyrighted works to run the software and produce a very transformative work that is a summary many orders of magnitude smaller than the copyrighted work. This is fair use, and it should still be. Changing that is gonna harm the public, small companies and independent researchers way more than big tech companies.

    As I said in another comment, I would very much welcome a way to force big corpos to release their models. Make a model bigger than N parameters? You needed too much fair use in one gulp: your model has to be public, and in the public domain. I would fucking welcome that! But going in the opposite direction is just risky.

    I don’t understand why small individuals think that copyright is their friend, and will protect them from big tech companies. Copyright will always harm the weak and protect the powerful as a net result. It’s already a miracle that we can enjoy free software and culture by licenses that leverage copyright in our favor.


  • “Theft” is never a technically accurate word when dealing with so-called “intellectual property”, because copying digital content without authorization is legal in tons of cases, and because, come on, property is very explicitly exclusive. I cannot copy my house or my car, but I can make copies of my works for virtually 0 cost.

    Using data for training ML models is even explicitly allowed in some jurisdictions (e.g. Japan), and is likely to be fair use everywhere else. LLMs are very transformative, and while they often can produce verbatim copies of fragments of copyrighted works, they don’t store the whole works or significant pieces of them.

    Don’t get me wrong, I don’t like big companies making big money. I would not mind a law that would force models to be open sourced. But preventing them from training their models on public data by restricting fair use would harm them very little (they could pay something if they are making some profit), while small researchers or companies would never be able to compete, because they could not afford the upfront costs, and they lack the economic engineering to disguise profits and pay less.


  • Yes. There is already an answer with many votes saying so, but I’ll add myself to the list.

    I don’t have to like all the language, and not even all of the standard library. I learnt C++ with the Qt library, and I still do 99% of my development using Qt because it’s the kind of software that I like to write the most. I can choose the parts that I like the most about the full C++ ecosystem, like most people do (you would have to see how different game development is, for example).

    I’m also learning Rust, and I see nothing wrong with it. It’s just that I see C++ better for the kind of stuff that I need to write (at this time at least).








  • suy@programming.devtoProgramming@programming.dev...
    5 months ago

    Related: There is an article on LWN called Lua and Python, which is mostly about the approach of the two languages WRT being “batteries included” or not.

    I think Lua being a bit barebones is 100% fine… if you just pair it with a good helper library, or set of libraries with a coherent API, that allows it to thrive. Then you can either use the framework library or not, depending on whether your project requires the extras, or can do without.

    As a parallel, I’ve been doing C++ development for almost two decades, and I cannot imagine doing anything non-trivial without Qt. For example, Qt has a debug framework that automatically pretty-prints most containers, and also adds the newline automatically. Also, QString is an actual string type, whereas std::string is more like QByteArray. It’s functionality that is essential for me (and these are just the minimal examples… then Qt has all the GUI functionality, of course, but I use Qt even in console-only programs!).
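
    A small sketch of what “std::string is more like QByteArray” means in practice: std::string counts bytes and knows nothing about the encoding, so counting characters in UTF-8 text is up to you. The `utf8_code_points` helper is hypothetical, written only for this illustration, and it assumes the input is valid UTF-8. Qt itself is left out so that the snippet builds with just the standard library.

    ```cpp
    #include <cstddef>
    #include <string>

    // Counts Unicode code points in a UTF-8 encoded std::string by skipping
    // continuation bytes (those of the form 10xxxxxx). Assumes valid UTF-8.
    std::size_t utf8_code_points(const std::string& utf8) {
        std::size_t count = 0;
        for (unsigned char byte : utf8) {
            if ((byte & 0xC0) != 0x80) {
                ++count;  // lead byte or plain ASCII byte: starts a code point
            }
        }
        return count;
    }
    ```

    With `std::string s = "caf\xC3\xA9";` (the UTF-8 bytes of “café”), `s.size()` is 5 because it counts bytes, while `utf8_code_points(s)` is 4. QString, by contrast, stores UTF-16 code units, so for this particular word its length matches the number of characters.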

    This is surely opinionated on my side, and most C++ devs don’t see it this way, but my point is that a language with a “core experience” that is lackluster to you should not be a bad thing if the language is capable enough to provide an ecosystem with a good 3rd party library that adds exactly what you want. In the Lua ecosystem that may be Penlight.

    But I totally get your point. Penlight doesn’t even seem to have a math library, so I found no round implementation there. This may not be a problem for some, but it is a deal-breaker for others.


  • I’m not fully sure what the intent of the joke is, but note that yes, it’s true that a header typically just has the prototype. However, tons of more advanced libraries are “header-only”. Everything is in a single header originally, in development, or it’s a collection of headers (that optionally gets “amalgamated” as a single header). This is sometimes done intentionally to simplify integration of the library (“just copy these files to your repo, or add them as a submodule”), but sometimes it’s entirely necessary because the code is just template code that needs to be in a header.
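
    As a rough sketch of the header-only style, this is what a tiny single-file “library” could look like. The file name `clamp_to.hpp`, the `tiny` namespace and both functions are invented for the example.

    ```cpp
    // clamp_to.hpp: a hypothetical single-file, header-only "library".
    #ifndef CLAMP_TO_HPP
    #define CLAMP_TO_HPP

    namespace tiny {

    // With ordinary (implicit) instantiation, the compiler needs to see the
    // full template definition at the point of use, so it lives in the header
    // rather than in a separate .cpp file.
    template <typename T>
    const T& clamp_to(const T& value, const T& low, const T& high) {
        if (value < low) return low;
        if (high < value) return high;
        return value;
    }

    // A non-template function defined in a header is marked `inline`, so that
    // including the header from several .cpp files does not produce
    // multiple-definition errors at link time.
    inline int answer() { return 42; }

    }  // namespace tiny

    #endif  // CLAMP_TO_HPP
    ```

    A project would just `#include "clamp_to.hpp"` and call `tiny::clamp_to(15, 0, 10)`, which yields 10, with no separate library to build or link.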

    C++20 adds modules, and the situation is a bit more involved, but I’m not confident enough to elaborate on this. :) Compile times are much better, but it’s something that the build system and the compilers need to support.






  • The problem is not that the US is sparse, it’s that its cities are. You are probably misunderstanding the problem, and if not, you are not explaining it correctly. Check out The Dumbest Excuse for Bad Cities from Not Just Bikes for a breakdown of the issue.

    No one is blaming you individually, or even the US citizens individually. The problems are multiple for sure, but you won’t start to fix it unless you understand the issue properly. Maybe it’s not your case, but many US citizens are surely not seeing the point at all.