Excerpt:

“Even within the coding, it’s not working well,” said Smiley. “I’ll give you an example. Code can look right and pass the unit tests and still be wrong. The way you measure that is typically in benchmark tests. So a lot of these companies haven’t engaged in a proper feedback loop to see what the impact of AI coding is on the outcomes they care about. Lines of code, number of [pull requests], these are liabilities. These are not measures of engineering excellence.”

Measures of engineering excellence, said Smiley, include metrics like deployment frequency, lead time to production, change failure rate, mean time to restore, and incident severity. And we need a new set of metrics, he insists, to measure how AI affects engineering performance.
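Four of the metrics listed above reduce to simple arithmetic over deployment records. The following is a minimal sketch only: the `Deployment` record shape and the `delivery_metrics` function are hypothetical names, not taken from the article, and time to restore is approximated as the gap between a failing deployment and the restore time. Incident severity is left out because it needs a severity scale the excerpt does not define.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Optional


@dataclass
class Deployment:
    """One production deployment (hypothetical record shape)."""
    committed_at: datetime                  # when the change was committed
    deployed_at: datetime                   # when it reached production
    caused_failure: bool = False            # did it trigger an incident?
    restored_at: Optional[datetime] = None  # when service was restored, if it failed


def delivery_metrics(deployments: list[Deployment], period_days: int) -> dict:
    """Compute four delivery metrics over one reporting period."""
    if not deployments:
        raise ValueError("need at least one deployment")
    failures = [d for d in deployments if d.caused_failure]
    lead_times = [d.deployed_at - d.committed_at for d in deployments]
    # Assumption: restore time is measured from the failing deployment.
    restore_times = [
        d.restored_at - d.deployed_at for d in failures if d.restored_at is not None
    ]
    return {
        "deployments_per_day": len(deployments) / period_days,
        "mean_lead_time": sum(lead_times, timedelta()) / len(lead_times),
        "change_failure_rate": len(failures) / len(deployments),
        "mean_time_to_restore": (
            sum(restore_times, timedelta()) / len(restore_times)
            if restore_times
            else None
        ),
    }


# Made-up sample data: four deployments in one week, one of which failed.
t0 = datetime(2024, 1, 1, 9, 0)
sample = [
    Deployment(t0, t0 + timedelta(hours=4)),
    Deployment(t0 + timedelta(days=1), t0 + timedelta(days=1, hours=8)),
    Deployment(
        t0 + timedelta(days=2),
        t0 + timedelta(days=2, hours=6),
        caused_failure=True,
        restored_at=t0 + timedelta(days=2, hours=7),
    ),
    Deployment(t0 + timedelta(days=3), t0 + timedelta(days=3, hours=2)),
]
metrics = delivery_metrics(sample, period_days=7)
```

Comparing these numbers before and after adopting an AI assistant is one way to get the feedback loop the quote says is missing.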

“We don’t know what those are yet,” he said.

One metric that might be helpful, he said, is measuring tokens burned to get to an approved pull request – a formally accepted change in software. That’s the kind of thing that needs to be assessed to determine whether AI helps an organization’s engineering practice.
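As a rough sketch of the "tokens burned per approved pull request" idea, assuming token usage can be attributed to individual pull requests (the `PullRequest` record and its fields are hypothetical):

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class PullRequest:
    """Hypothetical record: one pull request and the tokens spent producing it."""
    pr_id: str
    tokens_used: int
    approved: bool


def tokens_per_approved_pr(prs: list[PullRequest]) -> Optional[float]:
    """Total tokens spent divided by the number of approved pull requests.

    Tokens spent on rejected pull requests stay in the numerator,
    because they were burned on the way to the approved ones.
    """
    total_tokens = sum(pr.tokens_used for pr in prs)
    approved = sum(1 for pr in prs if pr.approved)
    if approved == 0:
        return None  # nothing was approved, so the ratio is undefined
    return total_tokens / approved


# Made-up numbers for illustration.
history = [
    PullRequest("a", tokens_used=120_000, approved=True),
    PullRequest("b", tokens_used=300_000, approved=False),
    PullRequest("c", tokens_used=80_000, approved=True),
]
cost = tokens_per_approved_pr(history)  # 500_000 tokens / 2 approved PRs
```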

To underscore the consequences of not having that kind of data, Smiley pointed to a recent attempt to rewrite SQLite in Rust using AI.

“It passed all the unit tests, the shape of the code looks right,” he said. “It’s 3.7x more lines of code that performs 2,000 times worse than the actual SQLite. Two thousand times worse for a database is a non-viable product. It’s a dumpster fire. Throw it away. All that money you spent on it is worthless.”
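The gap between "passes the unit tests" and "performs acceptably" can be shown with a toy example. This sketch has nothing to do with SQLite or the Rust rewrite itself: the two lookup functions, the data, and the `MAX_SLOWDOWN` budget are all hypothetical. Both functions return correct results, but only a benchmark against the reference exposes the slow one.

```python
import timeit

N = 5_000
ROWS = [(i, f"row-{i}") for i in range(N)]
INDEX = dict(ROWS)


def lookup_reference(key: int) -> str:
    """Reference implementation: hash lookup."""
    return INDEX[key]


def lookup_rewrite(key: int) -> str:
    """Functionally correct rewrite that scans every row."""
    for k, value in ROWS:
        if k == key:
            return value
    raise KeyError(key)


def unit_test_passes(lookup) -> bool:
    """Correctness only: says nothing about speed."""
    return lookup(0) == "row-0" and lookup(N - 1) == f"row-{N - 1}"


def slowdown(candidate, reference, key: int = N - 1, repeats: int = 200) -> float:
    """How many times slower the candidate is than the reference."""
    t_candidate = timeit.timeit(lambda: candidate(key), number=repeats)
    t_reference = timeit.timeit(lambda: reference(key), number=repeats)
    return t_candidate / t_reference


MAX_SLOWDOWN = 10.0  # hypothetical performance budget


def benchmark_passes(candidate, reference) -> bool:
    return slowdown(candidate, reference) <= MAX_SLOWDOWN
```

Here `unit_test_passes(lookup_rewrite)` is true, while `benchmark_passes(lookup_rewrite, lookup_reference)` is expected to be false on any ordinary machine, since the rewrite does thousands of comparisons per lookup.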

All the optimism about using AI for coding, Smiley argues, comes from measuring the wrong things.

“Coding works if you measure lines of code and pull requests,” he said. “Coding does not work if you measure quality and team performance. There’s no evidence to suggest that that’s moving in a positive direction.”

  • python@lemmy.world
    2 days ago

    Recently had to call out a coworker for vibecoding all her unit tests. How did I know they were vibe coded? None of the tests had an assertion, so they literally couldn’t fail.

    • ch00f@lemmy.world
      2 days ago

      Vibe coding guy wrote unit tests for our embedded project. Of course, the hardware peripherals aren’t available for unit tests on the dev machine/build server, so you sometimes have to write mock versions (like an “adc” function that just returns predetermined values in the format of the real analog-digital converter).

      Claude wrote the tests and mock hardware so well that it forgot to include any actual code from the project. The test cases were just testing the mock hardware.
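The failure mode described above, sketched in Python for brevity rather than in the embedded project's own language. Everything here is hypothetical: `MockAdc` stands in for the mock converter, and `average_millivolts` stands in for real project code, assuming a 12-bit ADC with a 3300 mV reference.

```python
class MockAdc:
    """Stand-in for the hardware ADC: returns predetermined raw samples."""

    def __init__(self, samples):
        self._samples = iter(samples)

    def read(self) -> int:
        return next(self._samples)


def average_millivolts(adc, count: int, vref_mv: int = 3300, bits: int = 12) -> float:
    """Project code under test: average `count` raw readings, convert to mV."""
    total = sum(adc.read() for _ in range(count))
    return (total / count) * vref_mv / (2**bits - 1)


def test_only_exercises_mock():
    # Anti-pattern: this proves the mock returns what it was told to return.
    # No project code is called, so no project bug can make it fail.
    adc = MockAdc([100, 200])
    assert adc.read() == 100
    assert adc.read() == 200


def test_exercises_project_code():
    # The mock is only the input; the assertion is on the project function.
    adc = MockAdc([0, 4095])
    assert abs(average_millivolts(adc, count=2) - 1650.0) < 1e-9
```

A quick review check is whether each test calls at least one function that lives in the project rather than in the test support code.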

      • 87Six@lemmy.zip
        2 days ago

        Not realizing that should be an instant firing. The dev didn’t even glance at the unit tests…

    • nutsack@lemmy.dbzer0.com
      2 days ago

      if you reject her pull requests, does she fix it? is there a way for management to see when an employee is pushing bad commits more frequently than usual?

    • urandom@lemmy.world
      1 day ago

      That’s weird. I’ve made it write a few tests once, and it pretty much made them in the style of other tests in the repo. And they did have assertions.

      • clif@lemmy.world
        1 day ago

        My company is pushing LLM code assistants REALLY hard (like, you WILL use it but we’re supposedly not flagging you for termination if you don’t… yet). My experience is the same as yours - unit tests are one of the places where it actually seems to do pretty good. It’s definitely not 100%, but in general it’s not bad and does seem to save some time in this particular area.

        That said, I did just remove a test that it created that verified that IMPORTED_CONSTANT is equal to localUnitTestConstantWithSameHardcodedValueAsImportedConstant. It passed ; )
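The tautological test described above looks roughly like this. `MAX_RETRIES` and `should_retry` are hypothetical stand-ins for the imported constant and for code that depends on it:

```python
MAX_RETRIES = 3  # stands in for IMPORTED_CONSTANT


def should_retry(attempt: int) -> bool:
    """Hypothetical project code whose behavior depends on the constant."""
    return attempt < MAX_RETRIES


def test_constant_tautology():
    # Anti-pattern: re-states the value. It fails only when someone edits
    # the constant on purpose, and it verifies no behavior.
    local_copy_of_constant = 3
    assert MAX_RETRIES == local_copy_of_constant


def test_retry_boundary():
    # Checks behavior at the boundary the constant defines.
    assert should_retry(MAX_RETRIES - 1)
    assert not should_retry(MAX_RETRIES)
```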

      • rumba@lemmy.zip
        1 day ago

        Trust with verification. I’ve had it do everything right, and I’ve had it do things so incredibly stupid that even a cursory glance at the code would be more than enough to /clear and start over.

        Claude Code is capable of producing code and unit tests, but it doesn’t always get it right. It’s smart enough that it will keep trying until it gets the result, but if you start running low on context it’ll start getting worse at it.

        I wouldn’t have it contribute a lot of code AND unit tests in the same session. New session: read this code and make unit tests. New session: read these unit tests and give me advice on any problems or edge cases that might be missed.

        To be fair, if you’re not reading what it’s doing and guiding it, you’re fucking up.

        I think it’s better as a second set of eyes than a software architect.

        • urandom@lemmy.world
          1 day ago

          I think it’s better as a second set of eyes than a software architect.

          A rubber ducky that talks back is also a good analogy for me.

          I wouldn’t have it contribute a lot of code

          Yeah, I tried that once, for a tedious refactoring. It would’ve been faster if I did it myself tbh. Telling it to do small tedious things, and keeping the interesting things for yourself (cause why would you deprive yourself of that …) is currently where I stand with this tool

          • rumba@lemmy.zip
            9 hours ago

            and keeping the interesting things for yourself (cause why would you deprive yourself of that …

            I fear that will be required at some point. It’s not always good at writing code, but it does well enough that it can turn a seasoned developer into a manager. :/

    • melfie@lemy.lol
      1 day ago

      Yeah, it’s a bad idea to let AI write both the code and the tests. If nothing else, at least review the tests more carefully than everything else and also do some manual testing. I won’t normally approve a PR unless it has a description of how it was tested with preferably some screenshots or log snippets.