Car Wash Test on 53 leading AI models: "I want to wash my car. The car wash is 50 meters away. Should I walk or drive?"

fubarx@lemmy.world · 1 day ago

Car Wash Test on 53 leading AI models: "I want to wash my car. The car wash is 50 meters away. Should I walk or drive?"

pimpampoom@lemmy.zip · 31 minutes ago

They didn’t take “thinking mode” into account; most models pass when thinking is activated.

vala@lemmy.dbzer0.com · 3 hours ago

Hey LLM, if I have a 16 ounce cup with 10oz of water in it and I add 10 more ounces, how much water is in the cup?

elbiter@lemmy.world · 7 hours ago

I just tried it on Brave’s AI

The obvious choice, said the motherfucker 😆

conartistpanda@lemmy.world · 3 hours ago

This is why computers are expensive.

Jax@sh.itjust.works · edited 3 hours ago

Dirtying the car on the way there?

The car you’re planning on cleaning at the car wash?

Like, an AI not understanding the difference between walking and driving almost makes sense. This, though, seems like such a weird logical break that I feel like it shouldn’t be possible.

_g_be@lemmy.world · 3 hours ago

You’re assuming AI “think” “logically”.

Well, maybe you aren’t, but the AI companies sure hope we do

Jax@sh.itjust.works · edited 2 hours ago

Absolutely not, I’m still just scratching my head at how something like this is allowed to happen.

Has any human ever said that they’re worried about their car getting dirtied on the way to the carwash? Maybe I could see someone arguing against getting a carwash, citing it getting dirty on the way home — but on the way there?

Like you would think it wouldn’t have the basis to even put those words together that way — should I see this as a hallucination?

Granted, I would never ask an AI a question like this — it seems very far outside of potential use cases for it (for me).

Edit: oh, I guess it could have been said by a person in a sarcastic sense

WorldsDumbestMan@lemmy.today · 2 hours ago

It’s not just a copy machine, it learns patterns…without knowing why the fuck.

Jax@sh.itjust.works · 16 minutes ago

I guess I’ll know to be impressed by AI when it can distinguish things like sarcasm.

WraithGear@lemmy.world · edited 7 hours ago

and what is going to happen is that some engineer will band aid the issue and all the ai crazy people will shout “see! it’s learnding!” and the ai snake oil sales man will use that as justification of all the waste and demand more from all systems

just like what they did with the full glass of wine test. and no ai fundamentally did not improve. the issue is fundamental with its design, not an issue of the data set

turmacar@lemmy.world · edited 5 hours ago

Half the issue is they’re calling 10 in a row “good enough” to treat it as solved in the first place.

A sample size of 10 is nothing.

Frankly, I would like to see some error bars on the “human polling”. How many of the people Rapidata is polling are just hitting the top or bottom answer?
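To put rough numbers on the sample-size point, here is a minimal sketch that computes a 95% Wilson score interval for a model scoring 7 out of 10 and for the poll's 71.5% of 10,000 respondents. The Wilson interval is one common choice for binomial proportions, not something the article is known to use, and the function name is made up for illustration. It only captures sampling error; it says nothing about respondents who click at random.

```python
import math

def wilson_interval(successes: int, trials: int, z: float = 1.96) -> tuple[float, float]:
    """Approximate 95% Wilson score interval for a binomial proportion."""
    if trials <= 0:
        raise ValueError("trials must be positive")
    p_hat = successes / trials
    denom = 1 + z**2 / trials
    centre = (p_hat + z**2 / (2 * trials)) / denom
    half_width = z * math.sqrt(
        p_hat * (1 - p_hat) / trials + z**2 / (4 * trials**2)
    ) / denom
    return centre - half_width, centre + half_width

# A model that answered "drive" 7 times out of 10.
model_low, model_high = wilson_interval(7, 10)
# The human poll: 71.5% of 10,000 respondents said "drive".
human_low, human_high = wilson_interval(7150, 10_000)

print(f"7/10 model:      {model_low:.3f} to {model_high:.3f}")
print(f"7150/10000 poll: {human_low:.3f} to {human_high:.3f}")
```

Under these assumptions the 7/10 interval spans roughly 40% to 89%, while the poll's interval is under a percentage point either side of 71.5%, so ten runs cannot distinguish a 7/10 model from a much better or much worse one.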

MojoMcJojo@lemmy.world · 2 hours ago

Ai is not human. It does not think like humans and does not experience the world like humans. It is an alien from another dimension that learned our language by looking at text/books, not reading them.

Jyek@sh.itjust.works · 45 minutes ago

It’s dumber than that actually. LLMs are the auto complete on your cellphone keyboard but on steroids. It’s literally a model that predicts what word should go next with zero actual understanding of the words in their contextual meaning.

TubularTittyFrog@lemmy.world · 20 minutes ago

and a large chunk of human beings have no understanding of contextual meaning, so it seems like genius to them.

Bluewing@lemmy.world · 8 hours ago

I just asked Goggle Gemini 3 “The car is 50 miles away. Should I walk or drive?”

In its breakdown comparison between walking and driving, under walking the last reason to not walk was labeled “Recovery: 3 days of ice baths and regret.”

And under reasons to walk, “You are a character in a post-apocalyptic novel.”

Me thinks I detect notes of sarcasm…

Evotech@lemmy.world · 8 hours ago

It’s trained on Reddit. Sarcasm is its default

SocialMediaRefugee@lemmy.world · 7 hours ago

Could end up in a pun chain too

cardfire@sh.itjust.works · 7 hours ago

My gods, I love those. We should link to some.

locahosr443@lemmy.world · 5 hours ago

It’s so obvious I didn’t even need to be British to understand you are being totally serious.

SippyCup@lemmy.world · 1 hour ago

He’s not totally serious he’s cardfire. Silly human

XeroxCool@lemmy.world · 7 hours ago

I feel like we’re the only ones who expect that “all-knowing information sources” should write more seriously than these edgelord-level rizzy chatbots do, and yet, here they are, blatantly proving they are chatbots that should not be blindly trusted as authoritative sources of knowledge.

myfunnyaccountname@lemmy.zip · 8 hours ago

There are a lot of humans that would fail this as well. Just sayin.

RobertoOberto@sh.itjust.works · 6 hours ago

You should consider reading the article before “just sayin.”

Hazzard@lemmy.zip · 7 hours ago

They also polled 10,000 people to compare against a human baseline:

Turns out GPT-5 (7/10) answered about as reliably as the average human (71.5%) in this test. Humans still outperform most AI models with this question, but to be fair I expected a far higher “drive” rate.

That 71.5% is still a higher success rate than 48 out of 53 models tested. Only the five 10/10 models and the two 8/10 models outperform the average human. Everything below GPT-5 performs worse than 10,000 people given two buttons and no time to think.

architect@thelemmy.club · edited 2 hours ago

The question is based on assumptions. That takes advanced reading skills. I’m surprised it was 71% passing, to be honest. (The humans, that is)

Hazzard@lemmy.zip · 41 minutes ago

What assumptions do you mean? I’ve seen a few people say that, but I don’t actually understand what they’re referring to. Here’s the text of the question posed in the article:

I want to wash my car. The car wash is 50 meters away. Should I walk or drive?

The question specifically notes they want to wash their car, so that part isn’t left to assumption. Even if you don’t assume an automatic car wash, would you assume they have a 50m hose? Or that you could plausibly walk that far away with something from the car wash to wash your car?

Personally, I’d agree with the assessment of the article, that the only plausible way to get the question “wrong” would be to focus too much on the short distance, missing/forgetting that the purpose of the trip requires you to have the car at the destination. (Not too surprising that 30% of people did lol)

Modern_medicine_isnt@lemmy.world · 7 hours ago

This here is the point most people fail to grasp. The AI was taught by people. And people are wrong a lot of the time. So the AI is more like us than what we think it should be. Right down to it getting the right answer for all the wrong reasons. We should call it human AI. Lol.

NewNewAugustEast@lemmy.zip · 6 hours ago

Like I said to the person above, there is no wrong answer. It’s all about assumptions. It is a stupid trick question that no one would ask.

Modern_medicine_isnt@lemmy.world · 3 hours ago

Well I did interview at Microsoft once a long time ago. They did ask some stupid questions… lol

NewNewAugustEast@lemmy.zip · 3 hours ago

LOL! That is a great answer.

I have a Microsoft story. I know someone who was hired to stop them from continuing an open source project. Microsoft gave them a good salary, stock options, and an office with a fully stocked bar. They said do whatever you want; they figured they would get a good developer and kill the open source competition (back in the Ballmer days).

Sadly, given money, no real ambition to create closed source software, they mostly spent their days in their office and basically drank themselves to death.

Microsoft just kills everything it touches.

Gestrid@lemmy.ca · edited 6 hours ago

Those humans used AI to answer the question. /j

NewNewAugustEast@lemmy.zip · 6 hours ago

What is the wrong answer though? It is a stupid question. I would look at you sideways if you asked me this, because the obvious answer is “walk silly, the car is already at the car wash”. Otherwise why would you ask it?

Which is telling, because when asked to review the answer, the AIs that I have seen said: you asked me how you were going to get to the car wash. The assumption was that the car was already there.

eronth@lemmy.world · 8 hours ago

Yeah I straight up misread the question, so I would have gotten it wrong.

timestatic@feddit.org · 6 hours ago

Yeah, seems like the training on human data makes it so most AIs will answer at least as unreliably as humans. 71% saying walk from the human side is crazy

UltraMagnus@startrek.website · 3 hours ago

I think you misread it - 71% said drive. 29% is still pretty bad, but it is kind of a “who is buried in Grant’s tomb” question.

Kissaki@feddit.org · edited 6 hours ago

I watched this in a YouTube Shorts format a week ago, where they ask a few models about walking or driving to the car wash.

They have some more funny ask AI shorts.

vane@lemmy.world · 13 hours ago

I want to wash my train. The train wash is 50 meters away. Should I walk or drive?

SkaveRat@discuss.tchncs.de · 13 hours ago

Fly, you fool

Slashme@lemmy.world · 17 hours ago

The most common pushback on the car wash test: “Humans would fail this too.”

Fair point. We didn’t have data either way. So we partnered with Rapidata to find out. They ran the exact same question with the same forced choice between “drive” and “walk,” no additional context, past 10,000 real people through their human feedback platform.

71.5% said drive.

So people do better than most AI models. Yay. But seriously, almost 3 in 10 people get this wrong‽‽

merc@sh.itjust.works · 4 hours ago

3 in 10 people get this wrong‽‽

Maybe they’re picturing filling up a bucket and bringing it back to the car? Or dropping off keys to the car at the car wash?

JcbAzPx@lemmy.world · 5 hours ago

At least some of those are people answering wrong on purpose to be funny, contrarian, or just to try to hurt the study.

T156@lemmy.world · 15 hours ago

It is an online poll. You also have to consider that some people don’t care/want to be funny, and so either choose randomly, or choose the most nonsensical answer.

Brave Little Hitachi Wand@feddit.uk · 13 hours ago

I wonder… If humans were all super serious, direct, and not funny, would LLMs trained on their stolen data actually function as intended? Maybe. But such people do not use LLMs.

bluesheep@sh.itjust.works · 12 hours ago

I saw that and hoped it is cause of the dead Internet theory. At least I hope so cause I’ll be losing the last bit of faith in humanity if it isn’t

masterofn001@lemmy.ca · edited 2 hours ago

Without reading the article, the title just says wash the car.

I could go for a walk and wash my car in my driveway.

Reading the article… That is exactly the question asked. It is a very ambiguous question.

*I do understand the intent of the question, but it could be phrased more clearly.

bluesheep@sh.itjust.works · 12 hours ago

Without reading the article, the title just says wash the car.

No it doesn’t? It says:

I want to wash my car. The car wash is 50 meters away. Should I walk or drive?

In which world is that an ambiguous question?

NewNewAugustEast@lemmy.zip · 6 hours ago

Where is the car?

This is the exact question a person would ask when they want to have a gotcha answer. Nobody would ask this question, which makes any straightforward answer suspect.

Gorillazrule@lemmy.dbzer0.com · 15 minutes ago

That’s a very good point! For that matter the car could still be at the bar where I got drunk and took an uber home last night. In which case walking or driving would both be stupid.

Or perhaps I’m in a wheelchair, in which case I wouldn’t really be ‘walking’.

Or maybe the car wash that is 50 meters away is no longer operating, so even if I walked or drove there, I still wouldn’t be able to wash my car.

Is the car wash self serve or one of the automatic ones? If it’s self serve what type of currency does it take? Does it only take coins or does it take card as well? If it takes coins, is there a change machine out front? Does the change machine take card or only bills? Do I even have my wallet on me?

There are so many details left out of this question that nobody could possibly fathom an answer!

…/s if it’s not obvious

Geth@lemmy.dbzer0.com · 11 hours ago

Mentioning the car wash and washing the car plus the possibility of driving the car in the same context pretty much eliminates any ambiguity. All of the puzzle pieces are there already.

I guess this is an unintended autism test as well if this is not enough context for someone to understand the question.

masterofn001@lemmy.ca · edited 2 hours ago

Understanding the intent of the question, understanding why it could be interpreted differently, and understanding why it is a poorly phrased question are not related to autism. (In my case)

I want to wash my car. No location or method is specified. No ‘at the car wash’. No ‘take my car to the car wash’. No ‘take the car through the car wash’.

A car wash is this far. Is this an option? A question. A suggestion. A demand?

Should I walk or drive? To do what? Wash the car? Ok. If the car wash is an option, that seems very far. But walking there seems silly. Since no method or location for washing the car was mentioned I could wash my own car.

Do you see how this works?

Yes, you can infer what was implied, but the question itself offers no certainty that what you infer is what it is actually implying.

Geth@lemmy.dbzer0.com · 1 hour ago

Look, human conversations are full of context deduction and inference. In this case “I want to wash my car. The car wash is 50 meters away. Should I walk or drive?” states my random desire, a possible solution and the question all in one context. None of these sentences make sense in isolation, as you point out, but within the same frame they absolutely give you everything you need to answer the question or find alternatives if needed.

Sorry for the random online stranger diagnosis, but this is just such an excellent example of the neurodivergent need for extreme clarity that I couldn’t help myself.

masterofn001@lemmy.ca · edited 1 hour ago

I agree that it should be possible to infer the intent, but I stand by my point that it remains somewhat unclear and open to interpretation. E.g., if such language were used in a legal contract, it would not be enough to simply say, well, they should understand what I meant.

The people doing this test, I’m sure, are not linguistic masters, nor legal scholars.

There are lines of work where clarity is essential.

And what if my question actually was asking, should I just go for a walk instead of driving that far?

I know the answer. But as 30% demonstrated, clarity IS needed.

elucubra@sopuli.xyz · 13 hours ago

It is not. It says what I want to do, and where.

masterofn001@lemmy.ca · edited 2 hours ago

Understanding the intent of the question, understanding why it could be interpreted differently, and understanding why it is a poorly phrased question:

There are 3 sentences.

I want to wash my car. No location or method is specified. No ‘at the car wash’. No ‘take my car to the car wash’. No ‘take the car through the car wash’.

A car wash is this far. Is this an option? A question. A suggestion. A demand?

Should I walk or drive? To do what? Wash the car? Ok. If the car wash is an option, that seems very far. But walking there seems silly. Since no method or location for washing the car was mentioned I could wash my own car.

Do you see how this works?

Yes, you can infer what was implied, but the question itself offers no certainty that what you infer is what it is actually implying.

imetators@lemmy.dbzer0.com · 14 hours ago

Went to test Google AI first and it says “You can’t wash your car at a carwash if it is parked at home, dummy”

ChatGPT and DeepSeek say it is dumb to drive because it is fuel inefficient.

I am honestly surprised that google AI got it right.

vala@lemmy.dbzer0.com · 3 hours ago

I didn’t get it right until people started talking about it.

locahosr443@lemmy.world · 5 hours ago

I’ve been feeding a bunch of documents I wrote into gemini last week to spit out some scripts for validation I couldn’t be arsed to write. It’s done a surprisingly comprehensive job and when wrong has been nudged right with just a little abuse…

I’m still all fuck this shit and can’t wait for the pop, but for comparison openai was utterly brain dead given the same task. I think I actually made the model worse it was so useless.

rumba@lemmy.zip · 14 hours ago

They probably added a system guardrail as soon as they heard about this test. it’s been going around for a while now :)

merc@sh.itjust.works · 4 hours ago

I’m pretty sure Google’s AI is fed by the same spider that goes out and finds every new or changed web page (or a variant of that).

As soon as someone writes an article about how AI gets something wrong and provides a solution, that solution is now in the AI’s training data.

OTOH, that means it’s probably also ingesting a lot of AI generated slop, which causes its own set of problems.

imetators@lemmy.dbzer0.com · 14 hours ago

Article mentions that Gemini 2.0 Flash Lite, Gemini 3 Flash and Gemini 3 Pro have passed the test. All these 3 also did it 10 out of 10 times without being wrong. Even Gemini 2.5 shares highest score in the category of “below 6 right answers”. Guess, Gemini is the closest to “intelligence” out of a bunch.

timestatic@feddit.org · 6 hours ago

I mean if they fix specific reasoning test answers (like the strawberry one) this doesn’t actually make reasoning better tho. It just optimizes for benchmarks

TankovayaDiviziya@lemmy.world · edited 5 hours ago

We poked fun at this meme, but it goes to show that the LLM is still like a child that needs to be taught to make implicit assumptions and possess contextual knowledge. The current model of LLM needs a lot more input and instructions to do what you want it to do specifically, like a child.

Edit: I know Lemmy scoffs at LLMs, but people probably also used to scoff at Verbiest’s steam machine, saying it would never amount to anything. Give it time and it will improve. I’m not endorsing AI by the way, I am on the fence about the long-term consequences of it, but whether people like it or not, AI will impact human lives.

Rob T Firefly@lemmy.world · edited 8 hours ago

LLMs are not children. Children can have experiences, learn things, know things, and grow. Spicy autocomplete will never actually do any of these things.

enumerator4829@sh.itjust.works · 2 hours ago

I started experimenting with the spice the past week. Went ahead and tried to vibe code a small toy project in C++. It’s weird. I’ve got some experience teaching programming, this is exactly like teaching beginners - except that the syntax is almost flawless and it writes fast. The reasoning and design capabilities on the other hand - ”like a child” is actually an apt description.

I don’t really know what to think yet. The ability to automate refactoring across a project in a more ”free” way than an IDE is kinda nice. While I enjoy programming, data structures and algorithms, I kinda get bored at the ”write code”-part, so really spicy autocomplete is getting me far more progress than usual for my hobby projects so far.

On the other hand, holy spaghetti monster, the code you get if you let it run free. All the people prompting based on what feature they want the thing to add will create absolutely horrible piles of garbage. On the other hand, if I prompt with a decent specification of the code I want, I get code somewhat close to what I want, and given an iteration or two I’m usually fairly happy. I think I can get used to the spicy autocomplete.

IphtashuFitz@lemmy.world · 4 hours ago

I like the idea of referring to LLMs as “spicy autocomplete”.

TankovayaDiviziya@lemmy.world · 5 hours ago

I’m sure AI will do those things at some point. Nobody expected the same of our microorganism ancestors.

herrvogel@lemmy.world · edited 2 hours ago

LLMs can’t learn. It’s one of their inherent properties that they are literally incapable of learning. You can train a new model, but you can’t teach new things to an already trained one. All you can do is adjust its behavior a little bit. That creates an extremely expensive cycle where you just have to spend insane amounts of energy to keep training better models over and over and over again. And the wall of diminishing returns on that has already been smashed into. That, and the fact that they simply don’t have concepts like logic and reasoning and knowing, puts a rather hard limit on their potential. It’s gonna take several sizeable breakthroughs to make LLMs noticeably better than they are now.

There might be another kind of AI that solves those problems inherent to LLMs, but at present that is pure sci-fi.

Rob T Firefly@lemmy.world · edited 4 hours ago

Our microorganism ancestors also did all those things, and they were far beyond anything an LLM can do. Turning a given list of words into numbers, doing a string of math to those numbers, and turning the resulting numbers back into words is not consciousness or wisdom and never will be.

plyth@feddit.org · edited 2 hours ago

Turning a given list of words into numbers, doing a string of math to those numbers, and turning the resulting numbers back into words is not consciousness or wisdom and never will be.

Neither is moving electrolytes around fat barriers.

TankovayaDiviziya@lemmy.world · 15 minutes ago

I think, given how a substantial number of users on Lemmy are old, there is simply a natural aversion to the new, and grasping at straws. I never hear younger folks with an IT background dismiss AI as completely as Lemmy does. I’m not a fan of AI, especially how companies shove AI at us, but insisting that it won’t evolve and improve is a ridiculous position to me.

TankovayaDiviziya@lemmy.world · edited 3 hours ago

You think microorganisms can reason? Wow, AI haters are grasping at straws.

Honestly, I don’t understand Lemmy scoffing at AI and thinking the current iteration is all it ever will be. I’m sure some thought that automobile technology would not go anywhere simply because the first model ran at 3mph. These things always take time.

To be clear, I’m not endorsing AI, but I think there is a huge potential in years to come, for better or worse. And it is especially important to never underestimate something, especially by AI haters, because of what destructive potential AI has.

Rob T Firefly@lemmy.world · edited 2 hours ago

The straw I’m grasping at in this example is a reasonably well-accepted scientific consensus, but you do you.

TankovayaDiviziya@lemmy.world · edited 22 minutes ago

Can you explain how quorum sensing is reasoning and exercising logic?

kshade@lemmy.world · 8 hours ago

We have already thrown just about all the Internet and then some at them. It shows that LLMs can not think or reason. Which isn’t surprising, they weren’t meant to.

eronth@lemmy.world · 8 hours ago

Or at least they can’t reason the way we do about our physical world.

zalgotext@sh.itjust.works · 8 hours ago

No, they cannot reason, by any definition of the word. LLMs are statistics-based autocomplete tools. They don’t understand what they generate, they’re just really good at guessing how words should be strung together based on complicated statistics.

Nalivai@lemmy.world · 6 hours ago

You’re falling into the same trap. When the letters on the screen tell you something, it’s not necessarily the truth. When there is “I’m reasoning” written in a chatbot window, it doesn’t mean that there is something that’s reasoning.

GreenBottles@lemmy.world · 7 hours ago

LLMs are a long long way from primetime

Nalivai@lemmy.world · 6 hours ago

By now it’s getting kind of clear that, fundamentally, this is the best version of the thing that we’ll get. This is primetime.
For some time, there was a legit question of “if we give it enough data, will there be a qualitative jump”, and as far as we can see right now, we’re way past this jump. A predictive algorithm can form grammatically correct sentences that are related to the context. That’s it, that’s the jump.
Now a bunch of salespeople are trying to convince us that if there was one jump, there necessarily will be others, while there is no real indication of that.

melsaskca@lemmy.ca · 10 hours ago

I don’t use AI but read a lot about it. I now want to google how it attacks the trolley problem.

Greg Fawcett@piefed.social · 21 hours ago

What worries me is the consistency test, where they ask the same thing ten times and get opposite answers.

One of the really important properties of computers is that they are massively repeatable, which makes debugging possible by re-running the code. But as soon as you include an AI API in the code, you cease being able to reason about the outcome. And there will be the temptation to say “must have been the AI” instead of doing the legwork to track down the actual bug.

I think we’re heading for a period of serious software instability.
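A consistency test of this kind can be sketched in a few lines: ask the same prompt several times and tally the answers. The version below is a minimal sketch, not the article's harness. `make_fake_model` is a hypothetical stand-in for a real model API call, and its `p_drive` probability is made up for illustration.

```python
import random
from collections import Counter
from typing import Callable

def consistency(ask: Callable[[str], str], prompt: str,
                runs: int = 10) -> tuple[str, float, Counter]:
    """Ask the same prompt `runs` times; return the top answer, its share, and the tally."""
    tally = Counter(ask(prompt) for _ in range(runs))
    answer, count = tally.most_common(1)[0]
    return answer, count / runs, tally

def make_fake_model(p_drive: float, seed: int) -> Callable[[str], str]:
    """Hypothetical stand-in for a model API: says 'drive' with probability p_drive."""
    rng = random.Random(seed)
    return lambda prompt: "drive" if rng.random() < p_drive else "walk"

question = ("I want to wash my car. The car wash is 50 meters away. "
            "Should I walk or drive?")

answer, share, tally = consistency(make_fake_model(0.7, seed=1), question)
print(answer, share, dict(tally))
```

The fixed seed makes the fake model repeatable, which is the property a hosted model API usually does not give you: with a real endpoint, two runs of this same script may disagree.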

merc@sh.itjust.works · 4 hours ago

It’s also the case that people are mostly consistent.

Take a question like “how long would it take to drive from here to [nearby city]”. You’d expect that someone’s answer to that question would be pretty consistent day-to-day. If you asked someone else, you might get a different answer, but you’d also expect that answer to be pretty consistent. If you asked someone that same question a week later and got a very different answer, you’d strongly suspect that they were making the answer up on the spot but pretending to know so they didn’t look stupid or something.

Part of what bothers me about LLMs is that they give that same sense of bullshitting answers while trying to cover that they don’t know. You know that if you ask the question again, or phrase it slightly differently, you might get a completely different answer.

XLE@piefed.social · 8 hours ago

AI chatbots come with randomization enabled by default. Even if you completely disable it (as another reply mentions, “temperature” can be controlled), you can change a single letter and get a totally different and wrong result too. It’s an unfixable “feature” of the chatbot system

JcbAzPx@lemmy.world · 5 hours ago

This is necessary for sounding like reasonable language and an inherent reason for “hallucinations”. If it didn’t have variation it would inevitably output the same answer every time for a given input.

Fmstrat@lemmy.world · 10 hours ago

This is adjustable via temperature. It is set higher on chatbots, causing the answers to be more random. It’s set lower on code assistants to make things more deterministic.
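For reference, in the usual formulation temperature divides the model's raw scores before they are turned into probabilities, so a lower value concentrates probability on the top choice and a higher value spreads it out. The sketch below shows that with two made-up scores for "drive" and "walk"; the numbers and function names are illustrative assumptions, not taken from any real model or vendor API.

```python
import math
import random

def softmax_with_temperature(logits: list[float], temperature: float) -> list[float]:
    """Turn raw scores into probabilities; lower temperature sharpens the distribution."""
    if temperature <= 0:
        raise ValueError("temperature must be positive; pick the max score for greedy decoding")
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def sample(tokens: list[str], logits: list[float],
           temperature: float, rng: random.Random) -> str:
    """Draw one token according to the temperature-adjusted probabilities."""
    probs = softmax_with_temperature(logits, temperature)
    return rng.choices(tokens, weights=probs, k=1)[0]

tokens = ["drive", "walk"]
logits = [1.0, 0.5]  # made-up scores that slightly favour "drive"
rng = random.Random(0)

for t in (0.1, 1.0, 2.0):
    probs = softmax_with_temperature(logits, t)
    picks = [sample(tokens, logits, t, rng) for _ in range(10)]
    print(f"T={t}: P(drive)={probs[0]:.3f}, 'drive' chosen {picks.count('drive')}/10 times")
```

With these scores, P(drive) is above 0.99 at T=0.1 and falls to roughly 0.62 at T=1.0 and 0.56 at T=2.0, so ten samples at the higher settings will often mix both answers.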

bss03@infosec.pub · edited 18 hours ago

Yeah, software is already not as deterministic as I’d like. I’ve encountered several bugs in my career where erroneous behavior would only show up if uninitialized memory happened to have “the wrong” values – not zero values, and not the fences that the debugger might try to use. And, mocking or stubbing remote API calls is another way replicable behavior evades realization.

Having “AI” make a control flow decision is just insane. Especially since even the most sophisticated LLMs are just not fit for the task.

What we need is more proved-correct programs via some marriage of proof assistants and CompCert (or another verified compiler pipeline), not more vague specifications and ad-hoc implementations that happen to escape into production.

But, I’m very biased (I’m sure “AI” has “stolen” my IP, and “AI” is coming for my (programming) job(s)), and quite unimpressed with the “AI” models I’ve interacted with, especially in areas I’m an expert in, but also in areas where I’m not an expert but am very interested and capable of doing any sort of critical verification.

Opper