AI agents wrong ~70% of time: Carnegie Mellon study

Jaden Norman@lemmy.world · 2 months ago

TheGrandNagus@lemmy.world · edit-2 2 months ago

LLMs are an interesting tool to fuck around with, but I see things that are hilariously wrong often enough to know that they should not be used for anything serious. Shit, they probably shouldn’t be used for most things that are not serious either.

It’s a shame that, because the same “AI” name is applied to a whole host of different technologies, the limited usability of LLMs (hyped to the moon regardless) is hurting other, more impressive advancements.

For example, speech synthesis is improving so much right now, which has been great for my sister who relies on screen reader software.

Being able to recognise speech in loud environments, or removing background noise from recordings is improving loads too.

My friend is involved in making a mod for Fallout 4, and there was an outreach for people recording voice lines - she says that there are some recordings of dubious quality that would’ve been unusable before that can now be used without issue thanks to AI denoising algorithms. That is genuinely useful!

As are things like pattern/image analysis, which appears very promising in medical analysis.

All of these get branded as “AI”. A layperson might not realise that they are completely different branches of technology, and therefore reject useful applications of “AI” tech, because they’ve learned not to trust anything branded as AI, due to being let down by LLMs.

snooggums@lemmy.world · 2 months ago

LLMs are like a multitool, they can do lots of easy things mostly fine as long as it is not complicated and doesn’t need to be exactly right. But they are being promoted as a whole toolkit as if they are able to be used to do the same work as effectively as a hammer, power drill, table saw, vise, and wrench.

TeddE@lemmy.world · 2 months ago

Because the tech industry hasn’t had a real hit of its favorite poison “private equity” in too long.

The industry has played the same playbook since at least 2006. Likely before, but that’s when I personally started seeing it. My take is that they got addicted to the dotcom bubble and decided they can and should recreate the magic every 3-5 years or so.

This time it’s AI, last it was crypto, and we’ve had web 2.0, 3.0, and a few others I’m likely missing.

But yeah, it’s sold like a panacea every time, when really it’s revolutionary for like a handful of tasks.

rottingleaf@lemmy.world · 2 months ago

That’s because they look like “talking machines” from various sci-fi. Normies feel as if they are touching the very edge of the progress. The rest of our life and the Internet kinda don’t give that feeling anymore.

sugar_in_your_tea@sh.itjust.works · 2 months ago

Exactly! LLMs are useful when used properly, and terrible when not used properly, like any other tool. Here are some things they’re great at:

writer’s block - get something relevant on the page to get ideas flowing
narrowing down keywords for an unfamiliar topic
getting a quick intro to an unfamiliar topic
looking up facts you’re having trouble remembering (i.e. you’ll know it when you see it)

Some things it’s terrible at:

deep research - verify everything an LLM generated if accuracy is at all important
creating important documents/code
anything else where correctness is paramount

I use LLMs a handful of times a week, and pretty much only when I’m stuck and need a kick in a new (hopefully right) direction.

snooggums@lemmy.world · edit-2 2 months ago

narrowing down keywords for an unfamiliar topic

getting a quick intro to an unfamiliar topic

looking up facts you’re having trouble remembering (i.e. you’ll know it when you see it)

I used to be able to use Google and other search engines to do these things before they went to shit in the pursuit of AI integration.

NarrativeBear@lemmy.world · 2 months ago

Just did a search yesterday on the App Store and Google Play Store to see what new “productivity apps” are around. Pretty much every app now has AI somewhere in its name.

Punkie@lemmy.world · 2 months ago

I’d compare LLMs to a junior executive. Probably gets the basic stuff right, but check and verify for anything important or complicated. Break tasks down into easier steps.

zbyte64@awful.systems · edit-2 2 months ago

A junior developer actually learns from doing the job, an LLM only learns when they update the training corpus and develop an updated model.

jumping_redditor@sh.itjust.works · 2 months ago

an llm costs less, and won’t complain when yelled at

zbyte64@awful.systems · 2 months ago

Why would you ever yell at an employee unless you’re bad at managing people? And you think you can manage an LLM better because it doesn’t complain when you’re obviously wrong?

Katana314@lemmy.world · 2 months ago

I’m in a workplace that has tried not to be overbearing about AI, but has encouraged us to use it for coding.

I’ve tried to give mine some very simple tasks like writing a unit test just for the constructor of a class to verify current behavior, and it generates output that’s both wrong and doesn’t verify anything.
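For reference, a constructor-only test of the kind described above is usually just a few assertions that pin down current behavior. A minimal sketch, assuming a made-up `Rectangle` class (hypothetical, not from this thread) and Python's standard `unittest` module:

```python
import unittest


class Rectangle:
    """Hypothetical class under test; a stand-in for whatever class is being verified."""

    def __init__(self, width, height=1):
        if width <= 0 or height <= 0:
            raise ValueError("dimensions must be positive")
        self.width = width
        self.height = height


class RectangleConstructorTest(unittest.TestCase):
    """Pins down what the constructor does today, nothing more."""

    def test_stores_arguments(self):
        rect = Rectangle(3, 4)
        self.assertEqual((rect.width, rect.height), (3, 4))

    def test_default_height(self):
        self.assertEqual(Rectangle(3).height, 1)

    def test_rejects_non_positive_dimensions(self):
        with self.assertRaises(ValueError):
            Rectangle(0, 4)
```

Saved to a file, this runs with `python -m unittest`. Each test fails if the constructor's behavior changes, which is the property the generated tests in the comment above apparently lacked.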

I’m aware it sometimes gets better with more intricate, specific instructions, and that I can offer it further corrections, but at that point it’s not even saving time. I would do this with a human in the hopes that they would continue to retain the knowledge, but I don’t even have hopes for AI to apply those lessons in new contexts. In a way, it’s been a sigh of relief to realize just like Dotcom, just like 3D TVs, just like home smart assistants, it is a bubble.

MangoCats@feddit.it · 2 months ago

The first half dozen times I tried AI for code, across the past year or so, it failed pretty much as you describe.

Finally, I hit on some things it can do. For me: keeping the instructions more general, not specifying certain libraries for instance, was the key to getting something that actually does something. Also, if it doesn’t show you the whole program, get it to show you the whole thing, and make it fix its own mistakes so you can build on working code with later requests.

SocialMediaRefugee@lemmy.world · 2 months ago

I’ve had good results being very specific, like “Generate some python 3 code for me that converts X to Y, recursively through all subdirectories, and converts the files in place.”

MangoCats@feddit.it · 2 months ago

I have been more successful with baby steps like: “Write a python 3 program that converts X to Y.” Tweak prompt until that’s working as desired, then: “make it work recursively through all subdirectories” - and again tweak with specifics like converting the files in place, etc. Always very specific, also - force it to fix its own bugs so you can move forward with a clean example as you add complexity. Complexity seems to cap out at a couple of pages of code, at which point “Ooops, something went wrong.”
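To make the end state of those prompts concrete, here is a rough sketch of the kind of script being asked for. The thread never says what "X" and "Y" are, so `convert` below is a purely hypothetical stand-in (tabs to four spaces), and the `*.txt` pattern is likewise an assumption:

```python
from pathlib import Path


def convert(data: bytes) -> bytes:
    """Hypothetical stand-in for "X to Y": replace each tab with four spaces."""
    return data.replace(b"\t", b"    ")


def convert_tree(root: Path, pattern: str = "*.txt") -> list[Path]:
    """Convert every matching file under root in place; return the files that changed."""
    changed = []
    # rglob walks root and all of its subdirectories.
    for path in sorted(root.rglob(pattern)):
        if not path.is_file():
            continue
        original = path.read_bytes()
        converted = convert(original)
        if converted != original:
            # Overwrite only when the content actually changed.
            path.write_bytes(converted)
            changed.append(path)
    return changed
```

Calling `convert_tree(Path("some_dir"))` rewrites the matching files and returns the ones it touched. The baby-steps approach maps onto this fairly directly: `convert` is the first prompt, and the recursion and in-place write are the later additions.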

jj4211@lemmy.world · 2 months ago

I’ve found that as an ambient code completion facility it’s… interesting, but I don’t know if it’s useful or not…

So on average, it’s totally wrong about 80% of the time, 19% of the time the first line or two is useful (either correct or close enough to fix), and 1% of the time it seems to actually fill in a substantial portion in a roughly acceptable way.

It’s exceedingly frustrating and annoying, but not sure I can call it a net loss in time.

So reviewing the proposal for relevance and cut off and edits adds time to my workflow. Let’s say that on average for a given suggestion I will spend 5% more time determining whether to trash it, use it, or amend it versus not having a suggestion to evaluate in the first place. If the 20% useful time is 500% faster for those scenarios, then I come out ahead overall, though I’m annoyed 80% of the time. My guess as to whether the suggestion is even worth looking at improves: if I’m filling in a pretty boilerplate thing (e.g. taking some variables and starting to write out argument parsing), then it has a high chance of a substantial match. If I’m doing something even vaguely esoteric, I just ignore the suggestions popping up.

However, the 20% is a problem still since I’m maybe too lazy and complacent and spending the 100 milliseconds glancing at one word that looks right in review will sometimes fail me compared to spending 2-3 seconds having to type that same word out by hand.

That 20% success rate allowing for me to fix it up and dispose of most of it works for code completion, but prompt driven tasks seem to be so much worse for me that it is hard to imagine it to be better than the trouble it brings.
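As a rough sanity check on the arithmetic in the comment above, the numbers can be plugged into a small model. This is only a sketch under stated assumptions: typing by hand costs 1.0 unit of time, every suggestion adds the same review overhead, and "500% faster" is read as a 5x speedup. The function names are made up for illustration.

```python
def expected_time(useful_rate: float, review_overhead: float, speedup: float) -> float:
    """Expected time per suggestion, relative to typing by hand (1.0 means no change)."""
    # Useless suggestion: pay the review overhead, then type everything anyway.
    wasted = (1 - useful_rate) * (1 + review_overhead)
    # Useful suggestion: pay the review overhead, then finish `speedup` times faster.
    helped = useful_rate * (review_overhead + 1 / speedup)
    return wasted + helped


def break_even_speedup(useful_rate: float, review_overhead: float) -> float:
    """Speedup the useful cases need for the suggestions to cost no time overall."""
    if review_overhead >= useful_rate:
        raise ValueError("overhead this high can never be recovered")
    return useful_rate / (useful_rate - review_overhead)


print(f"{expected_time(0.20, 0.05, 5.0):.2f}")   # 0.89
print(f"{break_even_speedup(0.20, 0.05):.2f}")   # 1.33
```

Under those assumptions the completions save about 11% of the time, and they break even as long as the useful 20% are at least about 1.33x faster than typing. This ignores the cost of accepting a wrong suggestion that merely looks right, which the comment flags as the real risk.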

Vanilla_PuddinFudge@infosec.pub · 2 months ago

America: “Good enough to handle 911 calls!”

Decq@lemmy.world · 2 months ago

Is there really a plan to use this for 911 services??

jsomae@lemmy.ml · edit-2 2 months ago

I’d just like to point out that, from the perspective of somebody watching AI develop for the past 10 years, completing 30% of automated tasks successfully is pretty good! Ten years ago they could not do this at all. Overlooking all the other issues with AI, I think we are all irritated with the AI hype people for saying things like they can be right 100% of the time – Amazon’s new CEO actually said they would be able to achieve 100% accuracy this year, lmao. But being able to do 30% of tasks successfully is already useful.

MangoCats@feddit.it · 2 months ago

being able to do 30% of tasks successfully is already useful.

If you have a good testing program, it can be.

If you use AI to write the test cases…? I wouldn’t fly on that airplane.

jsomae@lemmy.ml · 2 months ago

obviously

NarrativeBear@lemmy.world · 2 months ago

The ones being implemented into emergency call centers are better though? Right?

TeddE@lemmy.world · 2 months ago

Yes! We’ve gotten them up to 94% wrong at the behest of insurance agencies.

ApeNo1@lemmy.world · 2 months ago

They’ve done studies, you know. 30% of the time, it works every time.

MangoCats@feddit.it · 2 months ago

I ask AI to write simple little programs. One time in three they actually compile without errors. To the credit of the AI, I can feed it the error and about half the time it will fix it. Then, when it compiles and runs without crashing, about one time in three it will actually do what I wanted. To the credit of AI, I can give it revised instructions and about half the time it can fix the program to work as intended.

So, yeah, a lot like interns.
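Taking those rough fractions at face value, the overall hit rate can be worked out directly. A minimal sketch, assuming the two stages (compiling, then doing what was wanted) are independent and each gets exactly one fix attempt; the helper name is made up:

```python
from fractions import Fraction


def success_with_one_retry(first_try: Fraction, retry: Fraction) -> Fraction:
    """Chance a stage passes either immediately or after a single fix attempt."""
    return first_try + (1 - first_try) * retry


one_third = Fraction(1, 3)
one_half = Fraction(1, 2)

compiles = success_with_one_retry(one_third, one_half)  # 2/3
works = success_with_one_retry(one_third, one_half)     # 2/3

overall = compiles * works        # 4/9
no_retry = one_third * one_third  # 1/9

print(overall, f"{float(overall):.0%}")    # 4/9 44%
print(no_retry, f"{float(no_retry):.0%}")  # 1/9 11%
```

So with one round of feedback at each stage, a bit under half of the programs end up working, against roughly one in nine with no feedback at all.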

Log in | Sign up@lemmy.world · edit-2 2 months ago

Wow. 30% accuracy was the high score!
From the article:

Testing agents at the office

For a reality check, CMU researchers have developed a benchmark to evaluate how AI agents perform when given common knowledge work tasks like browsing the web, writing code, running applications, and communicating with coworkers.

They call it TheAgentCompany. It’s a simulation environment designed to mimic a small software firm and its business operations. They did so to help clarify the debate between AI believers who argue that the majority of human labor can be automated and AI skeptics who see such claims as part of a gigantic AI grift.

The CMU boffins put the following models through their paces and evaluated them based on the task success rates. The results were underwhelming.

⚫ Gemini-2.5-Pro (30.3 percent)
⚫ Claude-3.7-Sonnet (26.3 percent)
⚫ Claude-3.5-Sonnet (24 percent)
⚫ Gemini-2.0-Flash (11.4 percent)
⚫ GPT-4o (8.6 percent)
⚫ o3-mini (4.0 percent)
⚫ Gemini-1.5-Pro (3.4 percent)
⚫ Amazon-Nova-Pro-v1 (1.7 percent)
⚫ Llama-3.1-405b (7.4 percent)
⚫ Llama-3.3-70b (6.9 percent)
⚫ Qwen-2.5-72b (5.7 percent)
⚫ Llama-3.1-70b (1.7 percent)
⚫ Qwen-2-72b (1.1 percent)

“We find in experiments that the best-performing model, Gemini 2.5 Pro, was able to autonomously perform 30.3 percent of the provided tests to completion, and achieve a score of 39.3 percent on our metric that provides extra credit for partially completed tasks,” the authors state in their paper.

lepinkainen@lemmy.world · 2 months ago

Wrong 70% doing what?

I’ve used LLMs as a Stack Overflow / MSDN replacement for over a year and if they fucked up 7/10 questions I’d stop.

Same with code, any free model can easily generate simple scripts and utilities with maybe 10% error rate, definitely not 70%

vane@lemmy.world · 2 months ago

Reading with CEO mindset. 3 out of 10 employees can be fired.

kinsnik@lemmy.world · 2 months ago

I haven’t used AI agents yet, but my job is kinda pushing for them. but i have used the google one that creates audio podcasts, just to play around, since my coworkers were using it to “learn” new things. i fed it some of my own writing and created the podcast. it was fun, it was an audio overview of what i wrote. about 80% was cool analysis, but 20% was straight out of nowhere bullshit (which i know because I wrote the original texts that the audio was talking about). i can’t believe that people are using this for subjects that they have no knowledge of. it is a fun toy for a few minutes (which is not worth the cost to the environment anyway)

FenderStratocaster@lemmy.world · 2 months ago

I tried to order food at a Taco Bell drive through the other day and they had an AI thing taking your order. I was so frustrated that I couldn’t order something that was on the menu that I just drove to the window instead. The guy that worked there was more interested in lecturing me on how I need to order. I just said forget it and drove off.

If you want to use AI, I’m not going to use your services or products unless I’m forced to. Looking at you Xfinity.

Frenezul0_o@lemmy.world · 2 months ago

I notice that the research didn’t include DeepSeek. It would have been nice to see how it compares.

iopq@lemmy.world · 2 months ago

Now I’m curious, what’s the average score for humans?

gargle@lemmy.world · 2 months ago

I asked Claude 3.5 Haiku to write me a quine in COBOL in the bs2000 dialect. Claude does know that creating a perfect quine in COBOL is challenging due to the need to represent the self-referential nature of the code. After a few suggestions Claude restated its first draft, without proper BS2000 incantations, without a perform statement, and without any self-referential redefines. It’s a lot of work. I stopped caring and moved on.

For those who wonder: https://sourceforge.net/p/gnucobol/discussion/lounge/thread/495d8008/ has an example.

Colour me unimpressed. I dread the day when they force the use of ‘AI’ on us at work.

SocialMediaRefugee@lemmy.world · edit-2 2 months ago

I use it for very specific tasks and give as much information as possible. I usually have to give it more feedback to get to the desired goal. For instance I will ask it how to resolve an error message. I’ve even asked it for some short python code. I almost always get good feedback when doing that. Asking it about basic facts works too like science questions.

One thing I have had problems with is if the error is sort of an oddball it will give me suggestions that don’t work with my OS/app version even though I gave it that info. Then I give it feedback and eventually it will loop back to its original suggestions, so it couldn’t come up with an answer.

I’ve also found differences in chatgpt vs MS copilot, with chatgpt usually giving better results.