Carnegie Mellon study exposes gaps in AI's self-awareness
For the average person, artificial intelligence chatbots are a quick and easy way to find a trivia fact, a homemade cookie recipe or even sample emails to send to your boss. But as this technology becomes widespread — infiltrating jobs and government offices — many people do not realize just how overconfident AI can be.
Carnegie Mellon University has been at the center of AI research and implementation in Pennsylvania, publishing research that reveals a gap in AI’s “metacognition,” which is the ability to reflect on its own thinking.
Lead author Trent Cash, a doctoral student in CMU’s departments of Psychology and Social and Decision Sciences, and co-author Daniel Oppenheimer, a professor in CMU’s Department of Social and Decision Sciences, found that chatbots, or large language models (LLMs), present information as though it were fact, even when it is misleading or incorrect.
“I think in the language that LLMs use they sound very confident,” Cash said. “They always use this kind of authoritative, friendly tone that’s like, ‘Yeah, I’m an expert in this. I know what I’m talking about,’ but that doesn’t line up with what we’re finding.”
Internal experience
For the research, they conducted five studies using different AI models, including OpenAI’s ChatGPT, Google’s Gemini, and two versions of Anthropic’s Claude — Sonnet and Haiku.
The studies compared the confidence of AI models and humans in their ability to answer trivia questions, predict the outcomes of NFL games and Academy Award ceremonies, and play an image identification game.
LLMs are not designed to answer everything users are throwing at them on a daily basis, Oppenheimer said in a news release.
After guessing, humans tended to recalibrate their sense of how well they had done, while the AI programs believed they had performed better than they actually did.
“[Humans] use lots of different factors in the environment, but one of the key things that humans use is the internal feeling that you get when something feels easy or hard,” he said. “LLMs are information-processing machines. They can process so much more information than we can, and they can build those confidence judgments off of that massive pile of data, but they’re still lacking that internal experience.”
Harmful hallucinations
So, if both human and chatbot predictions can be inaccurate, what makes AI’s overconfidence alarming?
“I think the biggest issue that most researchers are concerned about is that LLMs engage in what we refer to as hallucinations, where they will totally make up information that doesn’t exist,” Cash said.
Oppenheimer said that while users might not be skeptical when AI produces information that “seems a bit fishy,” they should be, because AI delivers its answers with unwarranted confidence.
In March, Gov. Josh Shapiro, along with OpenAI, the artificial intelligence organization that created ChatGPT, and CMU, announced the results of phase one of a generative AI pilot program for state employees.
The state’s report summary showed that state employees used AI most often for writing assistance, text generation and drafting emails. It also showed that 27% of employees used it for research, idea exploration and learning new topics, and that 85% of commonwealth employees had a positive experience using AI for daily tasks such as simplifying job descriptions and legal documents.
“The report also identifies AI inaccuracies and hallucinations as a potential challenge of using AI tools, and makes clear that any and all AI materials must be subject to expert review by a commonwealth employee to correct potential inaccuracies,” said Daniel Egan, a spokesman for the governor’s office. “The Shapiro administration has consistently emphasized that generative AI is not a replacement for the knowledge and experience of commonwealth employees.”
On Aug. 6, following Pennsylvania’s pilot program, OpenAI announced “OpenAI for Government,” a new partnership with the U.S. General Services Administration that will provide the entire federal executive branch workforce with ChatGPT for the next year.
When asked how false information would impact an office setting, such as employees working in government offices, Cash said the study’s findings did not determine whether LLMs were good or bad.
“I do think they present a word of caution, that when you are working with them, you have to remember that sometimes they’re going to make mistakes,” he said. “But people are increasingly going toward LLMs for more important things like medical diagnoses, legal advice, which [I] have some concern about people doing that.”
Megan Trotter is a TribLive staff writer. She can be reached at mtrotter@triblive.com.