Yet when asked when she crossed the English Channel, ChatGPT 4.0 confidently proclaimed that “Amer […] completed her swim across the English Channel on September 17, 2021.” She was the first female Egyptian athlete to do so, the system added.
As users of chatbots and answer engines powered by ChatGPT and Google Gemini have discovered, AI sometimes churns out gibberish in response to seemingly basic queries. It will even double down on incorrect responses when questioned or re-prompted.
While some errors are blatantly nonsensical, others, such as the example above, are more insidious because they may sound plausible. AI researchers refer to these confident confabulations produced by Large Language Models (LLMs) as “hallucinations.”
LLMs are built to generate plausible-sounding text rather than verified facts, explains Sebastian Farquhar, a computer scientist at the University of Oxford and a co-author on the new study. “By design, LLMs are not trained to produce truths, per se, but plausible strings of words,” he says.
That’s a problem as AI expands into more domains, he adds: “As large language models are integrated into applications like healthcare and education, detecting and avoiding hallucinations will be a critical step towards trustworthiness and reliability.”
But sussing out LLM hallucinations has proven tricky because AI models operate on complex algorithms and data, making it a challenge to understand the reasoning behind their responses or to detect the source of a confabulation.
Farquhar’s paper, published today in Nature, uses a method that measures “semantic entropy,” essentially the randomness of the responses, to catch AI’s untruths. “If I wanted to check if you’re just making things up at random, I might ask you the same question over and over again,” he explains. “If you give a different answer every time … something’s not right.”
The amount of entropy was measured by a second LLM that focused on the meaning and nuance of the generated responses, rather than just the words used.
For example, the researchers asked an LLM this question: “Which sector of construction would building refineries, mills and manufacturing plants fall under?” The model generated three different answers: “All the above are under the industrial sector of construction;” “These are all under the heavy industrial sector of construction;” and “The refineries, process chemical, power generation, mills and manufacturing plants are under the industrial sector of construction.”
Then, Farquhar asked the second LLM to calculate how similar in meaning those responses were. In this case, while the answers all used different vocabulary, their meanings were roughly similar, earning them a low semantic entropy score, which indicates the model’s response is likely to be reliable. Responses to the same query that contained vastly different meanings earned high entropy scores, signaling possible confabulations.
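The idea can be illustrated with a minimal Python sketch. It is not the authors' implementation: it groups sampled answers into clusters of shared meaning and computes a simple count-based entropy over those clusters. The function names are invented for this illustration, and `toy_same_meaning` is a crude keyword check standing in for the second LLM that judges meaning in the study. The "scattered" answers are made up to show the contrast.

```python
import math
from typing import Callable, List

def cluster_by_meaning(answers: List[str],
                       same_meaning: Callable[[str, str], bool]) -> List[List[str]]:
    """Greedily group answers that the judge says mean the same thing."""
    clusters: List[List[str]] = []
    for answer in answers:
        for cluster in clusters:
            if same_meaning(cluster[0], answer):
                cluster.append(answer)
                break
        else:
            clusters.append([answer])  # no match: start a new meaning cluster
    return clusters

def semantic_entropy(answers: List[str],
                     same_meaning: Callable[[str, str], bool]) -> float:
    """Shannon entropy (in nats) over the share of answers in each cluster."""
    clusters = cluster_by_meaning(answers, same_meaning)
    total = len(answers)
    return sum((len(c) / total) * math.log(total / len(c)) for c in clusters)

# ASSUMPTION: a toy keyword judge, standing in for the second LLM.
SECTOR_KEYWORDS = ("industrial", "residential", "commercial")

def toy_same_meaning(a: str, b: str) -> bool:
    def label(text: str) -> str:
        lowered = text.lower()
        return next((k for k in SECTOR_KEYWORDS if k in lowered), lowered)
    return label(a) == label(b)

# The three answers quoted in the article: different words, same meaning.
consistent = [
    "All the above are under the industrial sector of construction",
    "These are all under the heavy industrial sector of construction",
    "The refineries, process chemical, power generation, mills and "
    "manufacturing plants are under the industrial sector of construction",
]
# Invented answers that disagree in meaning, for contrast.
scattered = [
    "The residential sector",
    "The industrial sector",
    "The commercial sector",
]

print(f"{semantic_entropy(consistent, toy_same_meaning):.2f}")  # 0.00
print(f"{semantic_entropy(scattered, toy_same_meaning):.2f}")   # 1.10
```

A score of zero means every sampled answer landed in one meaning cluster. The second score is the natural log of 3, the maximum for three answers, because each answer landed in a cluster of its own.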
To validate their system, the researchers also asked two human raters to answer the same question. A third LLM then compared the answers produced by the first LLM with those of one of the human raters. They found that the human raters agreed with each other 92% of the time, and with the LLM judge 93% of the time, Farquhar says, indicating their method had a high degree of accuracy.
“I think what they're doing is a clever trick,” says Philippe Laban, a scientist working on Natural Language Processing (NLP) Systems at Salesforce Research in New York. Laban says it reminds him of the “good cop, bad cop” strategy, in which police officers ask a suspect different versions of the same question. “If you're persistent with your story, your narrative, then probably you’re [telling the truth],” he says.
Karin Verspoor of the School of Computing Technologies at RMIT University in Melbourne uses another analogy: Farquhar’s system is like “fighting fire with fire,” she writes in a commentary in Nature. “The authors propose that LLMs could form an integral component of a strategy for controlling LLMs.”
But Graham Neubig, an NLP expert at Carnegie Mellon University, notes that the authors did not use state-of-the-art models in their testing or compare their approach to existing ones. For example, Google Gemini already uses a method known as “self-consistency” that involves generating multiple responses to the same prompt and taking the majority vote as the final answer. Neubig suggests Farquhar and colleagues may have “reinvented the wheel.”
“We did have some trouble in this work that the state-of-the-art advances so quickly,” Farquhar acknowledges. But, he adds, “We have run experiments on three generations of models and always gotten consistent results. There’s also nothing about the method that is sensitive to a specific model that’s used.”
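For comparison, the “self-consistency” approach Neubig describes can be sketched in a few lines of Python. This is a generic illustration of majority voting over sampled answers, not Google's implementation. The function name and the lowercase-and-trim normalization are assumptions made for the example.

```python
from collections import Counter
from typing import List, Tuple

def majority_vote(answers: List[str]) -> Tuple[str, float]:
    """Return the most common answer and the share of samples that agree with it."""
    # ASSUMPTION: answers are matched on their lowercased, trimmed wording only.
    normalized = [a.strip().lower() for a in answers]
    winner, count = Counter(normalized).most_common(1)[0]
    return winner, count / len(normalized)

winner, share = majority_vote(["Paris", "paris ", "Lyon"])
print(winner)  # paris
```

Because this sketch matches answers on wording, two differently phrased answers with the same meaning count as separate votes. Semantic entropy instead groups answers by meaning before measuring how much they disagree.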
Farquhar says one advantage of the method is that it’s relatively straightforward to integrate into existing AI models. The downside is that it delays the AI’s responses somewhat and comes with a hefty computational cost.
And he stresses that the method won’t solve all of AI’s hallucination problems. It may not detect an error if the LLM simply sticks to its false narrative, for example, repeating it over and over. This can happen if the model has been trained on inaccurate data. “There are still ways models can go wrong that are not addressed by our method at all,” he says.
