By now, the tendency for chatbots powered by artificial intelligence (AI) to occasionally make stuff up, or “hallucinate,” has been well documented. Chatbots have generated medical misinformation, invented fake legal cases, and fabricated citations. Now, a new study has found that AI models are not only seeing things, but hearing things: OpenAI’s Whisper, an AI model trained to transcribe audio input, made up sentences in about 1.4% of the transcriptions of audio recordings tested. Disconcertingly, a large portion of the fabricated sentences contained offensive or potentially harmful text.
“Doctors are using speech-to-text tools to transcribe patient notes,” says Allison Koenecke, a computer scientist at Cornell University and lead author of the study, which was posted as a preprint to arXiv in February. “If Whisper is making up some transcription that isn’t being said, about how this patient killed someone and they are also taking this medication that is totally made up, imagine how severe those consequences are.”
The study underscores the challenges that transcription tools face: the diversity of speech patterns around the world, as well as the limited availability of training data, says Odette Scharenborg, a technologist at the Delft University of Technology who was not involved in the study. “No speech-to-text system is able to model all this variability in articulation and pronunciation to a good level yet,” she says.
Generative chatbots rely on large language models (LLMs), which take text prompts and produce outputs by predicting likely words based on patterns learned after training on billions of pages of text from books and webpages. The transcription systems combine those language models with audio models that learn representations of speech patterns.
The large AI models work well: Their transcriptions are more accurate than those of other speech-to-text tools that rely on small-scale language models. But Koenecke wanted to take a closer look. “Even if the performance looks better than average, we have these edge cases within the text itself that we are worried people might miss if they are assuming that Whisper is transcribing everything faithfully,” she says.
Koenecke’s team gave Whisper about 20 hours of audio each from speakers with and without aphasia, a language disorder in which people tend to speak slowly and with more pauses. The audio segments contained conversational dialogue on topics such as personal stories and fairy-tale retellings. In runs conducted in April and May 2023 on an earlier version of Whisper, the researchers found that 1.7% of the audio segments from speakers with aphasia and 1.2% of the audio segments from speakers without aphasia resulted in transcriptions with some fabricated text.
About 40% of the fabricated segments were harmful or concerning in some way. About half of the concerning fabrications alluded to violence, sexual innuendo, or demographic stereotypes. For instance, audio about fire department rescues of cats included concocted additions about a “blood-soaked stroller” and “fondling.” Innocuous audio about an umbrella included fabrications about a “terror knife” and killing people.
The researchers grouped the remaining concerning audio hallucinations into two other categories: false information regarding a person, such as made-up names, relationships, or health status; and false authority or phishing, such as tacking on YouTube-style signoffs (“Thank you for watching!”) to the end of transcripts, or inserting links to actual or nonexistent websites.
The researchers say long pauses in speech, which are common in people with aphasia, could be one reason for the made-up text. “Silences, ‘umms,’ or ‘aahs’ in the audio doesn’t get interpreted as silence,” Koenecke says. The language model picks these up as real words and creates entire fictional sentences, she adds.
The fabrications could also be a result of the generative nature of the underlying language model that the transcription tool is wedded to, the researchers write. According to the company Vectara’s evaluation leaderboard, which tracks the performance of various industry LLMs, hallucination rates can still be as high as 16.2%. OpenAI’s GPT-4 Turbo is the industry’s best, with a 2.5% rate.
Because OpenAI itself has mentioned fabrications as a limitation of the model, Koenecke wasn’t surprised to find them. “What was surprising to me was the sheer number of really harmful hallucinations,” she says. OpenAI did not respond to a request for comment.
Since the original experiment was conducted, OpenAI has updated its model to skip periods of silence and retranscribe audio if the software detects a probable hallucination. In December 2023, when the researchers reran some of the audio files, they found that the updated Whisper had eliminated most of the fabrications found in their previous tests. Performing regular audits to check for hallucinations and incorporating that feedback into the models, as OpenAI apparently did, will help ensure better models, Koenecke says.
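To make the idea of skipping silence concrete, here is a minimal Python sketch of one common approach: measuring the loudness of short frames of audio and keeping only the stretches that rise above a threshold. This is an illustration under stated assumptions, not a description of how OpenAI's update works. The function names, the frame size, and the threshold value are all hypothetical choices made for the example.

```python
import math
from typing import List, Sequence, Tuple


def frame_rms(frame: Sequence[float]) -> float:
    """Root-mean-square amplitude of one frame of samples."""
    if not frame:
        return 0.0
    return math.sqrt(sum(s * s for s in frame) / len(frame))


def non_silent_spans(
    samples: Sequence[float],
    frame_size: int = 160,     # hypothetical: 10 ms of audio at 16 kHz
    threshold: float = 0.01,   # hypothetical loudness cutoff
) -> List[Tuple[int, int]]:
    """Return (start, end) sample indices of stretches louder than threshold.

    Frames whose RMS amplitude falls below the threshold are treated as
    silence and left out, so a downstream transcriber never sees them.
    """
    spans: List[Tuple[int, int]] = []
    start = None
    for i in range(0, len(samples), frame_size):
        voiced = frame_rms(samples[i:i + frame_size]) >= threshold
        if voiced and start is None:
            start = i                      # a loud stretch begins
        elif not voiced and start is not None:
            spans.append((start, i))       # the loud stretch ends
            start = None
    if start is not None:
        spans.append((start, len(samples)))
    return spans


# Toy signal: silence, speech-like audio, a pause, then more audio.
toy_audio = [0.0] * 320 + [0.5] * 320 + [0.0] * 160 + [0.3] * 160
spans = non_silent_spans(toy_audio)
```

On the toy signal, only the two loud stretches survive, and the pause between them is dropped. A simple energy threshold like this one would still let through filler sounds such as "umm," which is one reason production systems tend to rely on more sophisticated detection.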
In a statement, an OpenAI spokesperson said, “We continually conduct research on how we can improve the accuracy of our models, including how we can reduce hallucinations. We thank the researchers for sharing their findings—we release regular model updates which incorporate feedback we’ve received, and we continue to improve on hallucinations over time.”
The researchers studied only speech patterns stemming from aphasia. But they say other types of irregular speech, such as audio from the elderly or nonnative speakers, could also result in hallucinated text. “Training of these systems should be done on speech from different speaker groups and from different speaking styles,” Scharenborg says.
In the meantime, Scharenborg strongly advises that users manually check the output of any AI transcription tool, especially if they’re using it to make important decisions. “Not only because of potential hallucinations,” she says, “but also because all speech-to-text transcription systems make transcription errors.”
More: https://www.science.org/content/article/ai-transcription-tools-hallucinate-too
