A team of computer scientists at Purdue University has found that the popular language model ChatGPT is frequently inaccurate when responding to computer programming questions. In their paper published in the Proceedings of the CHI Conference on Human Factors in Computing Systems, the researchers detailed how they sourced questions from the Stack Overflow website, posed them to ChatGPT, and measured the model's accuracy.

The findings were also presented at the CHI 2024 Conference held from May 11-16.

ChatGPT and other large language models (LLMs) have garnered significant attention recently and become widely popular with the general public. However, despite the wealth of useful information these models can provide, they often produce inaccurate answers. The troubling aspect is that it is not always evident when a response is incorrect.

In this study, the Purdue team noted that many programming students have started using LLMs not only to help write code for assignments but also to answer programming-related questions. For instance, a student might ask ChatGPT, "What is the difference between bubble sort and merge sort?" or "What is recursion?"
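For readers unfamiliar with those example questions, a short Python sketch (an illustration of the concepts, not material from the study) contrasts the two sorting algorithms; `merge_sort` also demonstrates recursion, since it calls itself on smaller halves of the input:

```python
def bubble_sort(xs: list[int]) -> list[int]:
    """Iterative O(n^2) sort: repeatedly swap adjacent out-of-order pairs."""
    xs = xs[:]  # work on a copy so the caller's list is untouched
    for i in range(len(xs)):
        for j in range(len(xs) - 1 - i):
            if xs[j] > xs[j + 1]:
                xs[j], xs[j + 1] = xs[j + 1], xs[j]
    return xs

def merge_sort(xs: list[int]) -> list[int]:
    """Recursive O(n log n) sort: split the list, sort each half, merge."""
    if len(xs) <= 1:  # base case: a list of 0 or 1 items is already sorted
        return xs
    mid = len(xs) // 2
    left, right = merge_sort(xs[:mid]), merge_sort(xs[mid:])  # recursive calls
    merged, i, j = [], 0, 0
    while i < len(left) and j < len(right):  # merge the two sorted halves
        if left[i] <= right[j]:
            merged.append(left[i]); i += 1
        else:
            merged.append(right[j]); j += 1
    return merged + left[i:] + right[j:]

print(bubble_sort([5, 1, 4, 2]))  # [1, 2, 4, 5]
print(merge_sort([5, 1, 4, 2]))   # [1, 2, 4, 5]
```

The key conceptual difference the student's questions probe: bubble sort works by iteration over adjacent pairs, while merge sort is recursive, repeatedly reducing the problem to smaller instances of itself until a trivial base case is reached.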

To evaluate the accuracy of LLMs in answering such questions, the researchers focused specifically on ChatGPT. They used questions freely available on Stack Overflow, a platform designed to help programmers learn by collaborating and sharing knowledge, where users post questions that are then answered by others with relevant expertise.

The research team selected 517 questions from Stack Overflow and assessed how often ChatGPT answered them correctly. The study primarily used the GPT-3.5 model available in the free version of ChatGPT, manually posing those 517 questions, and employed the GPT-3.5-turbo API for a larger automated test on an additional 2,000 questions. Data collection took place in March 2023. The results were discouraging: 52% of ChatGPT's answers contained incorrect information. The answers were also often more verbose than those from human experts. The researchers additionally spot-checked the GPT-4 model on 21 randomly selected questions that GPT-3.5 had answered incorrectly; GPT-4 performed slightly better, answering 6 of the 21 correctly, but still got the majority (15 out of 21) wrong.

Alarmingly, the team found that study participants preferred the answers given by ChatGPT 35% of the time. Moreover, these participants often failed to notice the mistakes in ChatGPT's responses, overlooking incorrect answers 39% of the time.

More: https://techxplore.com/news/2024-05-scientists-chatgpt-inaccurate.html