ChatGPT, OpenAI’s fabulating chatbot, produces wrong answers to software programming questions more than half the time, according to a study from Purdue University. That said, the bot was convincing enough to fool a third of participants.
The Purdue team analyzed ChatGPT’s answers to 517 Stack Overflow questions to assess their correctness, consistency, comprehensiveness, and conciseness. The US academics also conducted linguistic and sentiment analysis of the answers, and questioned a dozen volunteer participants on the results generated by the model.
“Our analysis shows that 52 percent of ChatGPT answers are incorrect and 77 percent are verbose,” the team’s paper concluded. “Nonetheless, ChatGPT answers are still preferred 39.34 percent of the time due to their comprehensiveness and well-articulated language style.” Among the set of preferred ChatGPT answers, 77 percent were wrong.
OpenAI on the ChatGPT website acknowledges its software “may produce inaccurate information about people, places, or facts.” We’ve asked the lab if it has any comment about the Purdue study.
The pre-print paper is titled, “Who Answers It Better? An In-Depth Analysis of ChatGPT and Stack Overflow Answers to Software Engineering Questions.” It was written by researchers Samia Kabir, David Udo-Imeh, Bonan Kou, and assistant professor Tianyi Zhang.
“During our study, we observed that only when the error in the ChatGPT answer is obvious, users can identify the error,” their paper stated. “However, when the error is not readily verifiable or requires external IDE or documentation, users often fail to identify the incorrectness or underestimate the degree of error in the answer.”
Even when the answer had a glaring error, the paper stated, two of the 12 participants still marked the response as preferred. The paper attributes this to ChatGPT’s pleasant, authoritative style.
“From semi-structured interviews, it is apparent that polite language, articulated and text-book style answers, comprehensiveness, and affiliation in answers make completely wrong answers seem correct,” the paper explained.
They do say always be polite…
“The cases where participants preferred incorrect and verbose ChatGPT’s answers over Stack Overflow’s answers were due to several reasons, as reported by the participants,” Samia Kabir, a doctoral student at Purdue and one of the paper’s authors, told The Register.
“One of the main reasons was how detailed ChatGPT’s answers are. In many cases, participants did not mind the length if they are getting useful information from lengthy and detailed answers. Also, positive sentiments and politeness of the answers were the other two reasons.
“Participants ignored the incorrectness when they found ChatGPT’s answer to be insightful. The way ChatGPT confidently conveys insightful information (even when the information is incorrect) gains user trust, which causes them to prefer the incorrect answer.”
Kabir said the user study is intended to complement the in-depth manual and large-scale linguistic analysis of ChatGPT answers.
“Nevertheless, it would always be beneficial to have a bigger sample size,” she said. “We also welcome other researchers to reproduce our study – our dataset is publicly available to foster future research.”
The authors observe that ChatGPT answers contain more “drives attributes” – language that suggests accomplishment or achievement – but don’t describe risks as frequently as Stack Overflow posts.
“On many occasions we observed ChatGPT inserting words and phrases such as ‘of course I can help you’, ‘this will certainly fix it’, etc,” the paper stated.
Among other findings, the authors report that ChatGPT is more likely to make conceptual errors than factual ones. “Many answers are incorrect due to ChatGPT’s incapability to understand the underlying context of the question being asked,” the paper found.
The authors’ linguistic analysis of ChatGPT answers and Stack Overflow answers suggests the bot’s responses are “more formal, express more analytic thinking, showcase more efforts towards achieving goals, and exhibit less negative emotion.” And their sentiment analysis concluded ChatGPT answers express “more positive sentiments” than Stack Overflow answers.
Kabir said, “From our findings and observation from this research, we would suggest that Stack Overflow may want to incorporate effective methods to detect toxicity and negative sentiments in comments and answers in order to improve sentiment and politeness.
“We also think that Stack Overflow may want to improve the discoverability of their answers to help in finding useful answers. Additionally, Stack Overflow may want to provide more specific guidelines to help answerers structure their answers, eg: in a step-by-step, detail-oriented manner.”
Stack Overflow versus an overflowing stack
There’s some positive news here for Stack Overflow, which in 2018 was called out for being the source of incorrect code snippets in about 15 percent of 1.3 million Android apps. In the study, 60 percent of respondents found the (presumably) human-authored answers to be more correct, concise, and useful.
Nonetheless, Stack Overflow’s use seems to have declined, though the amount is disputed. It appears traffic has been down six percent every month since January 2022 and was down 13.9 percent in March, according to an April report from SimilarWeb that suggested usage of ChatGPT may be contributing to the decline.
Community members from Stack Exchange, the network of Q&A sites that includes Stack Overflow, have apparently come to a similar conclusion, based on a drop in new question activity, new answers being posted to the site, and new user registrations.
Stack Overflow, under new ownership since 2021, disagreed with SimilarWeb’s assessment in an email to The Register.
A spokesperson said the biz in May 2022 recategorized its analytics cookie from a “Strictly Necessary” to a “Performance” cookie and, in September 2022, shifted to Google Analytics version 4, both of which affect traffic reporting and comparisons over time.
“Although we have seen a small decline in traffic, in no way is it what the graph is showing,” the company spokesperson told us. “This year, overall, we’re seeing an average of ~5 percent less traffic compared to 2022.
“That said, Stack Overflow’s traffic, along with traffic to many other sites, has been impacted by the surge of interest in ChatGPT over the last few months. In April of this year, we saw an above average traffic decrease (~14 percent), which we can likely attribute to developers trying GPT-4 after it was released in March. Our traffic also changes based on search algorithms, which have a big influence on how our content is discovered.”
Asked about the study’s findings, Stack Overflow’s spokesperson said no one at the outfit had time to explore the report.
“We know there is no shortage of ways how developers can leverage AI, however from our own findings, there is one core deterrent in its adoption – trust in the accuracy of AI-generated content,” the rep said.
“Stack Overflow’s annual Developer Survey of 90,000 coders recently found that 77 percent of developers are favorable of AI tools, but only 42 percent trust the accuracy of those tools. OverflowAI developed with community at the core and with a focus on the accuracy of data and AI-generated content.
“With OverflowAI, we are offering the ability to check, validate, attribute and confirm accuracy and trustworthiness across the Stack Overflow community and its more than 58 million questions and answers.”