52% are incorrect or 52% contain inaccuracies?

    Kabir et al. (2023) (a preprint on arXiv) compares ChatGPT 3.5 Turbo API answers¹ with Stack Overflow answers to 517 questions found on Stack Overflow. Two attention-grabbing tweets about the paper, with slightly different quotes, raised questions for me about their method for identifying an answer as incorrect:

    Qualifier: there is more to Kabir et al. (2023) than the exciting takes on Twitter.

    @timnitGebru via Twitter on Aug 9, 2023

    “Our analysis shows that 52% of ChatGPT answers are incorrect and 77% are verbose…ChatGPT answers are still preferred 39.34% of the time due to their comprehensiveness and well-articulated language style.”

    Great that Stack Overflow is being destroyed by OpenAI +friends.

    @GaryMarcus via Twitter on Aug 10, 2023

    Is there a name for this? “Our examination revealed that 52% of ChatGPT’s answers contain inaccuracies and 77% are verbose. Nevertheless, users still prefer ChatGPT’s responses 39% of the time due to their comprehensiveness and articulate language style.”

    While I didn’t find it explicitly noted in the paper, they seem to mark an answer as a whole “incorrect” if it contains anything incorrect among the four types they coded. Here is p. 4:

    For Correctness, we compared ChatGPT answers with the accepted SO answers and also resorted to other online resources such as blog posts, tutorials, and official documentation. Our codebook includes four types of correctness issues— Factual, Conceptual, Code, and Terminological incorrectness. Specifically, for incorrect code examples embedded in ChatGPT answers, we identified four types of code-level incorrectness—Syntax errors and errors due to Wrong Logic, Wrong API/Library/Function Usage, and Incomplete Code.
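
    To make that reading concrete, here is a minimal sketch of the coding scheme as I understand it, where any single issue of any of the four types flips the whole answer to “incorrect”. The function and issue-type names are my own illustration, not the authors’ instrument:

    ```python
    # Sketch of answer-level coding as I read p. 4 (my names, not the authors').
    CORRECTNESS_ISSUE_TYPES = {"factual", "conceptual", "code", "terminological"}

    def label_answer(issues: set[str]) -> str:
        """Collapse per-type issue flags into one answer-level label."""
        return "incorrect" if issues & CORRECTNESS_ISSUE_TYPES else "correct"

    # One terminological slip and multiple issue types get the same label:
    print(label_answer({"terminological"}))      # incorrect
    print(label_answer({"code", "conceptual"}))  # incorrect
    print(label_answer(set()))                   # correct
    ```

    On that reading, the headline 52% counts an answer with a single terminological slip the same as an answer whose code will not run.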

    This answer-level “correctness” is striking to me (given my interest in how people perceive and perform-with tool outputs). For many multi-part problems, a single incorrect step in an answer produces a failure. But that does not hold for all problems: many are resolved iteratively, where drafts of initial answers are built upon in subsequent answers.

    Aside: I do not believe it is clear that “Stack Overflow is being destroyed”, nor what the costs or causes may be. A full analysis of LLM answers would clearly need to incorporate many downstream interactions with the same people and systems that produce so much training data, but it seems strange to me that more granularity is not used here, especially since practical utility tests were not conducted or generalized to.
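
    For illustration only, here is one way per-type granularity could be reported alongside the answer-level figure. The three toy answers are invented for the example; nothing here reflects the paper’s actual data:

    ```python
    from collections import Counter

    # Invented toy labels for three answers (illustration only).
    answers = [
        {"factual"},             # one factual slip
        {"code", "conceptual"},  # two issue types
        set(),                   # no issues
    ]

    n = len(answers)
    # Answer-level rate, in the style of the paper's headline figure.
    incorrect_rate = sum(bool(a) for a in answers) / n
    # Per-type incidence, the granularity I would have liked to see.
    per_type = Counter(t for a in answers for t in a)

    print(f"answer-level incorrect: {incorrect_rate:.0%}")  # 67%
    for issue_type, count in sorted(per_type.items()):
        print(f"{issue_type}: {count / n:.0%} of answers")
    ```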


    Footnotes

    1. A common critique on Twitter is that the paper did not evaluate answers from the latest model, GPT-4. That critique should at least be addressed in revisions of the paper.↩︎

    References

    Kabir, S., Udo-Imeh, D. N., Kou, B., & Zhang, T. (2023). Who answers it better? An in-depth analysis of ChatGPT and Stack Overflow answers to software engineering questions. arXiv. http://arxiv.org/abs/2308.02312