Many folks look to take various prompts, test the LLMs against them, and score the outputs with some local measure of accuracy or something similar. For instance, Kabir et al. (2023), recently popular on Twitter, looks at “correctness”, among other qualities. That is all well and good, if mapping prompts-to-outputs against pre-determined (or post hoc calculated) answers supports some understanding, for their purposes, of the capabilities of the technical systems in question.
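For concreteness, here is roughly the kind of evaluation I mean, as a minimal sketch: the tiny harness, the toy prompt/answer pairs, and the exact-match scoring below are my own illustration (not Kabir et al.’s setup or any real benchmark), with the model call stubbed out.

```python
# A deliberately "atomic" evaluation: fixed prompts in, outputs out, one
# local accuracy number at the end. Names and data here are illustrative.
from typing import Callable

# (prompt, pre-determined answer) pairs -- the mapping such evaluations assume
EVAL_SET = [
    ("What does HTTP status 404 mean?", "not found"),
    ("Which Python keyword defines a function?", "def"),
]

def exact_match(output: str, answer: str) -> bool:
    """One 'local measure of accuracy': does the expected answer appear verbatim?"""
    return answer.lower() in output.lower()

def evaluate(ask_model: Callable[[str], str]) -> float:
    """Score any prompt-to-text callable against the fixed prompt/answer pairs."""
    hits = sum(exact_match(ask_model(prompt), answer) for prompt, answer in EVAL_SET)
    return hits / len(EVAL_SET)

if __name__ == "__main__":
    # A canned stand-in "model" so the sketch runs end to end without any API.
    canned = dict(EVAL_SET)
    print(f"accuracy: {evaluate(lambda p: canned[p]):.2f}")
```

Everything the score can see is inside that loop; what someone does with an output once they have it is out of frame.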
I’m generally more interested in how the broader sociotechnical system (the human or organization using the tool) performs: how do people perceive and perform-on or perform-with the outputs?1 This connects with my dissertation research into the seemingly successful search practices of data engineers. I did not look at how perfect queries gave data engineers perfect search results, or at how they filtered results to extract only the accurate ones, but at how they used web search to inch towards addressing their problems. This included evaluating results along multiple paths (where the value of results to them was neither binary nor dependent on some abstracted notion of ‘accuracy’).2
See, for instance, this line from Simon Willison (in a TIL write-up about his use of GPT-4 to help him write a small tool)3:
That wasn’t exactly what I needed, but it was very easy to edit that into the final program.
So, to me there is a big difference between evaluating atomic text (even “systematic” evaluations) & evaluating how various people might take it up within their particular situation.
This connects strongly with Mulligan’s and my use of “results-of-search” (Mulligan & Griffin, 2018) to distinguish the raw “search results” from how people perceive and use them. Lurie & Mulligan (2021) take up that idea in their analysis of Google’s search results for queries meant to identify congressional representatives. They do not focus on mere inaccuracy (though one could!) but on whether the search results are likely to mislead.
See “Reminders” and mentions of “contexts-of-use” at these pages for how I have discussed evaluating results as an external observer, looking at factors as wide-ranging as “reading ease”, “quality of advice” (distinct from providing concrete and singular answers), and not being misleading:
You could evaluate on the basis of factors other than accuracy. Liu et al. (2023) look at “verifiability”:
A prerequisite trait of a trustworthy generative search engine is verifiability, that is, each generated statement about the external world should be fully supported by a set of in-line citations, and each provided citation should support its associated statement. Verifiability enables readers to easily check that any generated statement is supported by its cited source. [internal footnotes omitted; emphasis in original]
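As I read Liu et al., they operationalize this with human-judged citation recall (the share of statements fully supported by their citations) and citation precision (the share of citations that support their associated statement). The sketch below is only my own toy rendering of those two ratios: the `Statement` structure and the `supports()` stub are mine, and where the paper relies on human annotators (and allows a set of citations to jointly support a statement) this version just checks each citation alone, with a crude word-overlap test, so that it runs.

```python
from dataclasses import dataclass, field

@dataclass
class Statement:
    """A generated statement plus the in-line citations attached to it."""
    text: str
    citations: list[str] = field(default_factory=list)  # cited source passages

def supports(source: str, statement: str) -> bool:
    # Stand-in for the support judgment (a human annotation in Liu et al.);
    # a crude word-overlap check is used here only so the sketch executes.
    words = set(statement.lower().split())
    return bool(words) and len(words & set(source.lower().split())) >= len(words) // 2

def citation_recall(statements: list[Statement]) -> float:
    """Share of statements supported by at least one of their citations."""
    if not statements:
        return 0.0
    supported = sum(any(supports(c, s.text) for c in s.citations) for s in statements)
    return supported / len(statements)

def citation_precision(statements: list[Statement]) -> float:
    """Share of provided citations that support their associated statement."""
    pairs = [(c, s.text) for s in statements for c in s.citations]
    if not pairs:
        return 0.0
    return sum(supports(c, text) for c, text in pairs) / len(pairs)
```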
Or, perhaps, relevance. I have not yet seen it applied narrowly to generative search.4 But “relevance” also stands distinct from accuracy (and draws on the recognition that many questions do not lend themselves to responses graded by accuracy), though it introduces many problems of its own. One particular examination of relevance in search results is that of “societal relevance” (Sundin et al., 2021). This moves around, aside, or above narrow “contexts-of-use” (and user intent) and appears related to the efforts to shape outputs through Reinforcement Learning from Human Feedback or Constitutional AI, among others (many of the questions raised by Sundin et al. (in § 5. IDEAS FOR FUTURE RESEARCH) are applicable here).
As I was preparing to share my “a toothpick, a bowl of pudding, a full glass of water, and a marshmallow” post, I was reflecting on these differences.
I wrote this in a brief addendum to the above post:
I want to know if users are hurting themselves or others as they imagine, talk about, and use the tools. I want to know how we can know that. I want to know how we can shift the design and use of the tools to encourage particular downstream effects. Someone considering picking up or prohibiting a particular tool will have very different considerations based on their specific situation. We need situated observations of tool use, not just system transparency or systematic evaluations.
And I shared from Lurie & Mulligan in a follow-on tweet:
Kabir et al. (2023) does engage somewhat in this, but I’d like to see stronger arguments made for how their user study improves our understanding of LLM use ‘in the wild’. Key issues I see: (1) The programming problems, and (2) their phrasing, are not identified and developed in the course of the participants’ own work. (Granted, sometimes people step in to help others problem solve, and so did not choose the problem searched or the phrasing of a problem, but this is not particularly naturalistic. It reminds me of Google’s Search Quality Rating Guidelines, decontextualized ‘user intent’, and the limits of objectivity. As Bilić (2016) writes (p. 1): “The Search Quality Rating Guidelines document provides a rare glimpse into the internal architecture of search algorithms and the notions of utility and relevance which are presented and structured as neutral and objective.” As Meisner et al. (2022) discuss re Google’s own human evaluators, even though this study looks at real user queries, evaluations are standardized (and unable to assess whether needs were met) because the evaluators/participants cannot actually “represent the user” (p. 3).) (3) The participants do not have the opportunity to test the answers by running the code. (Rather, “participants were encouraged to use external verification methods such as Google search, tutorials, and documentation” (p. 5).) (4) The answers are not used as inputs in some larger work output, so measures of success are artificial (“correctness, consistency, comprehensiveness, and conciseness”, user-stated preference).↩︎
See, for instance, Ch. 4. Extending searching > Spaces for evaluation↩︎
I previously noted this in a tweet: “Being comfortable with this seems pretty key”↩︎
Re ‘relevance’ as a dimension in evaluating generated outputs: I have not looked much into this either. OpenAI’s GPT-4 Technical Report is focused on factuality, accuracy, and safety. Is this perhaps because of the design of the NLP benchmarks that are driving some of this research? Glancing at a few documents from Google about Bard also reveals little regard for relevance; see the emphasis on these benchmarks in the § Evaluating PaLM 2 in an intro from Google. Anthropic’s Introduction to Prompt Design mentions the concept only in a cursory introduction to core terms used to describe interaction with the models: “The text that you give Claude is designed to elicit, or “prompt”, a relevant output. A prompt is usually in the form of a question or instructions.” [emphasis added] Again, I have not done a thorough review on this question, and am thinking-by-writing here. This seeming absence of relevance from discussion is striking also given that the Shah & Bender (2022) paper does engage with relevance throughout. Metzler et al. (2021), one of the foils for Shah and Bender, seems to use “relevance” largely dismissively, as a criterion only for traditional information retrieval systems (see also: rethinking rethinking search). They include the following in their non-exhaustive list of properties exhibited in a high-quality response:
Perhaps relevance is assumed?↩︎
Bilić, P. (2016). Search algorithms, hidden labour and information control. Big Data & Society, 3(1), 2053951716652159. https://doi.org/10.1177/2053951716652159 [bilić2016search]
Kabir, S., Udo-Imeh, D. N., Kou, B., & Zhang, T. (2023). Who answers it better? An in-depth analysis of ChatGPT and Stack Overflow answers to software engineering questions. http://arxiv.org/abs/2308.02312 [kabir2023answers]
Liu, N. F., Zhang, T., & Liang, P. (2023). Evaluating verifiability in generative search engines. https://doi.org/10.48550/arXiv.2304.09848 [liu2023evaluating]
Lurie, E., & Mulligan, D. K. (2021). Searching for representation: A sociotechnical audit of googling for members of U.S. Congress. https://arxiv.org/abs/2109.07012 [lurie2021searching_facctrec]
Meisner, C., Duffy, B. E., & Ziewitz, M. (2022). The labor of search engine evaluation: Making algorithms more human or humans more algorithmic? New Media & Society, 0(0), 14614448211063860. https://doi.org/10.1177/14614448211063860 [meisner2022labor]
Metzler, D., Tay, Y., Bahri, D., & Najork, M. (2021). Rethinking search: Making domain experts out of dilettantes. SIGIR Forum, 55, 1–27. https://research.google/pubs/pub50545/ [metzler2021rethinking]
Mulligan, D. K., & Griffin, D. (2018). Rescripting search to respect the right to truth. The Georgetown Law Technology Review, 2(2), 557–584. https://georgetownlawtechreview.org/rescripting-search-to-respect-the-right-to-truth/GLTR-07-2018/ [mulligan2018rescripting]
Shah, C., & Bender, E. M. (2022, March). Situating search. ACM SIGIR Conference on Human Information Interaction and Retrieval. https://doi.org/10.1145/3498366.3505816 [shah2022situating]
Sundin, O., Lewandowski, D., & Haider, J. (2021). Whose relevance? Web search engines as multisided relevance machines. Journal of the Association for Information Science and Technology. https://doi.org/10.1002/asi.24570 [sundin2021relevance]