The Need for ChainForge-like Tools in Evaluating Generative Web Search Platforms

    This is a provocation.

    This post emphasizes the need to evaluate generative search tools and platforms comprehensively. Beyond assessing prompt-response pairs, we also need new tools that can help us critically analyze user interface features.

    Ian Arawjo (website | Twitter)’s ChainForge (2023) is an “open-source visual programming environment for prompt engineering. With ChainForge, you can evaluate the robustness of prompts and text generation models in a way that goes beyond anecdotal evidence.” [source: chainforge.ai]

    I’d love to see something like ChainForge to help evaluate generative web search platforms (like You.com’s YouChat, Perplexity AI, Phind, Metaphor, or Andi) and web search engines that provide generative search components (Google’s SGE, Brave Search’s Summarizer, etc.).

    ChainForge is pretty slick and I’d love to dig into it further1. I’ve followed Arawjo’s discussion of the tool with interest since the spring (search: Twitter[chainforge (IanArawjo or filter:follows)]), but only just tried using it because (1) I’m generally more interested in the generative search engines (which do not seem to provide API access for this sort of purpose, though maybe I am wrong) and (2) I’m more focused on how people perceive and perform-with tool outputs than on the outputs themselves (so a more meta version of evaluating results). That said, playing with ChainForge had been on my to-do list for a while, and clearly people could use it as a search tool itself (just as one can use chatbots like OpenAI’s ChatGPT)!

    How do I evaluate generative web search platforms now?

    When I’m comparing different generative web search platforms or tools, I’m taking screenshots, copying the text responses, and/or sharing links to the results.2 I’m looking at the generated text, the links, and other elements of the search experience, including speed and UI features.
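    To make those manual steps a bit more concrete, here is a minimal sketch of how one might record each observation in a structured way so platforms can be compared side by side later. The record fields, file name, and helper function are hypothetical choices of mine, not anything the platforms provide.

```python
# A minimal sketch (not a finished tool): one structured record per observed
# search, so screenshots, copied text, and share links can be compared later.
# All field names and the JSONL file path are hypothetical choices of mine.
import json
from dataclasses import asdict, dataclass, field
from datetime import datetime, timezone


@dataclass
class SearchObservation:
    platform: str                 # e.g. "Perplexity AI", "Phind", "Google SGE"
    query: str                    # the query as typed
    response_text: str            # the generated answer, copied as text
    cited_links: list[str]        # links surfaced alongside the answer
    screenshot_path: str          # where the screenshot was saved locally
    share_url: str | None = None  # share link, where the platform offers one
    observed_at: str = field(
        default_factory=lambda: datetime.now(timezone.utc).isoformat()
    )


def append_observations(observations: list[SearchObservation],
                        path: str = "observations.jsonl") -> None:
    """Append one JSON object per line so sessions over time can be diffed."""
    with open(path, "a", encoding="utf-8") as f:
        for obs in observations:
            f.write(json.dumps(asdict(obs)) + "\n")
```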

    It would be great to have tools to make those steps more trustworthy, transparent, and fluid, though still seamful (Eslami et al., 2016). Even without API access, there may be automated scraping that can be done in the public interest, or someone could write software to extract results from screenshots, from shared links to results, or from live search sessions (through browser extensions). I imagine some SEO firms, academic labs, and others are developing tools to provide some of this support.3
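    As one rough sketch of what that scraping could look like, here is a small example assuming a browser-automation library like Playwright. The query-URL pattern, the wait time, and the output file names are my assumptions rather than any platform’s documented interface, and streamed answers may need longer waits or platform-specific selectors to capture fully.

```python
# A minimal sketch of capturing a live generative search page for later review,
# using Playwright (pip install playwright && playwright install chromium).
# The query-URL pattern, wait time, and output paths are assumptions, not a
# documented API; streamed answers may need longer waits to finish rendering.
from datetime import datetime, timezone
from urllib.parse import quote_plus

from playwright.sync_api import sync_playwright


def capture(query: str, base_url: str = "https://www.perplexity.ai/search?q=") -> None:
    stamp = datetime.now(timezone.utc).strftime("%Y%m%dT%H%M%SZ")
    with sync_playwright() as p:
        browser = p.chromium.launch(headless=True)
        page = browser.new_page()
        page.goto(base_url + quote_plus(query), wait_until="networkidle")
        page.wait_for_timeout(10_000)  # crude allowance for the streamed answer
        page.screenshot(path=f"capture-{stamp}.png", full_page=True)  # visual record
        html = page.content()                                         # DOM/text record
        browser.close()
    with open(f"capture-{stamp}.html", "w", encoding="utf-8") as f:
        f.write(html)


if __name__ == "__main__":
    capture("how do generative search engines cite their sources?")
```

    Where a page resists headless automation, the analogous approaches the post mentions, a browser extension recording live sessions or extraction from screenshots and shared result links, would fill the same role.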

    There are definitely some competitive concerns for the companies. That said, we could all benefit from identifying where different search platforms excel in content, speed, and various UI features (including contestability), or where searchers, subjects of searches, and creators of content suffer without knowledge or recourse.4 It would also be very valuable to give the general public some access to something like a ChainForge for search user interfaces.

    What’s next?

    While constructing such a system is currently beyond my skill level, the range of available code generation and search tools tempts me to give it a shot. I’d love to talk with anyone thinking about building or using tools like this.


    If you are building with or evaluating large language models, I encourage you to check out ChainForge, and please suggest similar tools to me.


    Footnotes

    1. I’ve used it only twice now: toying with a question about assigning expert roles in prompts, and looking at (and documenting) how my own prompts for a coding question I had could be better.↩︎

    2. Some search user interfaces do NOT provide a way to share conversations or interactions.↩︎

    3. See, for example, these existing tools and resources: Ronald Robertson (website | Twitter)’s search auditing tools (WebSearcher and suggests) and the HAW Hamburg Search Studies research group’s Result Assessment Tool (RAT); The Markup has built tools like Simple Search (mentioned here); and there are companies like SerpAPI and Serper.↩︎

    4. This isn’t an appeal for mandated disclosure, although perhaps a subset of particular public-interest results should be regularly monitored and published. See Dave Guarino (website | Twitter; “the founding engineer (and then Director) of GetCalFresh.org at Code for America”)’s comments on this more broadly: “We really need to talk more about monitoring search quality for public interest topics.”↩︎

    References

    Arawjo, I., Vaithilingam, P., Swoopes, C., Wattenberg, M., & Glassman, E. (2023). ChainForge. https://www.chainforge.ai/

    Eslami, M., Karahalios, K., Sandvig, C., Vaccaro, K., Rickman, A., Hamilton, K., & Kirlik, A. (2016). First I "like" it, then I hide it: Folk theories of social feeds. Proceedings of the 2016 CHI Conference on Human Factors in Computing Systems, 2371–2382. https://doi.org/10.1145/2858036.2858494

    Segura, S., Towey, D., Zhou, Z. Q., & Chen, T. Y. (2020). Metamorphic testing: Testing the untestable. IEEE Software, 37(3), 46–53. https://doi.org/10.1109/MS.2018.2875968