Towards “benchmarking” democratization of good search.

    draft

    to be developed further… Feedback welcome!

    The search-like uses of ChatGPT a year ago ignited many imaginations about what search could look like and prompted developers to start or accelerate projects applying generative AI to web search. Beyond OpenAI, the most-used applications of generative AI in search, whether embedded in web search tools or offered as special-purpose chatbots with Internet browsing, come from Google and Microsoft. Despite initial excitement about the potential for change, the dominance of a single search engine appears persistent.

    While Google’s maintenance of its dominant position and its efforts to ensure continued social license to operate have advanced research and art in search, the lack of authentic competition gives search users (including searchers, audience seekers, and subjects of searches) little voice or choice. We risk a continued and even increased “concentration of authority” (Solaiman et al., 2023), particularly in regard to the norms and values that shape relevance ranking and the interactions supported by the search tools.

    Thankfully, research continues into the underlying technologies, from new open-source models to new information retrieval techniques for use in generation, and into how these systems are used. Several companies are also engaged in innovative disruption, attempting to identify how to design interfaces and generate responses that best meet user needs, in ways they hope eventually to sustain profitably. Despite this apparent competition at the edges, users still lack sufficient tools and access to effectively evaluate different search systems, provide feedback, and repair or extend them. While there is research on user audits of search, it has not been directed towards encouraging open competition or differentiation among new generative search tools.

    Generative web search systems are designed to wrap around different base models. Systems like You.com or Perplexity AI are often built atop OpenAI’s APIs (sometimes through Microsoft) but have added support for other proprietary and open-source models and have fine-tuned their own. They are also built atop various sources of web data: the Bing API, third-party APIs that provide search results from various search engines (though principally Google), and their own crawlers. Some of these systems provide their own APIs in turn, serving inference, search results and responses, or parsed webpage content. These existing systems have different optimizations, interactions, and complements.
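    To make the wrapping concrete, here is a minimal sketch in Python of that architecture: a thin system class composing an interchangeable retrieval backend (a commercial search API, a third-party results provider, or a home-grown crawler index) with an interchangeable generation backend (a hosted, open-source, or fine-tuned model). All names and stub behaviors here are hypothetical illustrations, not any vendor’s actual API.

```python
# Hypothetical sketch of a generative web search "wrapper" system.
# No real vendor APIs are used; the stubs stand in for them.
from dataclasses import dataclass
from typing import Protocol


@dataclass
class WebResult:
    url: str
    snippet: str


class RetrievalBackend(Protocol):
    """Anything that can return web results for a query."""
    def search(self, query: str, k: int) -> list[WebResult]: ...


class GenerationBackend(Protocol):
    """Anything that can turn a prompt into a response."""
    def generate(self, prompt: str) -> str: ...


class StubSearchAPI:
    """Stand-in for a search API, third-party results provider, or crawler index."""
    def search(self, query: str, k: int) -> list[WebResult]:
        return [WebResult(f"https://example.org/{i}", f"snippet about {query}")
                for i in range(k)]


class StubModel:
    """Stand-in for a hosted, open-source, or fine-tuned base model."""
    def generate(self, prompt: str) -> str:
        return f"[generated answer grounded in]\n{prompt}"


class GenerativeSearchSystem:
    """Thin wrapper: retrieve web results, then prompt a model with them."""
    def __init__(self, retriever: RetrievalBackend, generator: GenerationBackend):
        self.retriever = retriever
        self.generator = generator

    def answer(self, query: str, k: int = 3) -> str:
        results = self.retriever.search(query, k)
        context = "\n".join(f"- {r.url}: {r.snippet}" for r in results)
        return self.generator.generate(f"Question: {query}\nSources:\n{context}")


if __name__ == "__main__":
    system = GenerativeSearchSystem(StubSearchAPI(), StubModel())
    print(system.answer("who audits web search?"))
```

    In a design like this, swapping the stubs for adapters to real providers is the only change required, which is what lets these systems move among base models and web data sources.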

    We can advance research and design in this area by evaluating factors that shape users’ ability to use, make sense of, and critique these systems. We can then apply these factors to benchmark not only the outputs from models but also the broader searcher experience. These factors may include the ability to save and share searches, an API that facilitates evaluation at scale, effective and open feedback mechanisms, means of exploring the systems in their contexts of use and in comparison with one another, and the ability to repair and extend them. This can start with a framework for evaluating these systems, followed by tools that advance the evaluation of these factors, support users’ evaluations in their own contexts, and refine the factors themselves.
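    As one illustration of what benchmarking the broader searcher experience (rather than only model outputs) could look like, here is a small sketch that scores systems on the experience factors listed above. The systems, factor checks, and values are hypothetical placeholders for what a real framework would gather through audits, APIs, and user studies.

```python
# Hypothetical sketch: scoring search systems on searcher-experience factors
# rather than on model outputs alone. All data below is placeholder.
from dataclasses import dataclass


@dataclass
class SystemUnderTest:
    name: str
    # Each check answers: does this system support the factor? (True/False)
    checks: dict[str, bool]


FACTORS = [
    "saveable_and_shareable_searches",
    "public_api_for_scaled_evaluation",
    "open_feedback_mechanism",
    "supports_in_context_comparison",
    "user_repair_or_extension",
]


def score(system: SystemUnderTest) -> float:
    """Fraction of experience factors the system supports."""
    return sum(system.checks.get(f, False) for f in FACTORS) / len(FACTORS)


if __name__ == "__main__":
    systems = [
        SystemUnderTest("system_a", {
            "saveable_and_shareable_searches": True,
            "public_api_for_scaled_evaluation": True,
            "open_feedback_mechanism": False,
            "supports_in_context_comparison": False,
            "user_repair_or_extension": False,
        }),
        SystemUnderTest("system_b", {
            "saveable_and_shareable_searches": True,
            "public_api_for_scaled_evaluation": False,
            "open_feedback_mechanism": True,
            "supports_in_context_comparison": True,
            "user_repair_or_extension": False,
        }),
    ]
    for s in sorted(systems, key=score, reverse=True):
        print(f"{s.name}: {score(s):.0%} of experience factors supported")
```

    A real framework would replace the boolean checks with evidence gathered from the systems themselves (shared-search permalinks, documented APIs, feedback channels) and would weight the factors according to users’ own contexts.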

    So, what organizations are best situated to apply their own resources and promote developer attention and energy to this problem and opportunity?

    Next…

    Read a continuation in “Who is going to try to make web search better?”


    HT

    The initial draft of this was written in response to a generative question from Hugging Face (below). The terminology of benchmarking is adapted from the work of Hugging Face, and the title above is a search-focused emendation of what is listed as the company’s mission: “Our mission is to democratize good machine learning.”

    Please identify a policy position that you see as currently under-prioritized and would like to advocate for, and write a short recommendation to policymakers on how to prioritize it better.

    References

    Solaiman, I., Talat, Z., Agnew, W., Ahmad, L., Baker, D., Blodgett, S. L., Daumé, H., III, Dodge, J., Evans, E., Hooker, S., Jernite, Y., Luccioni, A. S., Lusoli, A., Mitchell, M., Newman, J., Png, M.-T., Strait, A., & Vassilev, A. (2023). Evaluating the social impact of generative AI systems in systems and society. https://doi.org/10.48550/arXiv.2306.05949