The LLM benchmark we need:
ChatGPT-like website that always shows two responses, generated by any two of N different models (user can’t see which).
The user has to select the better response in order to keep using the chat (it’s otherwise free).
Leaderboard will be decisive.
I quote-tweeted the above on Dec 10th; the thread is pasted, with some edits, below. The implicit social-search request and the responses were a great reminder of the work from LMSYS.org (website | twitter | Zheng et al. (2023)). I have marked the most substantive edits.
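As a concrete anchor for the idea, here is a minimal sketch of how blind pairwise votes could drive such a leaderboard, in the spirit of Chatbot Arena's Elo-style ratings (Zheng et al., 2023). The K-factor, starting rating, and function names are my own illustrative assumptions, not anyone's actual implementation:

```typescript
// Elo-style ratings from blind pairwise votes (sketch).

type Vote = { winner: string; loser: string };

const K = 32;          // update step size (assumed; real systems tune this)
const INITIAL = 1000;  // starting rating for every model (assumed)

function eloUpdate(ratings: Map<string, number>, vote: Vote): void {
  const rw = ratings.get(vote.winner) ?? INITIAL;
  const rl = ratings.get(vote.loser) ?? INITIAL;
  // Expected score of the winner under the logistic Elo model.
  const expectedWin = 1 / (1 + 10 ** ((rl - rw) / 400));
  ratings.set(vote.winner, rw + K * (1 - expectedWin));
  ratings.set(vote.loser, rl - K * (1 - expectedWin));
}

// Each turn: sample two distinct models uniformly at random; their
// responses are shown unlabeled, and the user's pick becomes a Vote.
function samplePair(models: string[]): [string, string] {
  const i = Math.floor(Math.random() * models.length);
  let j = Math.floor(Math.random() * (models.length - 1));
  if (j >= i) j += 1; // skip index i so the two models are distinct
  return [models[i], models[j]];
}
```

Votes accumulate into the ratings map, and sorting it gives the leaderboard.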
It would be great for generative search tools to support this interaction as an option, even if only to help users gauge which models they prefer for different queries. Cc: Hugging Face, Perplexity AI, Phind Search, You.com.
For instance, Perplexity AI currently lets Pro users choose whether to use Copilot, and in the settings they can choose between different “AI Models”.
Users can also click a Rewrite button below a response to regenerate it (choosing amongst those models).
Relatedly, You.com lets users toggle GPT-4 mode, and also choose the “Safe search: Off (uncensored chat)” option to use Zephyr 7B Alpha, an instruct fine-tune of Mistral 7B.
Why not let users sometimes do masked multi-searching (perhaps counting against their quota) to see which model they actually prefer?
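A hypothetical sketch of what that masked turn could look like server-side, reusing `samplePair` from above (the `queryModel` stub and the quota accounting are my assumptions, not any provider's actual API):

```typescript
// Stub standing in for a real per-model completion call (assumption).
async function queryModel(model: string, query: string): Promise<string> {
  return `[${model}] response to: ${query}`;
}

interface User { quotaRemaining: number }

// One masked multi-search turn: charge the quota for two calls, query a
// random pair of models, and shuffle display order so position doesn't
// leak model identity.
async function maskedMultiSearch(
  user: User,
  query: string,
  models: string[],
): Promise<{ responses: [string, string]; hiddenModels: [string, string] } | null> {
  if (user.quotaRemaining < 2) return null; // a blind pair costs two calls
  user.quotaRemaining -= 2;
  const [a, b] = samplePair(models);
  const [ra, rb] = await Promise.all([queryModel(a, query), queryModel(b, query)]);
  const flip = Math.random() < 0.5;
  return flip
    ? { responses: [rb, ra], hiddenModels: [b, a] }
    : { responses: [ra, rb], hiddenModels: [a, b] };
}
```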
Do other generative search (or chatbot + search) systems allow users to choose between different models?
Hugging Face Chat (which has a web search option) allows users to choose between several models.
I should see which of these is currently most amenable to a userscript or browser extension that switches models silently.
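The rough shape of such a userscript, written here in TypeScript for illustration (the `@match` URL is Hugging Face Chat's, but the DOM selector is a placeholder I made up; the real page would need inspecting, and it will change over time):

```typescript
// ==UserScript==
// @name   silent-model-switcher (sketch)
// @match  https://huggingface.co/chat/*
// @grant  none
// ==/UserScript==

// On each page load, pick one of the available models at random so the
// user doesn't know which model is answering.
(function switchModelSilently(): void {
  // Placeholder selector for the model-picker entries (hypothetical).
  const options = document.querySelectorAll<HTMLElement>('[data-model-option]');
  if (options.length === 0) return; // picker not found; the DOM has changed
  const pick = options[Math.floor(Math.random() * options.length)];
  pick.click(); // select the randomly chosen model
})();
```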
Or maybe this already exists too? Or LMSYS.org plans to build [something more expansive that supports comparisons across search systems rather than model-choice within the systems]?
If the systems don't directly provide access to their responses, one could also mix and match sources of web search results, as well as reprompting/autoprompting, reranking, RAG-ish architectures, related-query pipelines/models, etc.
Zheng, L., Chiang, W.-L., Sheng, Y., Zhuang, S., Wu, Z., Zhuang, Y., Lin, Z., Li, Z., Li, D., Xing, E. P., Zhang, H., Gonzalez, J. E., & Stoica, I. (2023). Judging LLM-as-a-Judge with MT-Bench and Chatbot Arena. http://arxiv.org/abs/2306.05685