I was asked why we never published metrics comparing the pplx-chat and pplx-online LLMs against GPT-4 and only compared to GPT-3.5-turbo. Just like everyone else today, we are far from GPT-4's capabilities, even on the narrow task of answering questions accurately with search grounding. Product human evals are what matter to us, not academic evals that can be gamed. Our own data flywheel and better base models are necessary ingredients to getting there. Mistral and Meta are doing incredible work to help the community get closer. But at the same time, it's important to acknowledge OpenAI's tremendous work on GPT-4.