“responsibility to test for real world applicability”

    December 11th, 2023
    @omarsar0 via Twitter on Dec 11, 2023

    The claim is that Mistral’s new models match or outperform GPT-3.5.

    Beyond the standard benchmarks, are there any good examples or use cases where we are seeing this to be the case?

    It’s hard to tell which models are better without actual use cases. There are a lot of unreliable examples out there, but it’s important to measure capabilities and limitations in the context of real-world applications.

    Unfortunately, benchmarks are just not enough anymore. I think LLM providers should have the additional responsibility to test for real-world applicability as well and to educate the community about it.

    We are working on this @dair_ai and will have something to share soon.

    It remains to be seen how good these models really are beyond the anecdotal examples. We hope to change this.

    More on this effort soon!