ragas metrics

github.com/explodinggradients/ragas:

Ragas measures your pipeline’s performance against different dimensions:

Faithfulness: measures the information consistency of the generated answer against the given context. Any claim made in the answer that cannot be deduced from the context is penalized.
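
To make the idea concrete, here is a toy sketch of a faithfulness-style score: the share of answer claims that the context supports. It is not the ragas implementation. The function names, the substring-based `naive_is_supported` judge (a stand-in for an LLM or entailment check), and the choice to return 1.0 when there are no claims are all my assumptions.

```python
def faithfulness_score(claims, context, is_supported):
    """Fraction of answer claims supported by the context.

    Assumption: an answer with no claims scores 1.0 (nothing unsupported).
    """
    if not claims:
        return 1.0
    supported = sum(1 for claim in claims if is_supported(claim, context))
    return supported / len(claims)


def naive_is_supported(claim, context):
    # Hypothetical judge: case-insensitive substring match.
    # A real pipeline would ask an LLM or NLI model instead.
    return claim.lower() in context.lower()


context = "Paris is the capital of France. It lies on the Seine."
claims = [
    "Paris is the capital of France",   # found in the context
    "Paris has 10 million residents",   # not found in the context
]
score = faithfulness_score(claims, context, naive_is_supported)
print(score)
```

One of the two claims is unsupported, so the score drops to one half; every further unsupported claim pulls it lower.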

Context Relevancy: measures how relevant retrieved contexts are to the question. Ideally, the context should only contain information necessary to answer the question. The presence of redundant information in the context is penalized.
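
A rough way to picture this: count how many retrieved sentences are actually needed for the question and divide by the total. The sketch below does that with a crude keyword-overlap judge. The stopword list, `naive_is_relevant`, and the ratio itself are illustrative assumptions of mine, not code from ragas.

```python
STOPWORDS = {"a", "an", "the", "is", "are", "of", "what", "in", "on"}


def content_words(text):
    # Lowercase, strip simple punctuation, drop stopwords.
    words = (w.strip(".,?!").lower() for w in text.split())
    return {w for w in words if w and w not in STOPWORDS}


def naive_is_relevant(sentence, question):
    # Hypothetical judge: any shared content word counts as relevant.
    return bool(content_words(sentence) & content_words(question))


def context_relevancy(sentences, question, is_relevant):
    """Fraction of retrieved sentences judged relevant to the question."""
    if not sentences:
        return 0.0  # assumption: empty context is treated as irrelevant
    relevant = sum(1 for s in sentences if is_relevant(s, question))
    return relevant / len(sentences)


question = "What is the capital of France?"
sentences = [
    "Paris is the capital of France.",
    "The Seine is a river.",
    "Croissants are popular.",
]
relevancy = context_relevancy(sentences, question, naive_is_relevant)
print(round(relevancy, 2))
```

Only one of the three sentences helps answer the question, so the two redundant ones drag the score down to about a third.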

Context Recall: measures the recall of the retrieved context using the annotated answer as ground truth. The annotated answer is taken as a proxy for the ground truth context.

Answer Relevancy: refers to the degree to which a response directly addresses and is appropriate for a given question or context. This does not take the factuality of the answer into consideration but rather penalizes the presence of redundant information or incomplete answers given a question.

Aspect Critiques: Designed to judge the submission against defined aspects like harmlessness, correctness, etc. You can also define your own aspect and validate the submission against your desired aspect. The output of aspect critiques is always binary.
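
The "define your own aspect, get a binary verdict" shape can be sketched as below. This is only an illustration of the pattern: the `Aspect` class, the `critique` function, and the word-count judge are hypothetical names of mine, and a real setup would replace the judge with an LLM call.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class Aspect:
    name: str
    definition: str
    # Hypothetical stand-in for an LLM verdict on the submission.
    judge: Callable[[str], bool]


def critique(submission: str, aspect: Aspect) -> int:
    """Return 1 if the submission satisfies the aspect, else 0."""
    return 1 if aspect.judge(submission) else 0


# A custom aspect, defined by the user rather than built in.
conciseness = Aspect(
    name="conciseness",
    definition="Does the submission stay under 20 words?",
    judge=lambda text: len(text.split()) < 20,
)

verdict = critique("Paris is the capital of France.", conciseness)
print(verdict)
```

Whatever the aspect, the result is a pass or fail, never a graded score.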

ragas is mentioned in SearchRights.org & in LLM frameworks
HT: Aaron Tay
I looked back at my comments on the OWASP . One concern I had there was:

“inadequate informing” (wc?), where the information generated is accurate but inadequate given the situation-and-user.

It doesn’t seem that these metrics directly engage with that, though aspect critiques could include it. I think this concern plays more into what the ‘ground truth context’ is and how flexible these pipelines are for wildly different users asking the same strings of questions but hoping for and needing different responses. Perhaps I’m pondering something more like old-fashioned user relevance, which may be new and hot again with generated responses.