Evaluating LLMs with Ragas
This document provides a comprehensive guide to enabling, using, configuring, and extending Ragas (RAG Assessment) within the Litmus framework for evaluating LLM responses.
What is Ragas?
Ragas is a Python library designed for evaluating the quality and factuality of responses generated by LLMs in question-answering tasks. It provides a suite of metrics that assess various aspects of LLM responses, including:
- Answer Relevancy: Measures how well the answer addresses the question.
- Context Recall: Measures how well the retrieved context covers the information needed to produce the reference (ground-truth) answer.
- Context Precision: Measures how much of the retrieved context is actually relevant to the question, i.e., the signal-to-noise ratio of retrieval.
- Harmfulness: Checks for potentially harmful or inappropriate content in the answer.
- Answer Similarity: Measures the similarity between the generated answer and a reference answer (e.g., a human-written answer).
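The snippet below is a minimal, standalone sketch of how these metrics are typically computed with Ragas outside of Litmus, shown only to illustrate what each score measures. It assumes the Ragas 0.1-style API and Hugging Face `datasets`, and the sample question, answer, and context are made up.

```python
# Illustrative only: a standalone Ragas run using the 0.1-style API.
# Litmus performs an equivalent evaluation inside its worker service.
# Requires an LLM/embeddings backend for Ragas (by default the OpenAI API,
# i.e. OPENAI_API_KEY must be set); the sample data below is invented.
from datasets import Dataset
from ragas import evaluate
from ragas.metrics import (
    answer_relevancy,
    context_recall,
    context_precision,
    answer_similarity,
)
from ragas.metrics.critique import harmfulness

dataset = Dataset.from_dict({
    "question": ["What does Litmus do?"],
    "answer": ["Litmus runs test cases against an LLM and evaluates the responses."],
    "contexts": [[
        "Litmus is a testing framework that submits test cases to an LLM "
        "application and assesses the quality of its responses."
    ]],
    "ground_truth": ["Litmus is a framework for testing and evaluating LLM applications."],
})

result = evaluate(
    dataset,
    metrics=[answer_relevancy, context_recall, context_precision,
             harmfulness, answer_similarity],
)
print(result)  # e.g. {'answer_relevancy': 0.98, 'context_recall': 1.0, ...}
```

The scores range from 0 to 1. For the critique-style harmfulness metric, 0 indicates no harmful content was detected; for the other metrics, higher is better.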
Enabling Ragas in Litmus
By default, Ragas evaluation is disabled in Litmus. To enable it, you need to modify your test templates:
- Edit your test template: In the Litmus UI, navigate to the "Templates" page and click the "Edit" button next to the template you want to modify.
- Enable Ragas in the "LLM Evaluation Prompt" tab: Check the checkbox for Ragas.
- Save your template: Click the "Update Template" button to save your changes.
Using Ragas
Once Ragas is enabled, Litmus will automatically use it to evaluate LLM responses for test runs that use the modified template. The results are embedded within the `assessment` field of the test case:
{
  "status": "Passed",
  "response": {
    "output": "This is the answer"
  },
  "assessment": {
    "ragas_evaluation": {
      "answer_relevancy": 1.0,
      "context_recall": 1.0,
      "context_precision": 1.0,
      "harmfulness": 0.0,
      "answer_similarity": 1.0
    }
  }
}
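As an illustration (not part of Litmus itself), the following sketch shows one way a downstream script could consume this assessment payload, for example after fetching test results from Litmus; the threshold values are arbitrary assumptions.

```python
# Illustrative helper (not part of Litmus) for consuming the assessment
# payload above. The pass/fail thresholds are assumptions; choose values
# that fit your use case.
PASS_THRESHOLDS = {
    "answer_relevancy": 0.7,
    "context_recall": 0.7,
    "context_precision": 0.7,
    "answer_similarity": 0.7,
}

def ragas_passed(assessment: dict) -> bool:
    """Return True if every Ragas score clears its threshold and no harm is flagged."""
    scores = assessment.get("ragas_evaluation", {})
    if scores.get("harmfulness", 0.0) > 0.0:
        return False
    return all(scores.get(name, 0.0) >= threshold
               for name, threshold in PASS_THRESHOLDS.items())

test_case_assessment = {
    "ragas_evaluation": {
        "answer_relevancy": 1.0,
        "context_recall": 1.0,
        "context_precision": 1.0,
        "harmfulness": 0.0,
        "answer_similarity": 1.0,
    }
}
print(ragas_passed(test_case_assessment))  # True
```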
Configuring and Extending Ragas
Currently, Litmus utilizes a predefined set of Ragas metrics, including answer relevancy, context recall, context precision, harmfulness, and answer similarity. Extending this set or adjusting metric thresholds would require code modifications within the worker service.
For instance:
- Adding new metrics: To include additional Ragas metrics like `coherence` or `factuality`, you would need to modify the `ragas_metrics` list in the `ragas_eval.py` file within the worker service code (see the sketch after this list).
- Adjusting thresholds: To modify the default thresholds for determining pass/fail, you would need to adjust the metric objects within the `ragas_eval.py` file.
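As a rough illustration, and assuming the worker's `ragas_eval.py` defines its metric list along these lines (the actual file may be structured differently), adding a coherence critique could look like:

```python
# Hypothetical sketch of the metric list in the worker's ragas_eval.py;
# the real module may differ, so adapt this to the actual code.
from ragas.metrics import (
    answer_relevancy,
    context_recall,
    context_precision,
    answer_similarity,
)
from ragas.metrics.critique import harmfulness, coherence  # import the new critique metric

ragas_metrics = [
    answer_relevancy,
    context_recall,
    context_precision,
    harmfulness,
    answer_similarity,
    coherence,  # newly added metric; its score would appear under "coherence"
]
```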
Note: These modifications involve code changes and require rebuilding and redeploying the worker Docker image.
Important Notes
- Ragas evaluation relies on the availability of context along with the question and answer. Ensure that your test cases and templates provide appropriate context for meaningful Ragas assessments.
- Consider the limitations and potential biases of the underlying LLM used for both generating responses and performing Ragas evaluations.
- Regularly review and update your evaluation strategies and metrics as your LLM applications evolve and new Ragas features become available.