Evaluation Guide
Run structured evaluations to confirm your agent calls the right tools, produces quality responses, and handles edge cases. Under the hood, agents-cli eval run uses the ADK eval CLI to run evaluations with LLM-as-judge scoring.
Run Your First Evaluation
Your project includes a default eval set at tests/eval/evalsets/basic.evalset.json and scoring criteria at tests/eval/eval_config.json. Run it:
agents-cli eval run
The output shows scores for each eval case against the configured rubrics. A score above the threshold (default: 0.8) passes.
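The pass/fail rule can be sketched in a few lines of Python. This is only an illustration of the threshold comparison, not the CLI's implementation: the function name, the case names, and the scores are hypothetical, and treating a score exactly equal to the threshold as a pass is an assumption.

```python
from typing import Dict

DEFAULT_THRESHOLD = 0.8  # default threshold stated in this guide


def summarize(scores: Dict[str, float],
              threshold: float = DEFAULT_THRESHOLD) -> Dict[str, bool]:
    """Map each eval case name to True (pass) or False (fail)."""
    # Assumption: a score equal to the threshold also passes; the guide
    # itself only says that scores above the threshold pass.
    return {case: score >= threshold for case, score in scores.items()}


# Hypothetical case names and scores, for illustration only.
example_scores = {"greets_user": 0.92, "handles_empty_input": 0.55}
results = summarize(example_scores)
for case, passed in results.items():
    print(f"{case}: {'PASS' if passed else 'FAIL'}")
```

Lowering or raising `threshold` changes which cases count as passing, which is roughly what adjusting the criteria in tests/eval/eval_config.json is meant to do.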
# Run a specific eval set
agents-cli eval run --evalset tests/eval/evalsets/custom.evalset.json
# Run all eval sets in the project
agents-cli eval run --all
Writing Eval Cases and Choosing Metrics
For full documentation on eval case schemas, available metrics, rubric writing, and multi-turn testing, see the ADK Evaluation Guide.
Quick reference for choosing metrics:
- Agents with custom function tools: use tool_trajectory_avg_score + rubric_based_final_response_quality_v1.
- Agents with google_search or model-internal tools: use only rubric_based_final_response_quality_v1 (model-internal tools don't appear in the trajectory).
- RAG agents: use rubric_based_final_response_quality_v1 + hallucinations_v1.
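The quick reference above can be captured as a small lookup table, for example to keep metric choices consistent across several agents. The metric names come from this guide. The agent-kind keys and the `metrics_for` helper are hypothetical labels introduced for this sketch, and nothing here reflects the actual eval_config.json schema.

```python
from typing import Dict, List

TRAJECTORY = "tool_trajectory_avg_score"
RESPONSE_QUALITY = "rubric_based_final_response_quality_v1"
HALLUCINATIONS = "hallucinations_v1"

# Hypothetical agent-kind labels mapped to the metrics recommended above.
METRICS_BY_AGENT_KIND: Dict[str, List[str]] = {
    "custom_function_tools": [TRAJECTORY, RESPONSE_QUALITY],
    # Model-internal tools such as google_search don't appear in the
    # trajectory, so the trajectory metric is left out.
    "model_internal_tools": [RESPONSE_QUALITY],
    "rag": [RESPONSE_QUALITY, HALLUCINATIONS],
}


def metrics_for(agent_kind: str) -> List[str]:
    """Return a copy of the recommended metric names for an agent kind."""
    try:
        return list(METRICS_BY_AGENT_KIND[agent_kind])
    except KeyError:
        known = ", ".join(sorted(METRICS_BY_AGENT_KIND))
        raise ValueError(
            f"unknown agent kind {agent_kind!r}; expected one of: {known}"
        ) from None


print(metrics_for("rag"))
```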
The Eval-Fix Loop
Evaluation is iterative. Expect 5-10+ cycles before your agent consistently passes.
- Write 1-2 core eval cases covering the most important behavior.
- Run: agents-cli eval run
- Read the results: which cases failed and why.
- Fix: adjust the agent's instruction, tools, or logic.
- Re-run: agents-cli eval run
- Expand: once core cases pass, add edge cases and new scenarios.
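The run, fix, and re-run steps could be driven from a small Python wrapper like the one below. It is a sketch under one loudly stated assumption: that `agents-cli eval run` exits with status 0 only when every eval case passes, which this guide does not confirm. The helper names `eval_passes` and `eval_fix_loop` are hypothetical.

```python
import subprocess
from typing import Callable, Sequence

EVAL_COMMAND = ("agents-cli", "eval", "run")


def eval_passes(
    command: Sequence[str] = EVAL_COMMAND,
    runner: Callable[..., subprocess.CompletedProcess] = subprocess.run,
) -> bool:
    """Run the eval command once and report whether it succeeded."""
    # Assumption: exit status 0 means every eval case passed.
    completed = runner(list(command), check=False)
    return completed.returncode == 0


def eval_fix_loop(
    check: Callable[[], bool],
    wait_for_fix: Callable[[int], object],
    max_cycles: int = 10,
) -> int:
    """Return the cycle on which the eval first passed, or 0 if it never did."""
    for cycle in range(1, max_cycles + 1):
        if check():
            return cycle
        # Hand control back to the developer to adjust the agent's
        # instruction, tools, or logic before the next run.
        wait_for_fix(cycle)
    return 0
```

For interactive use, something like `eval_fix_loop(eval_passes, lambda n: input(f"Cycle {n} failed. Fix the agent, then press Enter: "))` pauses after each failing run. Passing `check` and `wait_for_fix` as parameters keeps the loop testable without the CLI installed.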
For full documentation on eval case schemas, metrics, rubrics, and user simulation, see the ADK Evaluation Guide.