Evaluation Guide¶
Run structured evaluations to confirm your agent calls the right tools, produces quality responses, and handles edge cases. Under the hood, evaluation uses the Gemini Enterprise Agent Platform GenAI Eval SDK to grade evaluations.
Upgrading from an older agents-cli?
If your project still has tests/eval/evalsets/*.evalset.json files from a previous version, see Migrating Eval Datasets for the new format.
Run Your First Evaluation¶
Your project includes a default dataset at tests/eval/datasets/basic-dataset.json and metrics configuration tests/eval/eval_config.yaml. Run it:
The output shows scores for each eval case against the configured metrics.
# Run for a custom dataset and different metrics
agents-cli eval generate --dataset tests/eval/datasets/custom-dataset.json --output custom_traces/
agents-cli eval grade --metrics general_quality --traces custom_traces/
Writing Eval Cases and Choosing Metrics¶
For full documentation on eval case schemas and available metrics, see the Gemini Enterprise Agent Platform Evaluation documentation.
Available Metrics Reference¶
You can choose from a wide range of built-in metrics depending on your agent's capabilities and the task at hand. To see the full list of available metrics, run:
Common metrics at a glance¶
A short reference for the most-used built-in metric IDs. Use agents-cli eval metric list for the full set with descriptions.
| Metric ID | What it grades |
|---|---|
general_quality |
Overall response quality with auto-generated content-based criteria. Recommended starting point for non-agent eval. |
text_quality |
Linguistic aspects: fluency, coherence, grammar. |
instruction_following |
How well the response adheres to specific constraints and instructions. |
tool_use_quality |
Tool selection, parameter accuracy, and step sequence correctness (single-turn). |
multi_turn_tool_use_quality |
Technical and semantic correctness of tool calls across a multi-turn conversation. |
multi_turn_trajectory_quality |
Sequential logic, efficiency, and error-recovery robustness across turns. |
multi_turn_task_success |
Whether the user's goal was fulfilled across the full multi-turn conversation. |
final_response_quality |
Comprehensive evaluation of the final response and intermediate tool usage. |
final_response_reference_free |
Final-response quality without a reference answer (requires custom rubrics). |
final_response_match |
Compares the agent's final response to a provided golden reference answer. |
hallucination |
Segments the response into atomic claims and verifies each against tool-returned context. |
grounding |
Factuality and consistency against provided context. |
safety |
Compliance against safety policies (PII, hate speech, dangerous content, harassment, sexual). |
Evaluation Configuration (eval_config.yaml)¶
The eval_config.yaml file specifies the metrics to run and defines custom metrics for grading evaluations.
metrics_to_run:
- response_under_500_chars
custom_metrics:
- name: response_under_500_chars
custom_function: |
def evaluate(instance: dict) -> dict:
response = instance.get("response") or {}
text = "".join(
p.get("text", "") for p in (response.get("parts") or []) if p.get("text")
)
passed = len(text) <= 500
return {
"score": 1.0 if passed else 0.0,
"explanation": f"Final response is {len(text)} chars (limit 500).",
}
- name: response_quality_rubric
prompt_template: |
Rate the agent's response 1-5 for helpfulness and accuracy.
Prompt: {prompt}
Final response: {response}
Full trace (for tool-call and reasoning context): {agent_data}
Return JSON: {"score": <1|2|3|4|5>, "explanation": "<reason>"}
judge_model: gemini-flash-latest
judge_model_sampling_count: 3
Each custom metric must conform to either the Code Execution Metric or LLM-as-a-Judge Metric (LLMMetric) schema:
- Code Execution Metric: Used to run custom Python code for evaluation. Must have a name and a custom_function (containing a def evaluate(instance): signature). By default, the function executes locally in the CLI process — no GCP project or region is required, but the user-supplied code runs with the CLI's privileges. Add "execution": "remote" to opt into Vertex AI's sandboxed CodeExecutionMetric (server-side), which requires a configured GCP project + region.
- LLM-as-a-Judge Metric: Used to evaluate responses using an LLM judge. Must have a name and a prompt_template. Optional fields include rubric_group_name, judge_model (e.g., gemini-flash-latest), and judge_model_sampling_count (between 1 and 32).
Quick Reference for Common Scenarios¶
- Agents with custom function tools — Use
tool_use_quality(for single-turn) ormulti_turn_tool_use_quality+multi_turn_trajectory_quality(for multi-turn). - RAG agents — Use
grounding+hallucination+safety. - Conversational assistants — Use
general_qualityormulti_turn_general_quality. - Goal-oriented agents — Use
multi_turn_task_success.
The Eval-Fix Loop¶
Evaluation is iterative. Expect 5-10+ cycles before your agent consistently passes.
- Write 1-2 core eval cases covering the most important behavior.
- Run:
agents-cli eval generatefollowed byagents-cli eval grade - Read the results — which cases failed and why.
- Fix — adjust the agent's instruction, tools, or logic.
- Re-run:
agents-cli eval generateandagents-cli eval grade - Expand — once core cases pass, add edge cases and new scenarios.
Beyond generate and grade¶
generate and grade form the inner loop, but the eval surface has a few more commands worth knowing about. Each is a separate step you reach for as your eval setup matures.
agents-cli eval dataset synthesize¶
Bootstraps a dataset by inspecting your local ADK agent and generating multi-turn conversation scenarios for it — no input file required. Useful for cold-starting evaluation on a new agent or expanding coverage without writing every case by hand. Each generated case includes a starting user message, a conversation plan, and the full agent trace produced by playing the scenario out against an LLM-backed user simulator.
Steer what gets generated with --instruction (e.g. "Scenarios where the user changes their mind") and --environment-context (e.g. "Today is Monday. Flights to Paris are available."). The output is a regular *-dataset.json file you can edit, commit, and feed back into eval grade directly (the trace is already populated, so you can skip eval generate).
agents-cli eval compare¶
Compares two grade results side by side so you can see whether a change actually improved things.
A typical use is comparing a "before fix" run against an "after fix" run during the eval-fix loop.
agents-cli eval analyze¶
Clusters failure modes from a grade-results file into themes, so you can see what kinds of things are going wrong instead of skimming individual cases.
agents-cli eval metric list¶
Prints every built-in metric the SDK supports, with a short description for each. The starting point when you want to know what's available beyond the common metrics table above.
agents-cli eval optimize¶
Once your evals are in place, eval optimize uses them to automatically tune your agent's prompts.
A run takes anywhere from a few minutes to several hours depending on dataset size and metric complexity, so it's not something to run over and over. Reach for it after simpler approaches (rewriting the prompt yourself, adjusting metrics, fixing failing cases by hand) have run their course.
For full documentation on eval case schemas, metrics, and user simulation, see the Gemini Enterprise Agent Platform Evaluation documentation.