Test Mission Template
The "Test Mission" template in Litmus provides a framework for evaluating your AI model's performance in multi-turn interactions, simulating a more realistic conversational flow or task-oriented dialogue. Unlike "Test Runs," which treat test cases as isolated units, "Test Missions" use the AI's responses from previous turns to guide subsequent interactions.
Structure
A "Test Mission" template is structured similarly to a "Test Run" template but with key differences:
- Template ID: A unique identifier for your template.
- Mission Data: An array of missions, each containing:
  - Mission: A description of the overall task or goal the AI should achieve through the conversation.
  - Mission Result: The expected outcome or state at the end of a successful mission. This is used as a golden response to evaluate the entire conversation using an LLM.
  - Filter (optional): Comma-separated keywords or categories to organize and filter missions.
  - Source (optional): A source identifier for the mission.
  - Block (optional): A boolean (true/false) indicating whether this mission should be excluded from the run.
  - Category (optional): A category or label for the mission.
- Mission Duration: An integer specifying the maximum number of interaction turns allowed for each mission.
- Request Payload: A JSON object defining the structure of the API request sent to the AI model, including a placeholder `{query}` for the AI's generated input.
- Pre-Request and Post-Request (optional): JSON objects defining optional requests to be executed before and after each turn in the mission, respectively.
- LLM Evaluation Prompt (optional): A prompt guiding the LLM in evaluating the overall success of the mission based on the conversation history and the expected Mission Result. This prompt is used in the `evaluate_mission` function.
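Putting these fields together, a "Test Mission" template might look like the following sketch. The field names and schema here are illustrative assumptions based on the list above, not a definitive Litmus schema; check your deployment for the exact format:

```json
{
  "template_id": "flight-booking-template",
  "mission_duration": 8,
  "mission_data": [
    {
      "mission": "Book a flight for two adults from San Francisco to New York, departing December 20th and returning December 25th.",
      "mission_result": "A confirmed round-trip booking for two adults with the requested dates.",
      "filter": "travel,booking",
      "source": "manual",
      "block": false,
      "category": "task-completion"
    }
  ],
  "request_payload": {
    "prompt": "{query}"
  },
  "llm_evaluation_prompt": "Given the conversation history and the expected mission result, decide whether the mission succeeded."
}
```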
Purpose
"Test Mission" templates aim to:
- Assess conversational fluency and coherence: Evaluate how well your AI maintains a natural and meaningful dialogue flow.
- Test goal-oriented behavior: Verify if your AI can achieve predefined tasks or goals through its interactions.
- Identify limitations in context understanding: Detect scenarios where your AI struggles to maintain context or follow instructions across multiple turns.
- Evaluate task completion and outcome: Assess the success of the mission based on the final state of the conversation and its alignment with the expected Mission Result.
Scenarios
Good Use Cases
- Evaluating chatbot or dialogue system performance: Assess how naturally and effectively your AI interacts in a conversational setting.
- Testing task-oriented agents: Verify if your AI can successfully complete tasks like booking appointments, ordering food, or providing customer support.
- Evaluating multi-step reasoning and problem-solving abilities: Assess your AI's capacity to break down complex goals into manageable steps and execute them through its interactions.
Bad Use Cases
- Single-turn interactions or simple question answering: "Test Missions" are overkill for scenarios where each input is independent. Use "Test Runs" for these simpler cases.
- Testing functionalities that don't involve conversational flow: Avoid using "Test Missions" for tasks like sentiment analysis or text classification, which don't require multi-turn interactions.
Starting a Test Mission
CLI
- Get your `RUN_ID` and `TEMPLATE_ID`.
- Run the following `litmus` CLI command:

```bash
litmus start $TEMPLATE_ID $RUN_ID
```
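For example, with illustrative values (replace both IDs with your own):

```bash
# Hypothetical IDs for illustration only.
export TEMPLATE_ID=flight-booking-template
export RUN_ID=flight-booking-run-001
litmus start $TEMPLATE_ID $RUN_ID
```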
API (Simple)
- Construct a JSON payload:

```json
{
  "run_id": "your-unique-run-id",
  "template_id": "your-template-id"
}
```

- Send a POST request to the `/runs/submit_simple` endpoint.
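As a sketch, assuming your Litmus service is reachable at a placeholder host, the request could be sent with `curl`:

```bash
# The host below is a placeholder; substitute your Litmus deployment's address.
curl -X POST "https://your-litmus-host.example.com/runs/submit_simple" \
  -H "Content-Type: application/json" \
  -d '{ "run_id": "your-unique-run-id", "template_id": "your-template-id" }'
```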
API (Advanced)
- Construct a JSON payload:

```json
{
  "run_id": "your-unique-run-id",
  "template_id": "your-template-id",
  "pre_request": { ... },   // Optional
  "post_request": { ... },  // Optional
  "test_request": { ... },
  "template_type": "Test Mission",
  "mission_duration": <number_of_turns>  // Required for Test Missions
}
```

- Send a POST request to the `/runs/submit` endpoint.
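A hedged `curl` sketch for the advanced endpoint follows; the `test_request` body and the host are placeholders you should shape to your model's API:

```bash
# Submit an advanced Test Mission run (illustrative values throughout).
curl -X POST "https://your-litmus-host.example.com/runs/submit" \
  -H "Content-Type: application/json" \
  -d '{
        "run_id": "your-unique-run-id",
        "template_id": "your-template-id",
        "test_request": { "prompt": "{query}" },
        "template_type": "Test Mission",
        "mission_duration": 8
      }'
```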
UI
- Navigate to the "Start New Run" page.
- Select your "Test Mission" template.
- Enter your `RUN_ID`.
- The UI should display input fields for `Mission Duration` and allow you to define or edit missions.
- Submit the run.
Configuration
You can configure various aspects of a "Test Mission" template in the UI, including:
- Modifying the Mission Data: In the Mission tab, you can define or update missions by adding or removing mission items. Each mission item requires a mission description and an expected mission result. You can optionally provide additional data like filter, source, block, and category for each mission item.
- Modifying the Request Payload: In the Request Payload tab, modify the Request Payload to match your API requirements using the built-in JSON editor. Use placeholders (e.g., `{query}`) for dynamic values from your test cases or missions. For "Test Missions", the `{query}` placeholder is dynamically replaced with the LLM's generated request in each turn (see the payload sketch after this list).
- Defining or updating the Pre-Request and Post-Request: Navigate to the Pre-Request and Post-Request tabs and use the built-in JSON editor to define or update the pre-request and post-request payloads.
- Crafting the LLM Evaluation Prompt: In the LLM Evaluation Prompt tab, enter your prompt in the text area.
- Selecting the appropriate Input Field and Output Field:
- Click on the "Input Field" button. A drawer will open displaying the "Request Payload" as a JSON Tree. Click on the node representing the field you want to use as input.
- Click on the "Output Field" button. Before selecting an output field, you need to run the request to get an example response. Once you have an example response, a drawer will open displaying the "Response Payload" as a JSON Tree. Click on the node representing the field you want to use as output for the assessment.
- Adjust Mission Duration: In the Missions tab, change the `Mission Duration` to control the length of the multi-turn interactions.
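For reference, a minimal request payload using the `{query}` placeholder might look like the following; the field names are assumptions, so match them to whatever your model's API expects:

```json
{
  "model": "your-model-id",
  "messages": [
    { "role": "user", "content": "{query}" }
  ]
}
```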
Example
For example, if your mission is to book a flight from San Francisco to New York, your mission description could be:
"Your mission is to book a flight for two adults from San Francisco to New York. The departure date should be December 20th, and the return date should be December 25th."
The following could be a possible dialogue:
- User: Book a flight from San Francisco to New York for two adults on December 20th, returning on December 25th.
- AI: Sure, I can help with that. What are your preferred airlines?
- User: I don't have a preference. Just find the cheapest option.
- AI: Okay, give me a moment to search for the best deals.
- AI: I found a flight on United Airlines for $450 per person. Would you like to proceed with booking?
- User: Yes, please.
- AI: Great! Can I have the names of the passengers as they appear on their IDs?
- User: John Doe and Jane Doe.
- AI: Thank you. Your flight is now confirmed. You'll receive a confirmation email shortly.
In this example, the LLM would evaluate whether the conversation led to a successful flight booking, considering the mission description and the conversation flow.
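An LLM evaluation prompt for this mission could resemble the following sketch. The angle-bracket placeholders are illustrative, not a documented Litmus convention:

```
You are evaluating a multi-turn conversation against a mission.
Mission: <mission description>
Expected outcome: <mission result>
Conversation history: <full transcript>

Answer PASS if the conversation achieved the expected outcome, otherwise FAIL,
followed by a one-sentence justification.
```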
Restarting and Deleting
Similar to "Test Runs," you can restart a "Test Mission" using the `/runs/invoke` API endpoint or the UI's restart button. Deletion is done through the `/runs/<run_id>` endpoint or the UI's delete button.
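For instance, assuming the same placeholder host as above (the HTTP methods and request bodies here are assumptions; check your deployment's API reference):

```bash
# Restart an existing run (method assumed to be POST).
curl -X POST "https://your-litmus-host.example.com/runs/invoke" \
  -H "Content-Type: application/json" \
  -d '{ "run_id": "your-unique-run-id" }'

# Delete a run by ID (method assumed to be DELETE).
curl -X DELETE "https://your-litmus-host.example.com/runs/your-unique-run-id"
```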
Additional Information
- Each mission's conversation history and the final LLM assessment are stored in Firestore, allowing for detailed analysis.
- Consider the ethical implications of multi-turn interactions when crafting your mission descriptions and LLM prompts.
- Experiment with different `Mission Duration` values to find a balance between comprehensive evaluation and resource efficiency.