Eval Definition
Conversation Resolution
Evaluates whether each user query or statement in a conversation receives an appropriate and complete response from the AI. This metric assesses if the conversation reaches satisfactory conclusions for each user interaction, ensuring that questions are answered and statements are appropriately acknowledged.
Evaluation Using Interface
Input:
- Required Inputs:
- output: column containing conversation history between the user and the model
Output:
- Score: percentage score between 0 and 100
Interpretation:
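- Higher scores: Indicate that more of the user's queries and statements received appropriate, complete responses, so the conversation is resolved.
- Lower scores: Suggest that some queries went unanswered or some statements went unacknowledged, so the conversation is not resolved.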
Evaluation Using Python SDK
Click here to learn how to set up evaluation using the Python SDK.
Input:
- Required Inputs:
- messages (list[string]): conversation history between the user and the model, provided as query and response pairs
Output:
- Score (float): returns a score between 0 and 1
Interpretation:
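- Higher scores: Indicate that more of the user's queries and statements received appropriate, complete responses, so the conversation is resolved.
- Lower scores: Suggest that some queries went unanswered or some statements went unacknowledged, so the conversation is not resolved.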
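The input and output shapes described above can be sketched in plain Python. This is a minimal illustration, not the SDK's API: `build_messages` and `to_percentage` are hypothetical helpers, the `user:` / `assistant:` prefixes are an assumed convention for encoding query and response pairs as strings, and the conversion assumes the SDK's 0 to 1 score and the interface's 0 to 100 percentage are the same metric on different scales.

```python
from typing import List, Tuple


def build_messages(pairs: List[Tuple[str, str]]) -> List[str]:
    """Flatten (query, response) pairs into an alternating list of strings."""
    messages: List[str] = []
    for query, response in pairs:
        # The "user:" / "assistant:" prefixes are an assumed convention,
        # not a format confirmed by the SDK documentation.
        messages.append(f"user: {query}")
        messages.append(f"assistant: {response}")
    return messages


def to_percentage(score: float) -> float:
    """Convert a score in [0, 1] to the 0-100 scale shown in the interface."""
    if not 0.0 <= score <= 1.0:
        raise ValueError(f"score must be between 0 and 1, got {score}")
    return score * 100.0


# Example conversation: one question that is answered, one statement
# that is acknowledged.
pairs = [
    ("How do I reset my password?",
     "Open Settings, choose Security, then select Reset Password."),
    ("Thanks, that worked.",
     "Glad to hear it. Let me know if you need anything else."),
]
messages = build_messages(pairs)
print(len(messages))        # number of strings passed as `messages`
print(to_percentage(0.75))  # a hypothetical SDK score on the 0-100 scale
```

The resulting `messages` list is what would be passed to the evaluation; how the evaluator is constructed and invoked depends on the SDK setup and is not shown here.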
What to do when Conversation Resolution is Low
- Add confirmation mechanisms to verify user satisfaction
- Develop fallback responses for unclear or complex queries
- Track common patterns in unresolved queries for improvement
- Consider implementing a clarification system for ambiguous requests
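One way to act on the tracking suggestion above is to collect low-scoring conversations and count which topics recur among them. The sketch below is illustrative only: the record fields (`conversation_id`, `topic`, `score`), the 0.5 threshold, and both helper functions are assumptions, not part of the SDK.

```python
from collections import Counter
from typing import Dict, List

# Assumed cutoff for "low" resolution; tune it for your own data.
LOW_RESOLUTION_THRESHOLD = 0.5


def find_unresolved(results: List[Dict],
                    threshold: float = LOW_RESOLUTION_THRESHOLD) -> List[Dict]:
    """Return the records whose resolution score falls below the threshold."""
    return [r for r in results if r["score"] < threshold]


def count_topics(unresolved: List[Dict]) -> Counter:
    """Count how often each topic appears among unresolved conversations."""
    return Counter(r["topic"] for r in unresolved)


# Hypothetical evaluation results, one record per conversation.
results = [
    {"conversation_id": "c1", "topic": "billing", "score": 0.9},
    {"conversation_id": "c2", "topic": "billing", "score": 0.3},
    {"conversation_id": "c3", "topic": "login", "score": 0.2},
    {"conversation_id": "c4", "topic": "billing", "score": 0.4},
]

unresolved = find_unresolved(results)
topic_counts = count_topics(unresolved)
print(topic_counts.most_common())
```

Topics that dominate the unresolved set are natural candidates for dedicated fallback responses or clarification prompts.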
Comparing Conversation Resolution with Similar Evals
- Conversation Coherence: While Resolution focuses on addressing user needs, Coherence evaluates the logical flow and context maintenance. A conversation can be perfectly coherent but fail to resolve user queries, or vice versa.
- Completeness: Resolution differs from Completeness as it focuses on satisfactory conclusion rather than comprehensive coverage. A response can be complete but not resolve the user’s actual need.
- Context Relevance: Resolution evaluates whether queries are answered, while Context Relevance assesses if the provided context is sufficient for generating responses. A response can use relevant context but still fail to resolve the user’s query.