Evaluate Conversation
In conversational AI, evaluating the quality of an interaction is critical for improving a dialogue system's ability to generate meaningful and effective responses. These evaluations provide a systematic way to assess how well a model maintains conversational flow, resolves user queries, and stays contextually relevant.
This evaluation is needed in situations such as testing chatbot or virtual assistant interactions to ensure they meet user needs, monitoring deployed systems to identify where conversations fail to resolve, and comparing models to determine which one performs better at resolving user queries.
The primary input to this evaluation is a list of messages that make up the conversation. These messages are expected to be strings representing the dialogue between a user and an AI.
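For illustration, a conversation input might look like the snippet below. This is a hypothetical payload; the exact shape depends on the platform, but each message is a plain string capturing one turn of the dialogue:

```python
# Hypothetical conversation input: an ordered list of message strings
# alternating between the user and the assistant.
messages = [
    "User: My order #1042 hasn't arrived yet.",
    "Assistant: I'm sorry to hear that. Let me check the shipping status.",
    "Assistant: Your order is out for delivery and should arrive today.",
    "User: Great, thank you!",
]
```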
The LLM agent sends the organised input to an LLM service, which analyses the interaction and provides a structured output. This output specifies whether each part of the interaction is resolved, unresolved, or partially resolved, along with explanations.
The scoring pipeline calculates a score based on the resolution status of each message. Fully resolved messages contribute more to the score than partially resolved ones. The score is then compared against a failure threshold to determine whether the conversation is considered a failure.
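As a rough sketch of this logic, the function below averages per-message weights and applies a threshold. The specific weights, threshold value, and names are assumptions for illustration, not the platform's exact implementation:

```python
# Assumed weights: fully resolved messages count more than partially
# resolved ones, and unresolved messages contribute nothing.
RESOLUTION_WEIGHTS = {"resolved": 1.0, "partially_resolved": 0.5, "unresolved": 0.0}

def score_conversation(statuses: list[str], failure_threshold: float = 0.7) -> tuple[float, bool]:
    """Average the per-message resolution weights and flag failures."""
    if not statuses:
        return 0.0, True
    score = sum(RESOLUTION_WEIGHTS[s] for s in statuses) / len(statuses)
    return score, score < failure_threshold

# Example: two resolved turns and one partially resolved turn -> ~0.83, not a failure.
score, failed = score_conversation(["resolved", "partially_resolved", "resolved"])
```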
The evaluation results are then returned, including the score, failure status, reasons for unresolved messages, and other metadata.
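Concretely, the returned payload might resemble the dictionary below. The field names and values are illustrative, not the platform's exact schema:

```python
# Hypothetical result payload for a conversation evaluation.
result = {
    "score": 0.83,                      # aggregate resolution score
    "failed": False,                    # score compared against the failure threshold
    "reasons": [
        "Turn 2 was only partially resolved: the delivery date was not confirmed."
    ],
    "metadata": {"model": "gpt-4o", "num_messages": 4},
}
```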
The following two evals focus on critical aspects of conversational quality, addressing both the resolution of user queries and the logical flow of interactions:
1. Conversation Resolution
This evaluation examines a series of exchanged messages between the user and the AI system. It uses an LLM to analyse the interaction and determine whether the conversation reached a satisfactory resolution. The analysis results in a score representing the level of resolution achieved.
This eval returns an output score. A higher score indicates that the conversation was resolved effectively; a lower score points to incomplete, unclear, or unresolved conversations.
The Conversation Resolution evaluation enables developers to:
- Spot unresolved interactions and improve system responses.
- Benchmark different conversational models to find the best-performing one.
- Continuously improve the quality of AI interactions to meet user expectations.
Click here to read the eval definition of Conversation Resolution
a. Using Interface
Required Parameters
- messages: An array of conversation messages between the user and the assistant. This is a required parameter.
Configuration Parameters
- model: Specifies the LLM model to use for evaluation. This is a required configuration parameter.
b. Using SDK
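The snippet below is a minimal sketch of running the Conversation Resolution eval through an SDK. The client class, import path, and method names are assumptions for illustration, not the exact API; only the required messages input and the model configuration parameter come from the documentation above:

```python
# Hypothetical SDK usage for Conversation Resolution; the client and
# eval name below are placeholders, not the exact API surface.
from my_evals_sdk import Evaluator  # placeholder import

evaluator = Evaluator(model="gpt-4o-mini")  # model is a required config parameter

result = evaluator.evaluate(
    eval_name="conversation_resolution",
    inputs={
        "messages": [
            "User: Can you reset my password?",
            "Assistant: Sure, I've sent a reset link to your email.",
            "User: Got it, thanks!",
        ]
    },
)

print(result.score)   # higher = conversation resolved effectively
print(result.failed)  # True if the score fell below the failure threshold
```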
2. Conversation Coherence
This eval determines whether a conversation flows logically, stays consistent with the context, and avoids abrupt or irrelevant shifts. It checks that context is maintained and responses are logical, which improves the user experience. Without coherence, conversations can become confusing or nonsensical, leading to user frustration and disengagement.
This evaluation is useful for:
- Testing the logical flow of chatbot or virtual assistant interactions.
- Benchmarking different conversational models for their ability to handle complex, multi-turn dialogues.
- Identifying breakdowns in context management or logical progression during AI interactions.
Click here to read the eval definition of Conversation Coherence
a. Using Interface
Required Parameters
- messages: Column that contains the complete conversation, represented as an array of user and assistant messages. This is a required parameter.
Configuration Parameters
- model: Specifies the LLM model to use for evaluation, configurable based on the available models. This is a required configuration parameter.
This also returns an output score. A higher score reflects a logically consistent and contextually relevant conversation. A lower score indicates issues like abrupt topic shifts, irrelevant responses, or loss of context.
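Under the same assumptions as the Conversation Resolution sketch above, and reusing the same hypothetical evaluator client, a Conversation Coherence run might look like:

```python
# Hypothetical SDK usage for Conversation Coherence, mirroring the
# resolution example; all names remain assumptions.
result = evaluator.evaluate(
    eval_name="conversation_coherence",
    inputs={
        "messages": [
            "User: What's the weather in Paris today?",
            "Assistant: It's sunny and around 22°C in Paris.",
            "User: Should I pack an umbrella for my trip?",
            "Assistant: Rain is unlikely this week, so an umbrella is optional.",
        ]
    },
)

print(result.score)  # lower scores flag abrupt topic shifts or lost context
```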