1. Conversation Resolution
This evaluation examines the series of messages exchanged between the user and the AI system. It uses an LLM to analyse the interaction and determine whether the conversation reached a satisfactory resolution, returning a score that represents the level of resolution achieved. A higher score indicates that the conversation was resolved effectively; a lower score points to an incomplete, unclear, or unresolved conversation. (A minimal sketch of this LLM-as-judge approach appears after the list below.) The Conversation Resolution evaluation enables developers to:
- Spot unresolved interactions and improve system responses.
- Benchmark different conversational models to find the best-performing one.
- Continuously improve the quality of AI interactions to meet user expectations.
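To make the mechanism concrete, here is a minimal LLM-as-judge sketch. It assumes the OpenAI Python client, the model name gpt-4o-mini, and a hypothetical helper judge_conversation with an invented rubric; the platform's actual prompt and scoring rubric are not public, so treat this as an illustration of the approach, not the implementation.

```python
import json
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

def judge_conversation(messages: list[dict], rubric: str) -> float:
    """Ask an LLM to score a conversation against a rubric; returns 0.0-1.0."""
    transcript = "\n".join(f"{m['role']}: {m['content']}" for m in messages)
    response = client.chat.completions.create(
        model="gpt-4o-mini",  # assumed judge model; any capable model works
        response_format={"type": "json_object"},
        messages=[
            {"role": "system", "content": rubric},
            {"role": "user", "content": transcript},
        ],
    )
    return float(json.loads(response.choices[0].message.content)["score"])

# Hypothetical rubric; the real evaluation's prompt may differ.
RESOLUTION_RUBRIC = (
    'Rate how fully the user\'s issue was resolved by the end of this '
    'conversation. Respond with JSON: {"score": <number between 0 and 1>}.'
)
```

Calling judge_conversation(messages, RESOLUTION_RUBRIC) then yields a resolution score between 0 and 1.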
a. Using Interface
Required Parameters
- output: An array of conversation messages between the user and the assistant (required).
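For illustration, the output payload for a fully resolved conversation might look like the following; the role/content field names are an assumption based on the common chat-message format, not a confirmed schema.

```python
# Hypothetical example of the `output` parameter (field names assumed).
output = [
    {"role": "user", "content": "My invoice shows a duplicate charge."},
    {"role": "assistant", "content": "I can help. What is the invoice number?"},
    {"role": "user", "content": "It's INV-1042."},
    {"role": "assistant", "content": "Thanks. I've refunded the duplicate charge."},
    {"role": "user", "content": "Great, that solves it. Thank you!"},
]
```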
b. Using SDK
Export your API key and Secret key into your environment variables.
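As a sketch, assuming the SDK reads credentials from variables named FI_API_KEY and FI_SECRET_KEY (hypothetical names; check your SDK reference for the exact ones), the keys can be set from Python like this:

```python
import os

# Hypothetical variable names; substitute the names your SDK expects.
os.environ["FI_API_KEY"] = "your-api-key"
os.environ["FI_SECRET_KEY"] = "your-secret-key"
```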
2. Conversation Coherence
This eval determines whether a conversation flows logically, stays consistent with the context, and avoids abrupt or irrelevant shifts. It ensures that context is maintained, responses are logical, and the user experience is enhanced. Without coherence, conversations can become confusing or nonsensical, leading to user frustration and disengagement. (The judge sketch from the previous section can be reused with a coherence rubric, as shown after the list below.) This evaluation is useful for:
- Testing the logical flow of chatbot or virtual assistant interactions.
- Benchmarking different conversational models for their ability to handle complex, multi-turn dialogues.
- Identifying breakdowns in context management or logical progression during AI interactions.
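The hedged judge_conversation sketch from the Conversation Resolution section can be pointed at a coherence rubric instead; again, the rubric wording is an invented assumption, not the platform's actual prompt.

```python
# Hypothetical rubric for coherence judging (assumption, not the real prompt).
COHERENCE_RUBRIC = (
    "Rate how coherent this conversation is: does each turn follow "
    "logically from the previous ones, stay on topic, and preserve "
    'context? Respond with JSON: {"score": <number between 0 and 1>}.'
)

coherence_score = judge_conversation(output, COHERENCE_RUBRIC)
```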
a. Using Interface
Required Inputs
- output: Column that contains the complete conversation, represented as an array of user and assistant messages.
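For instance, a conversation like the following hypothetical exchange should receive a low coherence score, since the assistant's final reply abruptly abandons the established context:

```python
# Hypothetical incoherent conversation: the last reply ignores the topic.
incoherent = [
    {"role": "user", "content": "Can you help me reset my password?"},
    {"role": "assistant", "content": "Sure, I'll send a reset link to your email."},
    {"role": "user", "content": "I didn't receive the email."},
    {"role": "assistant", "content": "Our premium plan starts at $20 per month."},
]
```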