Evaluate Customer Agent Conversations

Score multi-turn conversations for quality, context retention, query handling, loop detection, escalation, and prompt conformance using built-in Turing metrics.

📝 TL;DR

Score multi-turn customer support conversations for quality, context retention, loop detection, escalation handling, and prompt conformance using FutureAGI’s built-in conversation metrics.

Open in Colab · GitHub
Time: 15 min · Difficulty: Intermediate · Package: ai-evaluation

By the end of this guide you will have scored a customer support conversation for overall quality, diagnosed specific failure modes like context loss and repetitive loops, checked whether the agent followed its system prompt, and run a full scorecard comparing a good conversation against a bad one.

Prerequisites

Tip

You can also pass a valid audio URL instead of a JSON conversation string for any of these metrics. Use model_name="turing_large" when evaluating audio inputs.
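Because a conversation can arrive either as a message list or as an audio URL, it can be convenient to wrap that choice in a small helper. The helper below (a hypothetical convenience, not part of the ai-evaluation package) builds the keyword arguments for `evaluator.evaluate()` either way, picking the model name the tip above calls for:

```python
import json

def build_eval_kwargs(conversation):
    """Build the inputs/model_name kwargs for evaluator.evaluate().

    A list of {role, content} messages is serialized to JSON and scored
    with turing_small; a plain string is treated as an audio URL and
    scored with turing_large, per the tip above. Illustrative helper only.
    """
    if isinstance(conversation, str):  # assume an audio URL
        return {"inputs": {"conversation": conversation},
                "model_name": "turing_large"}
    return {"inputs": {"conversation": json.dumps(conversation)},
            "model_name": "turing_small"}
```

With this in place, `evaluator.evaluate(eval_templates=..., **build_eval_kwargs(convo))` works for either input type.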

Install

pip install ai-evaluation
export FI_API_KEY="your-api-key"
export FI_SECRET_KEY="your-secret-key"

Set up the evaluator

import os
import json
from fi.evals import Evaluator

evaluator = Evaluator(
    fi_api_key=os.environ["FI_API_KEY"],
    fi_secret_key=os.environ["FI_SECRET_KEY"],
)

Define two conversations

A well-handled support call and a poorly handled one. Both start with the same customer issue: a duplicate charge.

good_conversation = [
    {"role": "user", "content": "Hi, I was charged twice for order #7291. Can you help?"},
    {"role": "assistant", "content": "I'm sorry about that! Let me look up order #7291 right away."},
    {"role": "assistant", "content": "I can see the duplicate charge. I've initiated a refund of $49.99 to your original payment method. It should appear within 3-5 business days."},
    {"role": "user", "content": "Great, thank you. Will I get a confirmation email?"},
    {"role": "assistant", "content": "Yes, you'll receive a refund confirmation email within the next hour. Is there anything else I can help with?"},
    {"role": "user", "content": "No, that's all. Thanks for the quick help!"},
    {"role": "assistant", "content": "You're welcome! Don't hesitate to reach out if you need anything else. Have a great day!"},
]

bad_conversation = [
    {"role": "user", "content": "Hi, I was charged twice for order #7291. Can you help?"},
    {"role": "assistant", "content": "I'd be happy to help! Have you tried checking your account settings?"},
    {"role": "user", "content": "This isn't about account settings. I was charged $49.99 twice for the same order."},
    {"role": "assistant", "content": "I understand. Could you provide your order number?"},
    {"role": "user", "content": "I already said it's order #7291."},
    {"role": "assistant", "content": "Could you provide your order number so I can look into this?"},
    {"role": "user", "content": "I just told you twice — #7291. Can I speak to a manager?"},
    {"role": "assistant", "content": "I apologize for the inconvenience. Have you considered upgrading to our Premium support plan for faster resolution?"},
    {"role": "user", "content": "No! I just want my refund. Please transfer me to someone who can help."},
    {"role": "assistant", "content": "I understand your concern. Let me check your account settings to see if there's an issue."},
]

The bad conversation has multiple failure modes: it ignores the actual question, forgets the order number (asking for it twice), loops back to “account settings”, attempts an upsell, and fails to escalate when the user asks for a manager.
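Since every metric expects the conversation as a JSON array of role/content messages, a quick pre-flight check can catch malformed input before spending an API call. This validator is an illustrative helper, not part of the package; the allowed roles are an assumption based on the examples above:

```python
def validate_conversation(messages):
    """Raise ValueError unless `messages` is a list of
    {"role": ..., "content": ...} dicts with expected roles."""
    allowed_roles = {"user", "assistant", "system"}
    for i, msg in enumerate(messages):
        if set(msg) != {"role", "content"}:
            raise ValueError(f"message {i}: expected exactly 'role' and 'content' keys")
        if msg["role"] not in allowed_roles:
            raise ValueError(f"message {i}: unexpected role {msg['role']!r}")
        if not isinstance(msg["content"], str) or not msg["content"].strip():
            raise ValueError(f"message {i}: content must be a non-empty string")
    return messages
```

Run it once on each conversation before calling `json.dumps`, so a typo like a missing `content` key fails fast and locally.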

Score overall conversation quality

customer_agent_conversation_quality rates the overall interaction on a 1–5 scale considering clarity, helpfulness, responsiveness, and tone.

Note

Choice-based metrics (quality, query handling, loop detection, escalation) return eval_result.output as a list (e.g., ['5']). Score-based metrics (coherence, resolution, context retention) return a plain number. In the snippets below, output[0] extracts the value when the output is a list.
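The repeated list-versus-number check in the snippets below can be factored into a one-line helper (the name is ours, not a package API):

```python
def extract_value(output):
    """Normalize eval_result.output: choice-based metrics return a
    one-element list like ['5'], score-based metrics a plain number."""
    return output[0] if isinstance(output, list) else output
```

Then every snippet can call `extract_value(eval_result.output)` instead of repeating the `isinstance` check.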

for label, convo in [("Good", good_conversation), ("Bad", bad_conversation)]:
    result = evaluator.evaluate(
        eval_templates="customer_agent_conversation_quality",
        inputs={"conversation": json.dumps(convo)},
        model_name="turing_small",
    )
    eval_result = result.eval_results[0]
    score = eval_result.output[0] if isinstance(eval_result.output, list) else eval_result.output
    print(f"{label} conversation: {score}/5")
    print(f"  Reason: {eval_result.reason}\n")

Expected output:

Good conversation: 5/5
  Reason: The agent promptly addressed the issue, provided a clear resolution...

Bad conversation: 1/5
  Reason: The agent repeatedly ignored the customer's request, forgot context...

Diagnose why the bad conversation failed

Run targeted metrics on the bad conversation to pinpoint specific failure modes.

Context retention: Did the agent remember details from earlier in the conversation?

result = evaluator.evaluate(
    eval_templates="customer_agent_context_retention",
    inputs={"conversation": json.dumps(bad_conversation)},
    model_name="turing_small",
)
eval_result = result.eval_results[0]
print(f"Context retention: {eval_result.output}")
print(f"Reason: {eval_result.reason}")

Query handling: Did the agent correctly interpret and answer the user’s questions?

result = evaluator.evaluate(
    eval_templates="customer_agent_query_handling",
    inputs={"conversation": json.dumps(bad_conversation)},
    model_name="turing_small",
)
eval_result = result.eval_results[0]
score = eval_result.output[0] if isinstance(eval_result.output, list) else eval_result.output
print(f"Query handling: {score}")
print(f"Reason: {eval_result.reason}")

Loop detection: Did the agent get stuck repeating the same prompts?

result = evaluator.evaluate(
    eval_templates="customer_agent_loop_detection",
    inputs={"conversation": json.dumps(bad_conversation)},
    model_name="turing_small",
)
eval_result = result.eval_results[0]
score = eval_result.output[0] if isinstance(eval_result.output, list) else eval_result.output
print(f"Loop detection: {score}")
print(f"Reason: {eval_result.reason}")

Human escalation: Did the agent escalate when the user asked for a manager?

result = evaluator.evaluate(
    eval_templates="customer_agent_human_escalation",
    inputs={"conversation": json.dumps(bad_conversation)},
    model_name="turing_small",
)
eval_result = result.eval_results[0]
score = eval_result.output[0] if isinstance(eval_result.output, list) else eval_result.output
print(f"Human escalation: {score}")
print(f"Reason: {eval_result.reason}")

Expected output:

Context retention: 12
Reason: The agent asked for the order number twice despite the user providing it...

Query handling: never
Reason: The agent never directly addressed the duplicate charge issue...

Loop detection: frequently
Reason: The agent circled back to "account settings" twice and asked for the order number twice...

Human escalation: Failed
Reason: The user explicitly requested a manager but the agent deflected with an upsell...

Each metric catches a different dimension of failure. Together they reveal a clear story: the agent forgot context, ignored the question, looped, and refused to escalate.
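To turn these diagnostics into an automated check, the failure-indicating values can be encoded in a lookup table. The failing-value sets below are assumptions inferred from the outputs shown above, not library constants:

```python
# Output values that indicate failure for each diagnostic metric (assumed).
FAILING_VALUES = {
    "customer_agent_query_handling": {"never", "rarely"},
    "customer_agent_loop_detection": {"frequently", "always"},
    "customer_agent_human_escalation": {"Failed"},
}

def failed_metrics(results):
    """`results` maps metric name -> normalized output string;
    return the names of metrics whose value indicates a failure."""
    return [name for name, bad in FAILING_VALUES.items()
            if results.get(name) in bad]
```

A CI gate could then fail a build whenever `failed_metrics(...)` is non-empty for a regression conversation.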

Evaluate prompt conformance

customer_agent_prompt_conformance checks whether the agent followed its system prompt throughout the conversation. This is the only conversation metric that takes an additional system_prompt input.

system_prompt = (
    "You are a billing support agent for TechStore. "
    "Your role is to help customers resolve payment and billing issues. "
    "Always acknowledge the customer's issue first, then investigate. "
    "Never upsell products during a support interaction. "
    "If a customer asks to speak with a manager, escalate immediately."
)

for label, convo in [("Good", good_conversation), ("Bad", bad_conversation)]:
    result = evaluator.evaluate(
        eval_templates="customer_agent_prompt_conformance",
        inputs={
            "system_prompt": system_prompt,
            "conversation": json.dumps(convo),
        },
        model_name="turing_small",
    )
    eval_result = result.eval_results[0]
    score = eval_result.output[0] if isinstance(eval_result.output, list) else eval_result.output
    print(f"{label} conversation - prompt conformance: {score}")
    print(f"  Reason: {eval_result.reason}\n")

Expected output:

Good conversation - prompt conformance: 95
  Reason: The agent acknowledged the issue, investigated, and resolved it...

Bad conversation - prompt conformance: 8
  Reason: The agent violated multiple system prompt rules: upsold a product, failed to escalate...

Full scorecard

Run all key metrics on both conversations in a single diagnostic sweep.

metrics = [
    ("conversation_coherence", "Coherence", "score"),
    ("conversation_resolution", "Resolution", "score"),
    ("customer_agent_conversation_quality", "Quality", "choice"),
    ("customer_agent_context_retention", "Context", "score"),
    ("customer_agent_query_handling", "Queries", "choice"),
    ("customer_agent_loop_detection", "Loops", "choice"),
    ("customer_agent_human_escalation", "Escalation", "passfail"),
]

print(f"{'Metric':<14}  {'Good':>12}  {'Bad':>12}")
print("-" * 42)

for metric_name, label, _ in metrics:
    good_result = evaluator.evaluate(
        eval_templates=metric_name,
        inputs={"conversation": json.dumps(good_conversation)},
        model_name="turing_small",
    )
    bad_result = evaluator.evaluate(
        eval_templates=metric_name,
        inputs={"conversation": json.dumps(bad_conversation)},
        model_name="turing_small",
    )
    good_raw = good_result.eval_results[0].output
    bad_raw = bad_result.eval_results[0].output
    good_val = good_raw[0] if isinstance(good_raw, list) else good_raw
    bad_val = bad_raw[0] if isinstance(bad_raw, list) else bad_raw
    print(f"{label:<14}  {str(good_val):>12}  {str(bad_val):>12}")

Expected output:

Metric                  Good           Bad
------------------------------------------
Coherence                1.0           0.4
Resolution               1.0           0.0
Quality                    5             1
Context                   95            12
Queries               always         never
Loops                  never    frequently
Escalation            Passed        Failed

The good conversation passes every metric. The bad conversation fails across the board: low coherence, unresolved, poor quality, lost context, never handled queries correctly, frequently looped, and failed to escalate.
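The table-printing logic above can be pulled into a pure function, which makes the layout testable without any API calls (the helper name is ours):

```python
def render_scorecard(rows):
    """`rows` is an iterable of (label, good_value, bad_value) tuples.
    Returns the same fixed-width table the scorecard loop above prints."""
    lines = [f"{'Metric':<14}  {'Good':>12}  {'Bad':>12}", "-" * 42]
    for label, good, bad in rows:
        lines.append(f"{label:<14}  {str(good):>12}  {str(bad):>12}")
    return "\n".join(lines)
```

Collect `(label, good_val, bad_val)` tuples in the loop, then `print(render_scorecard(rows))` once at the end.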

Run the eval group from the dashboard

You can run all conversational agent metrics at once from the dashboard using the Conversational agent evaluation eval group — no code required.

  1. Go to app.futureagi.com → Dataset
  2. Open a dataset that has a conversation column (JSON array of role/content messages) and a system_prompt column containing the agent’s system prompt
  3. Click Evaluate → Add Evaluations
  4. Under Groups, select Conversational agent evaluation — this adds all 13 metrics in one click
  5. Map the conversation column to the conversation input, and the system_prompt column to the system prompt input — this is needed for customer_agent_prompt_conformance, which checks whether the agent followed its instructions
  6. Click Add & Run

All metrics run in parallel. Scores appear as new columns alongside your data — one column per metric. Most metrics only need the conversation column; the system_prompt mapping is used by customer_agent_prompt_conformance and ignored by the rest.

Tip

Eval groups are also available in Simulation workflows — select the group when configuring a simulation and all metrics run automatically on every completed conversation.

Eval reference

| Eval | Input keys | Output | What it catches |
| --- | --- | --- | --- |
| conversation_coherence | conversation | Score (0–1) | Topic drift, contradictions, non-sequiturs |
| conversation_resolution | conversation | Score (0–1) | Unresolved issues, abandoned threads |
| customer_agent_conversation_quality | conversation | Choice (1–5) | Overall quality: clarity, helpfulness, tone |
| customer_agent_context_retention | conversation | Score (0–100) | Forgetting earlier details |
| customer_agent_query_handling | conversation | Choice (never–always) | Misinterpreting or ignoring questions |
| customer_agent_loop_detection | conversation | Choice (never–always) | Repetitive prompts, circular behavior |
| customer_agent_human_escalation | conversation | Pass or Fail | Failure to escalate when needed |
| customer_agent_prompt_conformance | system_prompt, conversation | Score (0–100) | Deviating from persona or guidelines |
| customer_agent_objection_handling | conversation | Choice (never–always) | Handling customer pushback |
| customer_agent_clarification_seeking | conversation | Choice (never–always) | Asking for clarification vs guessing |
| customer_agent_termination_handling | conversation | Choice (never–always) | Abrupt hang-ups, crashes, early cut-offs |
| customer_agent_interruption_handling | conversation | Score (0–100) | Recovering smoothly after interruptions |
| customer_agent_language_handling | conversation | Score (0–100) | Language/dialect detection and switching |
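The input-key column above can be encoded directly in code, so a generic runner knows when to attach the system prompt. This mapping mirrors the reference table; the helper itself is illustrative, not a package API:

```python
def build_inputs(metric, conversation_json, system_prompt=None):
    """Assemble the inputs dict for a given metric. Per the reference
    table, only customer_agent_prompt_conformance takes system_prompt."""
    inputs = {"conversation": conversation_json}
    if metric == "customer_agent_prompt_conformance":
        if system_prompt is None:
            raise ValueError("customer_agent_prompt_conformance requires a system_prompt")
        inputs["system_prompt"] = system_prompt
    return inputs
```

A batch runner can then call `evaluator.evaluate(eval_templates=metric, inputs=build_inputs(metric, convo_json, system_prompt), model_name="turing_small")` for every metric in one loop.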

What you built

You can now evaluate multi-turn customer support conversations across quality, failure diagnosis, prompt conformance, and comparative scorecards using built-in Turing metrics.

  • Scored a customer support conversation for overall quality with customer_agent_conversation_quality
  • Diagnosed specific failure modes: context loss, poor query handling, repetitive loops, and missed escalation
  • Checked whether the agent followed its system prompt with customer_agent_prompt_conformance
  • Ran a full scorecard comparing a good conversation against a bad one across 7 metrics
  • Used the Conversational agent evaluation group to run all 13 metrics on a dataset from the dashboard

