Evaluate Customer Agent Conversations
Score multi-turn conversations for quality, context retention, query handling, loop detection, escalation, and prompt conformance using built-in Turing metrics.
| Time | Difficulty | Package |
|---|---|---|
| 15 min | Intermediate | ai-evaluation |
By the end of this guide you will have scored a customer support conversation for overall quality, diagnosed specific failure modes like context loss and repetitive loops, checked whether the agent followed its system prompt, and run a full scorecard comparing a good conversation against a bad one.
- FutureAGI account → app.futureagi.com
- API keys: `FI_API_KEY` and `FI_SECRET_KEY` (see Get your API keys)
- Python 3.9+
Tip
You can also pass a valid audio URL instead of a JSON conversation string for any of these metrics. Use model_name="turing_large" when evaluating audio inputs.
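For instance, once the evaluator from the setup step below is available, an audio evaluation might look like this sketch (the recording URL is a placeholder, not a real endpoint):

```python
# Sketch: score a recorded call by passing an audio URL in place of the
# JSON transcript. The URL below is a placeholder, not a real recording.
result = evaluator.evaluate(
    eval_templates="customer_agent_conversation_quality",
    inputs={"conversation": "https://example.com/calls/order-7291.wav"},
    model_name="turing_large",  # turing_large is required for audio inputs
)
print(result.eval_results[0].output)
```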
Install
```bash
pip install ai-evaluation

export FI_API_KEY="your-api-key"
export FI_SECRET_KEY="your-secret-key"
```
Set up the evaluator
```python
import os
import json

from fi.evals import Evaluator

evaluator = Evaluator(
    fi_api_key=os.environ["FI_API_KEY"],
    fi_secret_key=os.environ["FI_SECRET_KEY"],
)
```

Define two conversations
A well-handled support call and a poorly handled one. Both start with the same customer issue: a duplicate charge.
```python
good_conversation = [
    {"role": "user", "content": "Hi, I was charged twice for order #7291. Can you help?"},
    {"role": "assistant", "content": "I'm sorry about that! Let me look up order #7291 right away."},
    {"role": "assistant", "content": "I can see the duplicate charge. I've initiated a refund of $49.99 to your original payment method. It should appear within 3-5 business days."},
    {"role": "user", "content": "Great, thank you. Will I get a confirmation email?"},
    {"role": "assistant", "content": "Yes, you'll receive a refund confirmation email within the next hour. Is there anything else I can help with?"},
    {"role": "user", "content": "No, that's all. Thanks for the quick help!"},
    {"role": "assistant", "content": "You're welcome! Don't hesitate to reach out if you need anything else. Have a great day!"},
]

bad_conversation = [
    {"role": "user", "content": "Hi, I was charged twice for order #7291. Can you help?"},
    {"role": "assistant", "content": "I'd be happy to help! Have you tried checking your account settings?"},
    {"role": "user", "content": "This isn't about account settings. I was charged $49.99 twice for the same order."},
    {"role": "assistant", "content": "I understand. Could you provide your order number?"},
    {"role": "user", "content": "I already said it's order #7291."},
    {"role": "assistant", "content": "Could you provide your order number so I can look into this?"},
    {"role": "user", "content": "I just told you twice — #7291. Can I speak to a manager?"},
    {"role": "assistant", "content": "I apologize for the inconvenience. Have you considered upgrading to our Premium support plan for faster resolution?"},
    {"role": "user", "content": "No! I just want my refund. Please transfer me to someone who can help."},
    {"role": "assistant", "content": "I understand your concern. Let me check your account settings to see if there's an issue."},
]
```

The bad conversation has multiple failure modes. It ignores the actual question, forgets the order number (asks twice), loops back to “account settings”, tries to upsell, and doesn’t escalate when the user asks for a manager.
Score overall conversation quality
`customer_agent_conversation_quality` rates the overall interaction on a 1–5 scale, considering clarity, helpfulness, responsiveness, and tone.
Note
Choice-based metrics (quality, query handling, loop detection, escalation) return eval_result.output as a list (e.g., ['5']). Score-based metrics (coherence, resolution, context retention) return a plain number. In the snippets below, output[0] extracts the value when the output is a list.
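The snippets below inline that check; if you prefer, a small helper (named `extract_value` here purely for illustration) normalizes both shapes:

```python
def extract_value(output):
    """Normalize eval output: ['5'] -> '5', 0.87 -> 0.87."""
    return output[0] if isinstance(output, list) else output
```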
```python
for label, convo in [("Good", good_conversation), ("Bad", bad_conversation)]:
    result = evaluator.evaluate(
        eval_templates="customer_agent_conversation_quality",
        inputs={"conversation": json.dumps(convo)},
        model_name="turing_small",
    )
    eval_result = result.eval_results[0]
    score = eval_result.output[0] if isinstance(eval_result.output, list) else eval_result.output
    print(f"{label} conversation: {score}/5")
    print(f"  Reason: {eval_result.reason}\n")
```

Expected output:
```
Good conversation: 5/5
  Reason: The agent promptly addressed the issue, provided a clear resolution...
Bad conversation: 1/5
  Reason: The agent repeatedly ignored the customer's request, forgot context...
```

Diagnose why the bad conversation failed
Run targeted metrics on the bad conversation to pinpoint specific failure modes.
Context retention: Did the agent remember details from earlier in the conversation?
```python
result = evaluator.evaluate(
    eval_templates="customer_agent_context_retention",
    inputs={"conversation": json.dumps(bad_conversation)},
    model_name="turing_small",
)
eval_result = result.eval_results[0]
print(f"Context retention: {eval_result.output}")
print(f"Reason: {eval_result.reason}")
```

Query handling: Did the agent correctly interpret and answer the user’s questions?
```python
result = evaluator.evaluate(
    eval_templates="customer_agent_query_handling",
    inputs={"conversation": json.dumps(bad_conversation)},
    model_name="turing_small",
)
eval_result = result.eval_results[0]
score = eval_result.output[0] if isinstance(eval_result.output, list) else eval_result.output
print(f"Query handling: {score}")
print(f"Reason: {eval_result.reason}")
```

Loop detection: Did the agent get stuck repeating the same prompts?
```python
result = evaluator.evaluate(
    eval_templates="customer_agent_loop_detection",
    inputs={"conversation": json.dumps(bad_conversation)},
    model_name="turing_small",
)
eval_result = result.eval_results[0]
score = eval_result.output[0] if isinstance(eval_result.output, list) else eval_result.output
print(f"Loop detection: {score}")
print(f"Reason: {eval_result.reason}")
```

Human escalation: Did the agent escalate when the user asked for a manager?
```python
result = evaluator.evaluate(
    eval_templates="customer_agent_human_escalation",
    inputs={"conversation": json.dumps(bad_conversation)},
    model_name="turing_small",
)
eval_result = result.eval_results[0]
score = eval_result.output[0] if isinstance(eval_result.output, list) else eval_result.output
print(f"Human escalation: {score}")
print(f"Reason: {eval_result.reason}")
```

Expected output:
```
Context retention: 12
Reason: The agent asked for the order number twice despite the user providing it...
Query handling: never
Reason: The agent never directly addressed the duplicate charge issue...
Loop detection: frequently
Reason: The agent circled back to "account settings" twice and asked for the order number twice...
Human escalation: Failed
Reason: The user explicitly requested a manager but the agent deflected with an upsell...
```

Each metric catches a different dimension of failure. Together they reveal a clear story: the agent forgot context, ignored the question, looped, and refused to escalate.
Evaluate prompt conformance
`customer_agent_prompt_conformance` checks whether the agent followed its system prompt throughout the conversation. This is the only conversation metric that takes an additional `system_prompt` input.
```python
system_prompt = (
    "You are a billing support agent for TechStore. "
    "Your role is to help customers resolve payment and billing issues. "
    "Always acknowledge the customer's issue first, then investigate. "
    "Never upsell products during a support interaction. "
    "If a customer asks to speak with a manager, escalate immediately."
)

for label, convo in [("Good", good_conversation), ("Bad", bad_conversation)]:
    result = evaluator.evaluate(
        eval_templates="customer_agent_prompt_conformance",
        inputs={
            "system_prompt": system_prompt,
            "conversation": json.dumps(convo),
        },
        model_name="turing_small",
    )
    eval_result = result.eval_results[0]
    score = eval_result.output[0] if isinstance(eval_result.output, list) else eval_result.output
    print(f"{label} conversation - prompt conformance: {score}")
    print(f"  Reason: {eval_result.reason}\n")
```

Expected output:

```
Good conversation - prompt conformance: 95
  Reason: The agent acknowledged the issue, investigated, and resolved it...
Bad conversation - prompt conformance: 8
  Reason: The agent violated multiple system prompt rules: upsold a product, failed to escalate...
```

Full scorecard
Run all key metrics on both conversations in a single diagnostic sweep.
```python
metrics = [
    ("conversation_coherence", "Coherence", "score"),
    ("conversation_resolution", "Resolution", "score"),
    ("customer_agent_conversation_quality", "Quality", "choice"),
    ("customer_agent_context_retention", "Context", "score"),
    ("customer_agent_query_handling", "Queries", "choice"),
    ("customer_agent_loop_detection", "Loops", "choice"),
    ("customer_agent_human_escalation", "Escalation", "passfail"),
]

print(f"{'Metric':<14} {'Good':>12} {'Bad':>12}")
print("-" * 40)
for metric_name, label, _ in metrics:
    good_result = evaluator.evaluate(
        eval_templates=metric_name,
        inputs={"conversation": json.dumps(good_conversation)},
        model_name="turing_small",
    )
    bad_result = evaluator.evaluate(
        eval_templates=metric_name,
        inputs={"conversation": json.dumps(bad_conversation)},
        model_name="turing_small",
    )
    good_raw = good_result.eval_results[0].output
    bad_raw = bad_result.eval_results[0].output
    good_val = good_raw[0] if isinstance(good_raw, list) else good_raw
    bad_val = bad_raw[0] if isinstance(bad_raw, list) else bad_raw
    if metric_name == "customer_agent_conversation_quality":
        # The quality metric is a 1-5 choice; show it as n/5 in the table.
        good_val, bad_val = f"{good_val}/5", f"{bad_val}/5"
    print(f"{label:<14} {str(good_val):>12} {str(bad_val):>12}")
```

Expected output:
```
Metric                 Good          Bad
----------------------------------------
Coherence               1.0          0.4
Resolution              1.0          0.0
Quality                 5/5          1/5
Context                  95           12
Queries              always        never
Loops                 never   frequently
Escalation           Passed       Failed
```

The good conversation passes every metric. The bad conversation fails across the board: low coherence, unresolved, poor quality, lost context, never handled queries correctly, frequently looped, and failed to escalate.
Run the eval group from the dashboard
You can run all conversational agent metrics at once from the dashboard using the Conversational agent evaluation eval group — no code required.
- Go to app.futureagi.com → Dataset
- Open a dataset that has a `conversation` column (JSON array of `role`/`content` messages) and a `system_prompt` column containing the agent’s system prompt
- Click Evaluate → Add Evaluations
- Under Groups, select Conversational agent evaluation — this adds all 13 metrics in one click
- Map the `conversation` column to the conversation input, and the `system_prompt` column to the system prompt input — this is needed for `customer_agent_prompt_conformance`, which checks whether the agent followed its instructions
- Click Add & Run
All metrics run in parallel. Scores appear as new columns alongside your data — one column per metric. Most metrics only need the `conversation` column; the `system_prompt` mapping is used by `customer_agent_prompt_conformance` and ignored by the rest.
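For reference, a dataset row would carry the conversation as a JSON-encoded string alongside the plain-text system prompt; here is a minimal illustrative sketch (values invented):

```python
import json

# Illustrative dataset row: `conversation` holds a JSON array of
# role/content messages, `system_prompt` holds the agent's instructions.
row = {
    "conversation": json.dumps([
        {"role": "user", "content": "Hi, I was charged twice for order #7291."},
        {"role": "assistant", "content": "Let me look up order #7291 right away."},
    ]),
    "system_prompt": "You are a billing support agent for TechStore.",
}
```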
Tip
Eval groups are also available in Simulation workflows — select the group when configuring a simulation and all metrics run automatically on every completed conversation.
Eval reference
| Eval | Input keys | Output | What it catches |
|---|---|---|---|
| `conversation_coherence` | `conversation` | Score (0–1) | Topic drift, contradictions, non-sequiturs |
| `conversation_resolution` | `conversation` | Score (0–1) | Unresolved issues, abandoned threads |
| `customer_agent_conversation_quality` | `conversation` | Choice (1–5) | Overall quality: clarity, helpfulness, tone |
| `customer_agent_context_retention` | `conversation` | Score (0–100) | Forgetting earlier details |
| `customer_agent_query_handling` | `conversation` | Choice (never–always) | Misinterpreting or ignoring questions |
| `customer_agent_loop_detection` | `conversation` | Choice (never–always) | Repetitive prompts, circular behavior |
| `customer_agent_human_escalation` | `conversation` | Pass or Fail | Failure to escalate when needed |
| `customer_agent_prompt_conformance` | `system_prompt`, `conversation` | Score (0–100) | Deviating from persona or guidelines |
| `customer_agent_objection_handling` | `conversation` | Choice (never–always) | Handling customer pushback |
| `customer_agent_clarification_seeking` | `conversation` | Choice (never–always) | Asking for clarification vs guessing |
| `customer_agent_termination_handling` | `conversation` | Choice (never–always) | Abrupt hang-ups, crashes, early cut-offs |
| `customer_agent_interruption_handling` | `conversation` | Score (0–100) | Recovering smoothly after interruptions |
| `customer_agent_language_handling` | `conversation` | Score (0–100) | Language/dialect detection and switching |
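The metrics not demonstrated above follow the same call pattern. For example, a sketch of running `customer_agent_clarification_seeking` on the bad conversation, reusing the evaluator and conversations defined earlier:

```python
result = evaluator.evaluate(
    eval_templates="customer_agent_clarification_seeking",
    inputs={"conversation": json.dumps(bad_conversation)},
    model_name="turing_small",
)
eval_result = result.eval_results[0]
# Choice-based output arrives as a list, e.g. ['never'].
score = eval_result.output[0] if isinstance(eval_result.output, list) else eval_result.output
print(f"Clarification seeking: {score}")
print(f"Reason: {eval_result.reason}")
```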
What you built
You can now evaluate multi-turn customer support conversations across quality, failure diagnosis, prompt conformance, and comparative scorecards using built-in Turing metrics.
- Scored a customer support conversation for overall quality with `customer_agent_conversation_quality`
- Diagnosed specific failure modes: context loss, poor query handling, repetitive loops, and missed escalation
- Checked whether the agent followed its system prompt with `customer_agent_prompt_conformance`
- Ran a full scorecard comparing a good conversation against a bad one across 7 metrics
- Used the Conversational agent evaluation group to run all 13 metrics on a dataset from the dashboard