RAG Pipeline Evaluation: Debug Retrieval vs Generation
Score retrieval quality and generation quality independently, with five metrics in a single evaluate() call, to pinpoint whether your RAG pipeline is failing at retrieval or generation.
| Time | Difficulty | Package |
|---|---|---|
| 15 min | Intermediate | ai-evaluation |
- FutureAGI account → app.futureagi.com
- API keys: `FI_API_KEY` and `FI_SECRET_KEY` (see Get your API keys)
- Python 3.9+
Install
```bash
pip install ai-evaluation
export FI_API_KEY="your-api-key"
export FI_SECRET_KEY="your-secret-key"
```
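Before making any evaluation calls, a quick plain-Python sanity check (not part of `ai-evaluation`) can confirm both credentials are set:

```python
import os

def missing_credentials(env=os.environ) -> list:
    """Return the names of required credential variables that are unset or empty."""
    return [k for k in ("FI_API_KEY", "FI_SECRET_KEY") if not env.get(k)]

# With only one key set, the other is reported as missing.
print(missing_credentials({"FI_API_KEY": "your-api-key"}))  # ['FI_SECRET_KEY']
```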
RAG evaluation metrics at a glance
A RAG pipeline has two stages that can fail independently: retrieval (did you fetch the right chunks?) and generation (did the LLM use those chunks correctly?). These five metrics help you isolate the problem.
| Metric | Stage | Required keys | Output type | What it measures |
|---|---|---|---|---|
| `context_relevance` | Retrieval | `context`, `input` | score | Are the retrieved chunks relevant to the query? |
| `chunk_attribution` | Retrieval | `context`, `output` | Pass/Fail | Was the context chunk used in generating the response? |
| `chunk_utilization` | Retrieval | `context`, `output` | score | How effectively does the response use the context chunks? |
| `completeness` | Generation | `input`, `output` | score | Does the response fully address all parts of the query? |
| `factual_accuracy` | Generation | `output`, `context`; `input` optional | score | Are the facts in the output correct? |
For hallucination-specific metrics (faithfulness, groundedness, and context_adherence), see Hallucination Detection.
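The required-key groupings in the table above can be captured as data, which makes it easy to catch a missing key before making an evaluation call. This is a plain-Python sketch: the mapping mirrors the table and is not part of the `ai-evaluation` API.

```python
# Required input keys per metric, mirroring the table above.
REQUIRED_KEYS = {
    "context_relevance": {"context", "input"},
    "chunk_attribution": {"context", "output"},
    "chunk_utilization": {"context", "output"},
    "completeness": {"input", "output"},
    "factual_accuracy": {"context", "output"},  # input is optional
}

def missing_keys(metric: str, payload: dict) -> set:
    """Return the required keys that are absent from a payload for a given metric."""
    return REQUIRED_KEYS[metric] - payload.keys()

# completeness needs both input and output; here output is missing.
print(missing_keys("completeness", {"input": "What is the refund policy?"}))  # {'output'}
```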
Set up a RAG test case
Define a realistic query, retrieved context chunks, and generated answer. This example simulates a company knowledge-base RAG system.
```python
query = "What is the refund policy and how long does processing take?"

retrieved_context = (
    "Chunk 1: Customers may request a full refund within 30 days of purchase. "
    "Refunds are processed within 5-7 business days after approval. "
    "Chunk 2: To initiate a refund, contact support@example.com with your order number. "
    "Chunk 3: Gift cards and promotional items are non-refundable. "
    "Chunk 4: Our company was founded in 2015 and is headquartered in San Francisco."
)

generated_answer = (
    "You can request a full refund within 30 days of purchase. "
    "Once approved, refunds are processed in 5-7 business days. "
    "To start, email support@example.com with your order number. "
    "Gift cards and promotional items cannot be refunded."
)
```

Note that Chunk 4 is irrelevant to the query — a common retrieval problem. The metrics below will surface this.
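If your retriever returns chunks as a list rather than a single string, you can label and join them into the same shape. This is a plain-Python sketch; the `Chunk N:` prefixes are a readability convention used in this example, not a requirement of the evaluator.

```python
chunks = [
    "Customers may request a full refund within 30 days of purchase.",
    "To initiate a refund, contact support@example.com with your order number.",
]

# Label each chunk so evaluation reasons can reference them by number.
retrieved_context = " ".join(
    f"Chunk {i}: {text}" for i, text in enumerate(chunks, start=1)
)
print(retrieved_context)
```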
Score retrieval quality
These three metrics evaluate whether your retriever fetched the right chunks and whether the LLM actually used them.
```python
from fi.evals import evaluate

query = "What is the refund policy and how long does processing take?"

retrieved_context = (
    "Chunk 1: Customers may request a full refund within 30 days of purchase. "
    "Refunds are processed within 5-7 business days after approval. "
    "Chunk 2: To initiate a refund, contact support@example.com with your order number. "
    "Chunk 3: Gift cards and promotional items are non-refundable. "
    "Chunk 4: Our company was founded in 2015 and is headquartered in San Francisco."
)

generated_answer = (
    "You can request a full refund within 30 days of purchase. "
    "Once approved, refunds are processed in 5-7 business days. "
    "To start, email support@example.com with your order number. "
    "Gift cards and promotional items cannot be refunded."
)

# Context relevance — are the retrieved chunks relevant to the query?
# Required: context, input
relevance = evaluate(
    "context_relevance",
    context=retrieved_context,
    input=query,
    model="turing_small",
)
print(f"Context Relevance : score={relevance.score} passed={relevance.passed}")
print(f" Reason: {relevance.reason}\n")

# Chunk attribution — was the context chunk used in the response?
# Required: context, output
attribution = evaluate(
    "chunk_attribution",
    output=generated_answer,
    context=retrieved_context,
    model="turing_small",
)
print(f"Chunk Attribution : score={attribution.score} passed={attribution.passed}")
print(f" Reason: {attribution.reason}\n")

# Chunk utilization — how effectively does the response use the context?
# Required: context, output
utilization = evaluate(
    "chunk_utilization",
    output=generated_answer,
    context=retrieved_context,
    model="turing_small",
)
print(f"Chunk Utilization : score={utilization.score} passed={utilization.passed}")
print(f" Reason: {utilization.reason}\n")
```

Expected output (scores may vary):
```
Context Relevance : score=0.75 passed=True
 Reason: Three of four chunks are relevant to the query; Chunk 4 is unrelated.

Chunk Attribution : score=Passed passed=True
 Reason: Every claim in the output maps to a specific context chunk.

Chunk Utilization : score=0.75 passed=True
 Reason: The output uses content from 3 of 4 retrieved chunks.
```

Tip: Low `context_relevance` or `chunk_utilization` with high `chunk_attribution` means your retriever is fetching irrelevant chunks. Fix your embedding model or retrieval logic. High relevance but low attribution means the LLM is generating claims not grounded in any chunk.
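The heuristic in the tip above can be written down as a small function. This is illustrative only: how you decide "low" vs. "high" is up to you; here each metric's boolean `passed` flag is assumed as the threshold.

```python
def diagnose_retrieval(relevance_ok: bool, attribution_ok: bool, utilization_ok: bool) -> str:
    """Map a retrieval-metric pattern to its likely cause, per the tip above."""
    if attribution_ok and not (relevance_ok and utilization_ok):
        # Claims are grounded, but irrelevant or unused chunks were fetched.
        return "retriever is fetching irrelevant chunks"
    if relevance_ok and not attribution_ok:
        # Relevant chunks were fetched, but the output isn't grounded in them.
        return "LLM is generating claims not grounded in any chunk"
    return "retrieval looks healthy"

print(diagnose_retrieval(relevance_ok=False, attribution_ok=True, utilization_ok=False))
```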
Score generation quality
These metrics evaluate whether the LLM fully addressed the query and produced factually accurate claims.
```python
from fi.evals import evaluate

query = "What is the refund policy and how long does processing take?"

retrieved_context = (
    "Chunk 1: Customers may request a full refund within 30 days of purchase. "
    "Refunds are processed within 5-7 business days after approval. "
    "Chunk 2: To initiate a refund, contact support@example.com with your order number. "
    "Chunk 3: Gift cards and promotional items are non-refundable. "
    "Chunk 4: Our company was founded in 2015 and is headquartered in San Francisco."
)

generated_answer = (
    "You can request a full refund within 30 days of purchase. "
    "Once approved, refunds are processed in 5-7 business days. "
    "To start, email support@example.com with your order number. "
    "Gift cards and promotional items cannot be refunded."
)

# Completeness — does the response fully address the query?
# Required: input, output
completeness = evaluate(
    "completeness",
    input=query,
    output=generated_answer,
    model="turing_small",
)
print(f"Completeness : score={completeness.score} passed={completeness.passed}")
print(f" Reason: {completeness.reason}\n")

# Factual accuracy — are the facts in the output correct?
# Required: output, context; input optional
accuracy = evaluate(
    "factual_accuracy",
    input=query,
    output=generated_answer,
    context=retrieved_context,
    model="turing_small",
)
print(f"Factual Accuracy : score={accuracy.score} passed={accuracy.passed}")
print(f" Reason: {accuracy.reason}\n")
```

Expected output (scores may vary):
```
Completeness : score=1.0 passed=True
 Reason: The response fully addresses the query including refund eligibility, processing time, and exceptions.

Factual Accuracy : score=1.0 passed=True
 Reason: All stated facts are accurate and confirmed by the provided context.
```

Run all metrics for a diagnostic overview
Each RAG metric requires different input keys, so group them by required keys.
```python
from fi.evals import evaluate

query = "What is the refund policy and how long does processing take?"

retrieved_context = (
    "Chunk 1: Customers may request a full refund within 30 days of purchase. "
    "Refunds are processed within 5-7 business days after approval. "
    "Chunk 2: To initiate a refund, contact support@example.com with your order number. "
    "Chunk 3: Gift cards and promotional items are non-refundable. "
    "Chunk 4: Our company was founded in 2015 and is headquartered in San Francisco."
)

generated_answer = (
    "You can request a full refund within 30 days of purchase. "
    "Once approved, refunds are processed in 5-7 business days. "
    "To start, email support@example.com with your order number. "
    "Gift cards and promotional items cannot be refunded."
)

# Group 1: context + input
relevance = evaluate(
    "context_relevance",
    context=retrieved_context,
    input=query,
    model="turing_small",
)

# Group 2: context + output
retrieval_scores = evaluate(
    ["chunk_attribution", "chunk_utilization"],
    context=retrieved_context,
    output=generated_answer,
    model="turing_small",
)

# Group 3: input + output (completeness) and input + output + context (factual_accuracy)
completeness = evaluate(
    "completeness",
    input=query,
    output=generated_answer,
    model="turing_small",
)
accuracy = evaluate(
    "factual_accuracy",
    input=query,
    output=generated_answer,
    context=retrieved_context,
    model="turing_small",
)

# Merge all results
all_results = [relevance] + list(retrieval_scores) + [completeness, accuracy]

print("=== RAG Pipeline Diagnostic ===\n")
for r in all_results:
    status = "PASS" if r.passed else "FAIL"
    print(f"{r.eval_name:<22} score={str(r.score):<10} {status}")
    print(f" Reason: {r.reason}\n")
```

Expected output (scores may vary):
```
=== RAG Pipeline Diagnostic ===

context_relevance      score=0.75       PASS
 Reason: Three of four chunks are relevant; Chunk 4 is off-topic.

chunk_attribution      score=Passed     PASS
 Reason: Every output claim maps to a specific context chunk.

chunk_utilization      score=0.75       PASS
 Reason: Output uses 3 of 4 chunks; Chunk 4 is unused.

completeness           score=1.0        PASS
 Reason: The response fully addresses all parts of the query.

factual_accuracy       score=1.0        PASS
 Reason: All stated facts are accurate and confirmed by the context.
```

Interpret results — retrieval problem or generation problem?
Use the diagnostic output to decide where to focus your effort:
| Pattern | Diagnosis | Fix |
|---|---|---|
| Low `context_relevance` + low `chunk_utilization` | Retriever fetches irrelevant chunks | Improve embeddings, re-rank, or tune top-k |
| High `context_relevance` + low `chunk_attribution` | LLM fabricates claims beyond the context | Add grounding instructions to the system prompt |
| High `context_relevance` + low `completeness` | LLM doesn't fully address the query | Restructure the prompt to cover all parts of the question |
| High `context_relevance` + low `factual_accuracy` | LLM distorts facts from the context | Switch to a more capable model or reduce temperature |
| All high | Pipeline is working well | Monitor over time for regressions |
```python
# Quick decision logic you can add to a CI pipeline
scores = {r.eval_name: r for r in all_results}

retrieval_ok = (
    scores["context_relevance"].passed
    and scores["chunk_utilization"].passed
)
generation_ok = (
    scores["completeness"].passed
    and scores["factual_accuracy"].passed
)

if not retrieval_ok:
    print("Action: Improve retrieval — check embeddings, re-ranking, or top-k settings.")
elif not generation_ok:
    print("Action: Improve generation — tune the prompt, lower temperature, or switch models.")
else:
    print("Pipeline healthy.")
```

Note
For deeper hallucination analysis (checking whether the output contradicts or drifts from the context), combine these metrics with faithfulness and groundedness from Hallucination Detection.
What you built
You can now evaluate any RAG pipeline end-to-end, isolating retrieval failures from generation failures with five targeted metrics and a diagnostic decision framework.
- Scored retrieval quality with `context_relevance`, `chunk_attribution`, and `chunk_utilization` to check whether the right chunks were fetched and used
- Scored generation quality with `completeness` (did the response address the full query?) and `factual_accuracy` (are the facts correct?)
- Ran all five metrics grouped by required input keys for a full RAG diagnostic
- Built a decision framework to isolate retrieval failures from generation failures