RAG Pipeline Evaluation: Debug Retrieval vs Generation

Score retrieval quality and generation quality independently to pinpoint whether your RAG pipeline is failing at retrieval or generation.

TL;DR

Score retrieval quality and generation quality independently with five metrics in a single evaluate() call to pinpoint whether your RAG pipeline fails at retrieval or generation.

Time: 15 min · Difficulty: Intermediate · Package: ai-evaluation
Prerequisites

Install the package and set your API credentials:

pip install ai-evaluation
export FI_API_KEY="your-api-key"
export FI_SECRET_KEY="your-secret-key"

RAG evaluation metrics at a glance

A RAG pipeline has two stages that can fail independently: retrieval (did you fetch the right chunks?) and generation (did the LLM use those chunks correctly?). These five metrics help you isolate the problem.

| Metric | Stage | Required keys | Output type | What it measures |
|---|---|---|---|---|
| context_relevance | Retrieval | context, input | score | Are the retrieved chunks relevant to the query? |
| chunk_attribution | Retrieval | context, output | Pass/Fail | Was the context chunk used in generating the response? |
| chunk_utilization | Retrieval | context, output | score | How effectively does the response use the context chunks? |
| completeness | Generation | input, output | score | Does the response fully address all parts of the query? |
| factual_accuracy | Generation | output, context (input optional) | score | Are the facts in the output correct? |

For hallucination-specific metrics (faithfulness, groundedness, and context_adherence), see Hallucination Detection.
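The required-keys column above can be encoded as a small lookup so you can validate your inputs before spending an evaluation call. The sketch below is illustrative: the dict mirrors the table, but the names (`REQUIRED_KEYS`, `check_inputs`) are hypothetical helpers, not part of the ai-evaluation API.

```python
# Required input keys per metric, mirroring the table above.
# REQUIRED_KEYS and check_inputs are illustrative helper names,
# not part of the ai-evaluation API.
REQUIRED_KEYS = {
    "context_relevance": {"context", "input"},
    "chunk_attribution": {"context", "output"},
    "chunk_utilization": {"context", "output"},
    "completeness": {"input", "output"},
    "factual_accuracy": {"context", "output"},  # input is optional
}

def check_inputs(metric: str, **kwargs) -> set:
    """Return the set of required keys missing from kwargs for a metric."""
    provided = {k for k, v in kwargs.items() if v is not None}
    return REQUIRED_KEYS[metric] - provided

# Example: completeness needs both input and output
missing = check_inputs("completeness", input="What is the refund policy?")
```

Calling `check_inputs` before `evaluate()` turns a confusing server-side error into an immediate, local one.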

Set up a RAG test case

Define a realistic query, retrieved context chunks, and generated answer. This example simulates a company knowledge-base RAG system.

query = "What is the refund policy and how long does processing take?"

retrieved_context = (
    "Chunk 1: Customers may request a full refund within 30 days of purchase. "
    "Refunds are processed within 5-7 business days after approval. "
    "Chunk 2: To initiate a refund, contact support@example.com with your order number. "
    "Chunk 3: Gift cards and promotional items are non-refundable. "
    "Chunk 4: Our company was founded in 2015 and is headquartered in San Francisco."
)

generated_answer = (
    "You can request a full refund within 30 days of purchase. "
    "Once approved, refunds are processed in 5-7 business days. "
    "To start, email support@example.com with your order number. "
    "Gift cards and promotional items cannot be refunded."
)

Note that Chunk 4 is irrelevant to the query — a common retrieval problem. The metrics below will surface this.
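Because the context is a single concatenated string, you can recover the individual chunks with a small regex if you want to score them one at a time (for example, running context_relevance per chunk to pinpoint the off-topic one). The splitter below is an illustrative helper that assumes the "Chunk N:" labelling used in this example.

```python
import re

def split_chunks(context: str) -> list:
    """Split a 'Chunk 1: ... Chunk 2: ...' string into individual chunk texts.
    Assumes the 'Chunk N:' labels used in this example."""
    parts = re.split(r"Chunk \d+:\s*", context)
    return [p.strip() for p in parts if p.strip()]

retrieved_context = (
    "Chunk 1: Customers may request a full refund within 30 days of purchase. "
    "Refunds are processed within 5-7 business days after approval. "
    "Chunk 2: To initiate a refund, contact support@example.com with your order number. "
    "Chunk 3: Gift cards and promotional items are non-refundable. "
    "Chunk 4: Our company was founded in 2015 and is headquartered in San Francisco."
)

chunks = split_chunks(retrieved_context)
# chunks[3] is the off-topic company-history chunk
```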

Score retrieval quality

These three metrics evaluate whether your retriever fetched the right chunks and whether the LLM actually used them.

from fi.evals import evaluate

query = "What is the refund policy and how long does processing take?"

retrieved_context = (
    "Chunk 1: Customers may request a full refund within 30 days of purchase. "
    "Refunds are processed within 5-7 business days after approval. "
    "Chunk 2: To initiate a refund, contact support@example.com with your order number. "
    "Chunk 3: Gift cards and promotional items are non-refundable. "
    "Chunk 4: Our company was founded in 2015 and is headquartered in San Francisco."
)

generated_answer = (
    "You can request a full refund within 30 days of purchase. "
    "Once approved, refunds are processed in 5-7 business days. "
    "To start, email support@example.com with your order number. "
    "Gift cards and promotional items cannot be refunded."
)

# Context relevance — are the retrieved chunks relevant to the query?
# Required: context, input
relevance = evaluate(
    "context_relevance",
    context=retrieved_context,
    input=query,
    model="turing_small",
)
print(f"Context Relevance : score={relevance.score}  passed={relevance.passed}")
print(f"  Reason: {relevance.reason}\n")

# Chunk attribution — was the context chunk used in the response?
# Required: context, output
attribution = evaluate(
    "chunk_attribution",
    output=generated_answer,
    context=retrieved_context,
    model="turing_small",
)
print(f"Chunk Attribution : score={attribution.score}  passed={attribution.passed}")
print(f"  Reason: {attribution.reason}\n")

# Chunk utilization — how effectively does the response use the context?
# Required: context, output
utilization = evaluate(
    "chunk_utilization",
    output=generated_answer,
    context=retrieved_context,
    model="turing_small",
)
print(f"Chunk Utilization : score={utilization.score}  passed={utilization.passed}")
print(f"  Reason: {utilization.reason}\n")

Expected output (scores may vary):

Context Relevance : score=0.75  passed=True
  Reason: Three of four chunks are relevant to the query; Chunk 4 is unrelated.

Chunk Attribution : score=Passed  passed=True
  Reason: Every claim in the output maps to a specific context chunk.

Chunk Utilization : score=0.75  passed=True
  Reason: The output uses content from 3 of 4 retrieved chunks.

Tip

Low context_relevance or chunk_utilization with high chunk_attribution means your retriever is fetching irrelevant chunks. Fix your embedding model or retrieval logic. High relevance but low attribution means the LLM is generating claims not grounded in any chunk.
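The rules in this tip can be written down as a tiny triage function. This is an illustrative sketch operating on plain score values; the 0.7 threshold is an arbitrary example, not an ai-evaluation default, so tune it to your own pass criteria.

```python
def triage_retrieval(relevance: float, utilization: float,
                     attribution_passed: bool, threshold: float = 0.7) -> str:
    """Apply the rules from the tip above. The 0.7 threshold is an
    arbitrary example value, not an ai-evaluation default."""
    if (relevance < threshold or utilization < threshold) and attribution_passed:
        return "retriever: fetching irrelevant chunks; fix embeddings or retrieval logic"
    if relevance >= threshold and not attribution_passed:
        return "generator: claims not grounded in any chunk; add grounding instructions"
    return "retrieval looks healthy"

print(triage_retrieval(relevance=0.4, utilization=0.5, attribution_passed=True))
```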

Score generation quality

These metrics evaluate whether the LLM fully addressed the query and produced factually accurate claims.

from fi.evals import evaluate

query = "What is the refund policy and how long does processing take?"

retrieved_context = (
    "Chunk 1: Customers may request a full refund within 30 days of purchase. "
    "Refunds are processed within 5-7 business days after approval. "
    "Chunk 2: To initiate a refund, contact support@example.com with your order number. "
    "Chunk 3: Gift cards and promotional items are non-refundable. "
    "Chunk 4: Our company was founded in 2015 and is headquartered in San Francisco."
)

generated_answer = (
    "You can request a full refund within 30 days of purchase. "
    "Once approved, refunds are processed in 5-7 business days. "
    "To start, email support@example.com with your order number. "
    "Gift cards and promotional items cannot be refunded."
)

# Completeness — does the response fully address the query?
# Required: input, output
completeness = evaluate(
    "completeness",
    input=query,
    output=generated_answer,
    model="turing_small",
)
print(f"Completeness     : score={completeness.score}  passed={completeness.passed}")
print(f"  Reason: {completeness.reason}\n")

# Factual accuracy — are the facts in the output correct?
# Required: output, context; input optional
accuracy = evaluate(
    "factual_accuracy",
    input=query,
    output=generated_answer,
    context=retrieved_context,
    model="turing_small",
)
print(f"Factual Accuracy : score={accuracy.score}  passed={accuracy.passed}")
print(f"  Reason: {accuracy.reason}\n")

Expected output (scores may vary):

Completeness     : score=1.0  passed=True
  Reason: The response fully addresses the query including refund eligibility, processing time, and exceptions.

Factual Accuracy : score=1.0  passed=True
  Reason: All stated facts are accurate and confirmed by the provided context.

Run all metrics for a diagnostic overview

The five metrics require different sets of input keys, so group the evaluate() calls by the keys they share.

from fi.evals import evaluate

query = "What is the refund policy and how long does processing take?"

retrieved_context = (
    "Chunk 1: Customers may request a full refund within 30 days of purchase. "
    "Refunds are processed within 5-7 business days after approval. "
    "Chunk 2: To initiate a refund, contact support@example.com with your order number. "
    "Chunk 3: Gift cards and promotional items are non-refundable. "
    "Chunk 4: Our company was founded in 2015 and is headquartered in San Francisco."
)

generated_answer = (
    "You can request a full refund within 30 days of purchase. "
    "Once approved, refunds are processed in 5-7 business days. "
    "To start, email support@example.com with your order number. "
    "Gift cards and promotional items cannot be refunded."
)

# Group 1: context + input
relevance = evaluate(
    "context_relevance",
    context=retrieved_context,
    input=query,
    model="turing_small",
)

# Group 2: context + output
retrieval_scores = evaluate(
    ["chunk_attribution", "chunk_utilization"],
    context=retrieved_context,
    output=generated_answer,
    model="turing_small",
)

# Group 3: input + output (completeness) and input + output + context (factual_accuracy)
completeness = evaluate(
    "completeness",
    input=query,
    output=generated_answer,
    model="turing_small",
)

accuracy = evaluate(
    "factual_accuracy",
    input=query,
    output=generated_answer,
    context=retrieved_context,
    model="turing_small",
)

# Merge all results
all_results = [relevance] + list(retrieval_scores) + [completeness, accuracy]

print("=== RAG Pipeline Diagnostic ===\n")
for r in all_results:
    status = "PASS" if r.passed else "FAIL"
    print(f"{r.eval_name:<22} score={str(r.score):<10} {status}")
    print(f"  Reason: {r.reason}\n")

Expected output (scores may vary):

=== RAG Pipeline Diagnostic ===

context_relevance      score=0.75       PASS
  Reason: Three of four chunks are relevant; Chunk 4 is off-topic.

chunk_attribution      score=Passed     PASS
  Reason: Every output claim maps to a specific context chunk.

chunk_utilization      score=0.75       PASS
  Reason: Output uses 3 of 4 chunks; Chunk 4 is unused.

completeness           score=1.0        PASS
  Reason: The response fully addresses all parts of the query.

factual_accuracy       score=1.0        PASS
  Reason: All stated facts are accurate and confirmed by the context.

Interpret results — retrieval problem or generation problem?

Use the diagnostic output to decide where to focus your effort:

| Pattern | Diagnosis | Fix |
|---|---|---|
| Low context_relevance + low chunk_utilization | Retriever fetches irrelevant chunks | Improve embeddings, re-rank, or tune top-k |
| High context_relevance + low chunk_attribution | LLM fabricates claims beyond the context | Add grounding instructions to the system prompt |
| High context_relevance + low completeness | LLM doesn't fully address the query | Restructure the prompt to cover all parts of the question |
| High context_relevance + low factual_accuracy | LLM distorts facts from the context | Switch to a more capable model or reduce temperature |
| All high | Pipeline is working well | Monitor over time for regressions |

# Quick decision logic you can add to a CI pipeline
scores = {r.eval_name: r for r in all_results}

retrieval_ok = (
    scores["context_relevance"].passed
    and scores["chunk_utilization"].passed
)
generation_ok = (
    scores["completeness"].passed
    and scores["factual_accuracy"].passed
)

if not retrieval_ok:
    print("Action: Improve retrieval — check embeddings, re-ranking, or top-k settings.")
elif not generation_ok:
    print("Action: Improve generation — tune the prompt, lower temperature, or switch models.")
else:
    print("Pipeline healthy.")
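For CI, you typically want an exit code rather than a printout. The sketch below gates on the `.passed` flag of each result object shown throughout this page (`all_results`, `.eval_name`, and `.passed` are the names used above); the set of required metrics is an example you can adjust.

```python
import sys

def gate(results, required=("context_relevance", "chunk_utilization",
                            "completeness", "factual_accuracy")) -> int:
    """Return a process exit code: 0 if all required metrics passed, 1 otherwise.
    Assumes result objects expose .eval_name and .passed as shown above."""
    by_name = {r.eval_name: r for r in results}
    failed = [name for name in required if not by_name[name].passed]
    if failed:
        print(f"RAG gate failed: {', '.join(failed)}")
        return 1
    print("RAG gate passed")
    return 0

# Wire into CI after running the evaluations above:
# sys.exit(gate(all_results))
```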

Note

For deeper hallucination analysis (checking whether the output contradicts or drifts from the context), combine these metrics with faithfulness and groundedness from Hallucination Detection.

What you built

You can now evaluate any RAG pipeline end-to-end, isolating retrieval failures from generation failures with five targeted metrics and a diagnostic decision framework.

  • Scored retrieval quality with context_relevance, chunk_attribution, and chunk_utilization to check whether the right chunks were fetched and used
  • Scored generation quality with completeness (did the response address the full query?) and factual_accuracy (are the facts correct?)
  • Ran all five metrics grouped by required input keys for a full RAG diagnostic
  • Built a decision framework to isolate retrieval failures from generation failures