RAG Pipeline Evaluation: Debug Retrieval vs Generation

Score retrieval quality and generation quality independently to pinpoint whether your RAG pipeline is failing at retrieval or generation.

TL;DR

Score retrieval quality and generation quality independently with five metrics in a single evaluate() call to pinpoint whether your RAG pipeline fails at retrieval or generation.

Time: 15 min · Difficulty: Intermediate · Package: ai-evaluation
Prerequisites

Install the package and set your API credentials:

pip install ai-evaluation
export FI_API_KEY="your-api-key"
export FI_SECRET_KEY="your-secret-key"

RAG evaluation metrics at a glance

A RAG pipeline has two stages that can fail independently: retrieval (did you fetch the right chunks?) and generation (did the LLM use those chunks correctly?). These five metrics help you isolate the problem.

| Metric | Stage | Required keys | Output type | What it measures |
|---|---|---|---|---|
| context_relevance | Retrieval | context, input | score | Are the retrieved chunks relevant to the query? |
| chunk_attribution | Retrieval | context, output | Pass/Fail | Was the context chunk used in generating the response? |
| chunk_utilization | Retrieval | context, output | score | How effectively does the response use the context chunks? |
| completeness | Generation | input, output | score | Does the response fully address all parts of the query? |
| factual_accuracy | Generation | output, context (input optional) | score | Are the facts in the output correct? |

For hallucination-specific metrics (faithfulness, groundedness, and context_adherence), see Hallucination Detection.
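The required-keys column above can be encoded as a small lookup so you can validate your inputs before spending an evaluation call. The sketch below is illustrative: the dict mirrors the table, but the names (`REQUIRED_KEYS`, `check_inputs`) are hypothetical helpers, not part of the ai-evaluation API.

```python
# Required input keys per metric, mirroring the table above.
# REQUIRED_KEYS and check_inputs are illustrative helper names,
# not part of the ai-evaluation API.
REQUIRED_KEYS = {
    "context_relevance": {"context", "input"},
    "chunk_attribution": {"context", "output"},
    "chunk_utilization": {"context", "output"},
    "completeness": {"input", "output"},
    "factual_accuracy": {"context", "output"},  # input is optional
}

def check_inputs(metric: str, **kwargs) -> set:
    """Return the set of required keys missing from kwargs for a metric."""
    provided = {k for k, v in kwargs.items() if v is not None}
    return REQUIRED_KEYS[metric] - provided

# Example: completeness needs both input and output
missing = check_inputs("completeness", input="What is the refund policy?")
```

Calling `check_inputs` before `evaluate()` turns a confusing server-side error into an immediate, local one.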

Set up a RAG test case

Define a realistic query, retrieved context chunks, and generated answer. This example simulates a company knowledge-base RAG system.

query = "What is the refund policy and how long does processing take?"

retrieved_context = (
    "Chunk 1: Customers may request a full refund within 30 days of purchase. "
    "Refunds are processed within 5-7 business days after approval. "
    "Chunk 2: To initiate a refund, contact support@example.com with your order number. "
    "Chunk 3: Gift cards and promotional items are non-refundable. "
    "Chunk 4: Our company was founded in 2015 and is headquartered in San Francisco."
)

generated_answer = (
    "You can request a full refund within 30 days of purchase. "
    "Once approved, refunds are processed in 5-7 business days. "
    "To start, email support@example.com with your order number. "
    "Gift cards and promotional items cannot be refunded."
)

Note that Chunk 4 is irrelevant to the query — a common retrieval problem. The metrics below will surface this.
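Because the context is a single concatenated string, you can recover the individual chunks with a small regex if you want to score them one at a time (for example, running context_relevance per chunk to pinpoint the off-topic one). The splitter below is an illustrative helper that assumes the "Chunk N:" labelling used in this example.

```python
import re

def split_chunks(context: str) -> list:
    """Split a 'Chunk 1: ... Chunk 2: ...' string into individual chunk texts.
    Assumes the 'Chunk N:' labels used in this example."""
    parts = re.split(r"Chunk \d+:\s*", context)
    return [p.strip() for p in parts if p.strip()]

retrieved_context = (
    "Chunk 1: Customers may request a full refund within 30 days of purchase. "
    "Refunds are processed within 5-7 business days after approval. "
    "Chunk 2: To initiate a refund, contact support@example.com with your order number. "
    "Chunk 3: Gift cards and promotional items are non-refundable. "
    "Chunk 4: Our company was founded in 2015 and is headquartered in San Francisco."
)

chunks = split_chunks(retrieved_context)
# chunks[3] is the off-topic company-history chunk
```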

Score retrieval quality

These three metrics evaluate whether your retriever fetched the right chunks and whether the LLM actually used them.

from fi.evals import evaluate

query = "What is the refund policy and how long does processing take?"

retrieved_context = (
    "Chunk 1: Customers may request a full refund within 30 days of purchase. "
    "Refunds are processed within 5-7 business days after approval. "
    "Chunk 2: To initiate a refund, contact support@example.com with your order number. "
    "Chunk 3: Gift cards and promotional items are non-refundable. "
    "Chunk 4: Our company was founded in 2015 and is headquartered in San Francisco."
)

generated_answer = (
    "You can request a full refund within 30 days of purchase. "
    "Once approved, refunds are processed in 5-7 business days. "
    "To start, email support@example.com with your order number. "
    "Gift cards and promotional items cannot be refunded."
)

# Context relevance — are the retrieved chunks relevant to the query?
# Required: context, input
relevance = evaluate(
    "context_relevance",
    context=retrieved_context,
    input=query,
    model="turing_small",
)
print(f"Context Relevance : score={relevance.score}  passed={relevance.passed}")
print(f"  Reason: {relevance.reason}\n")

# Chunk attribution — was the context chunk used in the response?
# Required: context, output
attribution = evaluate(
    "chunk_attribution",
    output=generated_answer,
    context=retrieved_context,
    model="turing_small",
)
print(f"Chunk Attribution : score={attribution.score}  passed={attribution.passed}")
print(f"  Reason: {attribution.reason}\n")

# Chunk utilization — how effectively does the response use the context?
# Required: context, output
utilization = evaluate(
    "chunk_utilization",
    output=generated_answer,
    context=retrieved_context,
    model="turing_small",
)
print(f"Chunk Utilization : score={utilization.score}  passed={utilization.passed}")
print(f"  Reason: {utilization.reason}\n")

Expected output (scores may vary):

Context Relevance : score=0.75  passed=True
  Reason: Three of four chunks are relevant to the query; Chunk 4 is unrelated.

Chunk Attribution : score=Passed  passed=True
  Reason: Every claim in the output maps to a specific context chunk.

Chunk Utilization : score=0.75  passed=True
  Reason: The output uses content from 3 of 4 retrieved chunks.

Tip

Low context_relevance or chunk_utilization with high chunk_attribution means your retriever is fetching irrelevant chunks. Fix your embedding model or retrieval logic. High relevance but low attribution means the LLM is generating claims not grounded in any chunk.
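The rules in this tip can be written down as a tiny triage function. This is an illustrative sketch operating on plain score values; the 0.7 threshold is an arbitrary example, not an ai-evaluation default, so tune it to your own pass criteria.

```python
def triage_retrieval(relevance: float, utilization: float,
                     attribution_passed: bool, threshold: float = 0.7) -> str:
    """Apply the rules from the tip above. The 0.7 threshold is an
    arbitrary example value, not an ai-evaluation default."""
    if (relevance < threshold or utilization < threshold) and attribution_passed:
        return "retriever: fetching irrelevant chunks; fix embeddings or retrieval logic"
    if relevance >= threshold and not attribution_passed:
        return "generator: claims not grounded in any chunk; add grounding instructions"
    return "retrieval looks healthy"

print(triage_retrieval(relevance=0.4, utilization=0.5, attribution_passed=True))
```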

Score generation quality

These metrics evaluate whether the LLM fully addressed the query and produced factually accurate claims.

from fi.evals import evaluate

query = "What is the refund policy and how long does processing take?"

retrieved_context = (
    "Chunk 1: Customers may request a full refund within 30 days of purchase. "
    "Refunds are processed within 5-7 business days after approval. "
    "Chunk 2: To initiate a refund, contact support@example.com with your order number. "
    "Chunk 3: Gift cards and promotional items are non-refundable. "
    "Chunk 4: Our company was founded in 2015 and is headquartered in San Francisco."
)

generated_answer = (
    "You can request a full refund within 30 days of purchase. "
    "Once approved, refunds are processed in 5-7 business days. "
    "To start, email support@example.com with your order number. "
    "Gift cards and promotional items cannot be refunded."
)

# Completeness — does the response fully address the query?
# Required: input, output
completeness = evaluate(
    "completeness",
    input=query,
    output=generated_answer,
    model="turing_small",
)
print(f"Completeness     : score={completeness.score}  passed={completeness.passed}")
print(f"  Reason: {completeness.reason}\n")

# Factual accuracy — are the facts in the output correct?
# Required: output, context; input optional
accuracy = evaluate(
    "factual_accuracy",
    input=query,
    output=generated_answer,
    context=retrieved_context,
    model="turing_small",
)
print(f"Factual Accuracy : score={accuracy.score}  passed={accuracy.passed}")
print(f"  Reason: {accuracy.reason}\n")

Expected output (scores may vary):

Completeness     : score=1.0  passed=True
  Reason: The response fully addresses the query including refund eligibility, processing time, and exceptions.

Factual Accuracy : score=1.0  passed=True
  Reason: All stated facts are accurate and confirmed by the provided context.

Run all metrics for a diagnostic overview

The five metrics require different sets of input keys, so group the evaluate() calls by the keys they share.

from fi.evals import evaluate

query = "What is the refund policy and how long does processing take?"

retrieved_context = (
    "Chunk 1: Customers may request a full refund within 30 days of purchase. "
    "Refunds are processed within 5-7 business days after approval. "
    "Chunk 2: To initiate a refund, contact support@example.com with your order number. "
    "Chunk 3: Gift cards and promotional items are non-refundable. "
    "Chunk 4: Our company was founded in 2015 and is headquartered in San Francisco."
)

generated_answer = (
    "You can request a full refund within 30 days of purchase. "
    "Once approved, refunds are processed in 5-7 business days. "
    "To start, email support@example.com with your order number. "
    "Gift cards and promotional items cannot be refunded."
)

# Group 1: context + input
relevance = evaluate(
    "context_relevance",
    context=retrieved_context,
    input=query,
    model="turing_small",
)

# Group 2: context + output
retrieval_scores = evaluate(
    ["chunk_attribution", "chunk_utilization"],
    context=retrieved_context,
    output=generated_answer,
    model="turing_small",
)

# Group 3: input + output (completeness) and input + output + context (factual_accuracy)
completeness = evaluate(
    "completeness",
    input=query,
    output=generated_answer,
    model="turing_small",
)

accuracy = evaluate(
    "factual_accuracy",
    input=query,
    output=generated_answer,
    context=retrieved_context,
    model="turing_small",
)

# Merge all results
all_results = [relevance] + list(retrieval_scores) + [completeness, accuracy]

print("=== RAG Pipeline Diagnostic ===\n")
for r in all_results:
    status = "PASS" if r.passed else "FAIL"
    print(f"{r.eval_name:<22} score={str(r.score):<10} {status}")
    print(f"  Reason: {r.reason}\n")

Expected output (scores may vary):

=== RAG Pipeline Diagnostic ===

context_relevance      score=0.75       PASS
  Reason: Three of four chunks are relevant; Chunk 4 is off-topic.

chunk_attribution      score=Passed     PASS
  Reason: Every output claim maps to a specific context chunk.

chunk_utilization      score=0.75       PASS
  Reason: Output uses 3 of 4 chunks; Chunk 4 is unused.

completeness           score=1.0        PASS
  Reason: The response fully addresses all parts of the query.

factual_accuracy       score=1.0        PASS
  Reason: All stated facts are accurate and confirmed by the context.

Interpret results — retrieval problem or generation problem?

Use the diagnostic output to decide where to focus your effort:

| Pattern | Diagnosis | Fix |
|---|---|---|
| Low context_relevance + low chunk_utilization | Retriever fetches irrelevant chunks | Improve embeddings, re-rank, or tune top-k |
| High context_relevance + low chunk_attribution | LLM fabricates claims beyond the context | Add grounding instructions to the system prompt |
| High context_relevance + low completeness | LLM doesn't fully address the query | Restructure the prompt to cover all parts of the question |
| High context_relevance + low factual_accuracy | LLM distorts facts from the context | Switch to a more capable model or reduce temperature |
| All high | Pipeline is working well | Monitor over time for regressions |

# Quick decision logic you can add to a CI pipeline
scores = {r.eval_name: r for r in all_results}

retrieval_ok = (
    scores["context_relevance"].passed
    and scores["chunk_utilization"].passed
)
generation_ok = (
    scores["completeness"].passed
    and scores["factual_accuracy"].passed
)

if not retrieval_ok:
    print("Action: Improve retrieval — check embeddings, re-ranking, or top-k settings.")
elif not generation_ok:
    print("Action: Improve generation — tune the prompt, lower temperature, or switch models.")
else:
    print("Pipeline healthy.")
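For CI, you typically want an exit code rather than a printout. The sketch below gates on the `.passed` flag of each result object shown throughout this page (`all_results`, `.eval_name`, and `.passed` are the names used above); the set of required metrics is an example you can adjust.

```python
import sys

def gate(results, required=("context_relevance", "chunk_utilization",
                            "completeness", "factual_accuracy")) -> int:
    """Return a process exit code: 0 if all required metrics passed, 1 otherwise.
    Assumes result objects expose .eval_name and .passed as shown above."""
    by_name = {r.eval_name: r for r in results}
    failed = [name for name in required if not by_name[name].passed]
    if failed:
        print(f"RAG gate failed: {', '.join(failed)}")
        return 1
    print("RAG gate passed")
    return 0

# Wire into CI after running the evaluations above:
# sys.exit(gate(all_results))
```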

Note

For deeper hallucination analysis (checking whether the output contradicts or drifts from the context), combine these metrics with faithfulness and groundedness from Hallucination Detection.

What you built

You can now evaluate any RAG pipeline end-to-end, isolating retrieval failures from generation failures with five targeted metrics and a diagnostic decision framework.

  • Scored retrieval quality with context_relevance, chunk_attribution, and chunk_utilization to check whether the right chunks were fetched and used
  • Scored generation quality with completeness (did the response address the full query?) and factual_accuracy (are the facts correct?)
  • Ran all five metrics grouped by required input keys for a full RAG diagnostic
  • Built a decision framework to isolate retrieval failures from generation failures