The Problem

You have built a RAG-powered support bot for an insurance company. Users are complaining:
  • “The bot said my claim was covered, but it wasn’t”
  • “It gave me the wrong deductible amount”
  • “It pulled up completely irrelevant policy sections”
You need to figure out where the pipeline is failing. Is retrieval pulling the wrong documents? Or is the LLM hallucinating despite having the right context? These are two very different problems with two very different fixes. This cookbook evaluates each stage of the RAG pipeline separately so you know exactly what to fix.

What You Will Learn

  • How to measure retrieval quality with context recall, precision, and utilization
  • How to measure generation quality with faithfulness, groundedness, and answer relevancy
  • How to diagnose four common failure modes: good pipeline, hallucinating LLM, bad retrieval, noisy retrieval
  • How to run a RAG scorecard that tests all metrics at once

Prerequisites

pip install ai-evaluation
No API keys required for the core metrics. An optional section at the end shows LLM-augmented scoring with Gemini.

Case A: Everything Works

Start with a well-functioning pipeline to establish a baseline. The retriever finds the right chunks, and the LLM generates a faithful answer.
from fi.evals import evaluate

question = "Is physical therapy covered under my plan, and what's my copay?"

ground_truth = (
    "Physical therapy is covered under the Gold Plan. The copay is $30 per "
    "visit for in-network providers. Out-of-network physical therapy requires "
    "prior authorization and has a $75 copay. Maximum 30 visits per year."
)

good_chunks = [
    "Gold Plan Coverage -- Physical Therapy: Covered for in-network providers. "
    "Copay: $30 per visit. Maximum 30 visits per calendar year.",

    "Out-of-Network Services: Physical therapy out-of-network requires prior "
    "authorization. Copay: $75 per visit.",

    "Gold Plan Benefits Summary: Includes preventive care, specialist visits, "
    "physical therapy, mental health services, and prescription drug coverage.",
]

good_answer = (
    "Yes, physical therapy is covered under your Gold Plan. For in-network "
    "providers, your copay is $30 per visit, up to 30 visits per year. "
    "If you go out-of-network, you'll need prior authorization and the "
    "copay increases to $75 per visit."
)
Now measure retrieval and generation independently:
# RETRIEVAL -- did we find the right documents?
recall = evaluate(
    "context_recall", output=good_answer,
    context=good_chunks, expected_output=ground_truth,
)
precision = evaluate(
    "context_precision", output=good_answer,
    context=good_chunks, input=question,
)
print("Retrieval:")
print(f"  Context recall:    {recall.score:.2f}  (found the right info)")
print(f"  Context precision: {precision.score:.2f}  (chunks are relevant)")

# GENERATION -- is the answer faithful to what was retrieved?
faith = evaluate("faithfulness", output=good_answer, context=good_chunks)
relevancy = evaluate("answer_relevancy", output=good_answer, input=question)
grounded = evaluate("groundedness", output=good_answer, context=good_chunks)
print("\nGeneration:")
print(f"  Faithfulness:      {faith.score:.2f}  (answer matches context)")
print(f"  Answer relevancy:  {relevancy.score:.2f}  (addresses the question)")
print(f"  Groundedness:      {grounded.score:.2f}  (grounded in evidence)")
Expected output:
Retrieval:
  Context recall:    0.90  (found the right info)
  Context precision: 0.85  (chunks are relevant)

Generation:
  Faithfulness:      0.92  (answer matches context)
  Answer relevancy:  0.88  (addresses the question)
  Groundedness:      0.90  (grounded in evidence)
All scores are high. The pipeline is working correctly.

Case B: Good Retrieval, Bad Generation (Hallucination)

The retriever finds the right chunks, but the LLM invents facts not in the context.
hallucinated_answer = (
    "Physical therapy is covered with a $15 copay for unlimited visits. "
    "No prior authorization is needed, even for out-of-network providers. "
    "Your plan also covers chiropractic care and acupuncture."
)

faith = evaluate("faithfulness", output=hallucinated_answer, context=good_chunks)
grounded = evaluate("groundedness", output=hallucinated_answer, context=good_chunks)
recall = evaluate(
    "context_recall", output=hallucinated_answer,
    context=good_chunks, expected_output=ground_truth,
)

print("Retrieval:")
print(f"  Context recall:    {recall.score:.2f}  (retrieval was fine)")
print("\nGeneration:")
print(f"  Faithfulness:      {faith.score:.2f}  (LLM made up facts!)")
print(f"  Groundedness:      {grounded.score:.2f}  (not grounded)")
Expected output:
Retrieval:
  Context recall:    0.85  (retrieval was fine)

Generation:
  Faithfulness:      0.15  (LLM made up facts!)
  Groundedness:      0.10  (not grounded)
Diagnosis: Retrieval is fine — the right documents were found. The LLM is hallucinating. Fix: Add a faithfulness check before sending the response. Use augment=True for higher accuracy.
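Such a pre-send check can be a small gate in front of the response. This is a minimal sketch with the scorer injected as a function, so you can plug in something like `lambda ans, ctx: evaluate("faithfulness", output=ans, context=ctx, augment=True).score`; the names `guard_response` and `FALLBACK` are illustrative, not part of the library.

```python
# Pre-send faithfulness gate (sketch). `score_fn` stands in for a call to
# evaluate("faithfulness", ...); the threshold is a starting point to tune.
FALLBACK = (
    "I couldn't verify that answer against your policy documents. "
    "Please contact support."
)

def guard_response(answer: str, context: list[str], score_fn,
                   threshold: float = 0.7) -> str:
    """Return the answer only if it scores as faithful to the retrieved context."""
    if score_fn(answer, context) >= threshold:
        return answer
    return FALLBACK
```

With a stub scorer, a faithful answer passes through and an unfaithful one is replaced by the fallback before it ever reaches the user.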

Case C: Bad Retrieval, Faithful Generation

The retriever pulls completely wrong documents (dental coverage instead of physical therapy), but the LLM faithfully summarizes what it was given.
wrong_chunks = [
    "Silver Plan Dental Coverage: Dental cleanings twice per year. "
    "Copay: $25 for preventive, $100 for restorative procedures.",

    "Employee Assistance Program: 6 free counseling sessions per year. "
    "Available to all plan members and their dependents.",

    "Prescription Drug Formulary: Tier 1 generics $10, Tier 2 preferred "
    "brands $30, Tier 3 specialty $75.",
]

faithful_but_wrong = (
    "Based on your plan documents, dental cleanings have a $25 copay "
    "and you get 6 free counseling sessions. For prescriptions, "
    "generic drugs cost $10."
)

faith = evaluate("faithfulness", output=faithful_but_wrong, context=wrong_chunks)
relevancy = evaluate("answer_relevancy", output=faithful_but_wrong, input=question)
precision = evaluate(
    "context_precision", output=faithful_but_wrong,
    context=wrong_chunks, input=question,
)

print("Retrieval:")
print(f"  Context precision: {precision.score:.2f}  (chunks are irrelevant!)")
print("\nGeneration:")
print(f"  Faithfulness:      {faith.score:.2f}  (faithful to wrong context)")
print(f"  Answer relevancy:  {relevancy.score:.2f}  (doesn't address the question)")
Expected output:
Retrieval:
  Context precision: 0.05  (chunks are irrelevant!)

Generation:
  Faithfulness:      0.85  (faithful to wrong context)
  Answer relevancy:  0.10  (doesn't address the question)
Diagnosis: Retrieval failure. The LLM was faithful to what it received, but the retrieved chunks had nothing to do with the question. Fix: Improve embedding model, add reranking, check chunk boundaries.
High faithfulness does not mean the answer is correct. It only means the answer matches the retrieved context. If retrieval is wrong, a “faithful” answer is still wrong. Always check both sides.
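One low-effort retrieval fix is reranking: score each retrieved chunk against the question and keep only the best matches before generation. The sketch below uses plain token overlap as a stand-in for a real embedding model or cross-encoder; `rerank_chunks` is an illustrative name, not library API.

```python
# Toy reranker (sketch): orders chunks by token overlap with the question
# and keeps the top-k. In production, score with an embedding model or a
# cross-encoder instead of this overlap heuristic.
def rerank_chunks(question: str, chunks: list[str], top_k: int = 2) -> list[str]:
    q_tokens = set(question.lower().split())

    def overlap(chunk: str) -> int:
        return len(q_tokens & set(chunk.lower().split()))

    return sorted(chunks, key=overlap, reverse=True)[:top_k]
```

Run against the Case C scenario, a dental or holiday-schedule chunk scores near zero and drops out, which lifts context precision before the LLM ever sees the context.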

Case D: Noisy Retrieval

The retriever returns a mix of relevant and irrelevant chunks. Two out of four chunks are noise (holiday schedule, IT help desk).
noisy_chunks = [
    "Gold Plan Coverage -- Physical Therapy: Covered for in-network providers. "
    "Copay: $30 per visit. Maximum 30 visits per calendar year.",

    "Company holiday schedule: New Year's Day, MLK Day, Presidents' Day...",

    "IT Department: To reset your password, visit portal.company.com/reset",

    "Out-of-Network Services: Physical therapy out-of-network requires prior "
    "authorization. Copay: $75 per visit.",
]

precision = evaluate(
    "context_precision", output=good_answer, context=noisy_chunks, input=question,
)
utilization = evaluate(
    "context_utilization", output=good_answer, context=noisy_chunks,
)
noise = evaluate(
    "noise_sensitivity", output=good_answer, context=noisy_chunks, input=question,
)

print(f"Context precision:   {precision.score:.2f}  (only ~50% relevant)")
print(f"Context utilization: {utilization.score:.2f}  (used what was relevant)")
print(f"Noise sensitivity:   {noise.score:.2f}  (affected by noise)")
Diagnosis: Retrieval is pulling in irrelevant documents. Fix: Increase similarity threshold, add metadata filtering, or add a reranker.
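Raising the similarity threshold can be as simple as filtering the scored results your vector store already returns, instead of always passing top-k chunks to the LLM. A minimal sketch, assuming `(chunk, similarity)` pairs as most vector stores provide; the 0.75 floor and function name are illustrative defaults to tune against your own data.

```python
# Similarity floor (sketch): drop retrieved chunks whose score falls below
# a threshold, so off-topic chunks like a holiday schedule never reach the
# prompt. Tune the floor against labeled examples.
def filter_by_threshold(scored_chunks: list[tuple[str, float]],
                        floor: float = 0.75) -> list[str]:
    return [chunk for chunk, score in scored_chunks if score >= floor]
```

This trades recall for precision: set the floor too high and you reintroduce Case C, so watch context recall while tuning it.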

RAG Scorecard: All Metrics at Once

Run all metrics in a single batch call to get a quick health check of your pipeline.
batch = evaluate(
    ["faithfulness", "answer_relevancy", "groundedness", "context_utilization"],
    output=good_answer,
    context=good_chunks,
    input=question,
)

for r in batch:
    status = "PASS" if r.passed else "FAIL"
    print(f"{r.eval_name:<25} {r.score:.2f}  {status}")
print(f"\nOverall: {batch.success_rate:.0%} passed")
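In CI, the scorecard can become a hard gate: fail the check when any metric drops below its floor. This sketch assumes you collect `eval_name`/`score` pairs from the batch results into a dict; the threshold values and `scorecard_passes` name are illustrative starting points, not library defaults.

```python
# Scorecard gate (sketch): every metric must clear its floor. Thresholds
# are illustrative; calibrate them against your own labeled examples.
THRESHOLDS = {
    "faithfulness": 0.8,
    "answer_relevancy": 0.7,
    "groundedness": 0.8,
    "context_utilization": 0.6,
}

def scorecard_passes(results: dict[str, float]) -> bool:
    return all(results.get(name, 0.0) >= floor
               for name, floor in THRESHOLDS.items())
```

A missing metric counts as 0.0 and fails the gate, so a silently skipped evaluation cannot slip through.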

RAG Debugging Checklist

| Symptom | Metric to Check | Fix |
| --- | --- | --- |
| Wrong facts in answer | faithfulness, groundedness | Fix generation prompt, add guardrails |
| Irrelevant answer | answer_relevancy | Fix query understanding or retrieval |
| Wrong documents retrieved | context_precision, context_recall | Improve embeddings, add reranking |
| Noise in retrieved chunks | context_utilization, noise_sensitivity | Raise similarity threshold, add filtering |
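The checklist above can be folded into a small triage helper that maps the score patterns from Cases A through D back to a named failure mode. The 0.7 cutoff and the `diagnose` name are illustrative, not library constants.

```python
# Triage helper (sketch) mirroring Cases A-D: compare a retrieval-side
# score with a generation-side score to name the failure mode.
def diagnose(context_precision: float, faithfulness: float,
             cutoff: float = 0.7) -> str:
    retrieval_ok = context_precision >= cutoff
    generation_ok = faithfulness >= cutoff
    if retrieval_ok and generation_ok:
        return "healthy pipeline"
    if retrieval_ok:
        return "hallucinating LLM -- fix the generation prompt"
    if generation_ok:
        return "bad retrieval -- fix embeddings/reranking"
    return "both stages failing -- fix retrieval first"
```

Feeding in the scores from the four cases reproduces each diagnosis, e.g. Case B's (0.85, 0.15) maps to the hallucination branch.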

What to Try Next

Now that you can diagnose RAG failures, protect your pipeline’s inputs from attacks.

Next: Guardrails

Build a sub-10ms security middleware that blocks jailbreaks, code injection, PII leaks, and secret exposure.