You have built a RAG-powered support bot for an insurance company. Users are complaining:
“The bot said my claim was covered, but it wasn’t”
“It gave me the wrong deductible amount”
“It pulled up completely irrelevant policy sections”
You need to figure out where the pipeline is failing. Is retrieval pulling the wrong documents? Or is the LLM hallucinating despite having the right context? These are two very different problems with two very different fixes. This cookbook evaluates each stage of the RAG pipeline separately so you know exactly what to fix.
Case A: Good Retrieval, Good Generation (Baseline)
Start with a well-functioning pipeline to establish a baseline: the retriever finds the right chunks, and the LLM generates a faithful answer.
```python
from fi.evals import evaluate

question = "Is physical therapy covered under my plan, and what's my copay?"

ground_truth = (
    "Physical therapy is covered under the Gold Plan. The copay is $30 per "
    "visit for in-network providers. Out-of-network physical therapy requires "
    "prior authorization and has a $75 copay. Maximum 30 visits per year."
)

good_chunks = [
    "Gold Plan Coverage -- Physical Therapy: Covered for in-network providers. "
    "Copay: $30 per visit. Maximum 30 visits per calendar year.",
    "Out-of-Network Services: Physical therapy out-of-network requires prior "
    "authorization. Copay: $75 per visit.",
    "Gold Plan Benefits Summary: Includes preventive care, specialist visits, "
    "physical therapy, mental health services, and prescription drug coverage.",
]

good_answer = (
    "Yes, physical therapy is covered under your Gold Plan. For in-network "
    "providers, your copay is $30 per visit, up to 30 visits per year. "
    "If you go out-of-network, you'll need prior authorization and the "
    "copay increases to $75 per visit."
)
```
Now measure retrieval and generation independently:
```python
# RETRIEVAL -- did we find the right documents?
recall = evaluate(
    "context_recall",
    output=good_answer,
    context=good_chunks,
    expected_output=ground_truth,
)
precision = evaluate(
    "context_precision",
    output=good_answer,
    context=good_chunks,
    input=question,
)
print("Retrieval:")
print(f"  Context recall: {recall.score:.2f} (found the right info)")
print(f"  Context precision: {precision.score:.2f} (chunks are relevant)")

# GENERATION -- is the answer faithful to what was retrieved?
faith = evaluate("faithfulness", output=good_answer, context=good_chunks)
relevancy = evaluate("answer_relevancy", output=good_answer, input=question)
grounded = evaluate("groundedness", output=good_answer, context=good_chunks)
print("\nGeneration:")
print(f"  Faithfulness: {faith.score:.2f} (answer matches context)")
print(f"  Answer relevancy: {relevancy.score:.2f} (addresses the question)")
print(f"  Groundedness: {grounded.score:.2f} (grounded in evidence)")
```
Expected output:
```
Retrieval:
  Context recall: 0.90 (found the right info)
  Context precision: 0.85 (chunks are relevant)

Generation:
  Faithfulness: 0.92 (answer matches context)
  Answer relevancy: 0.88 (addresses the question)
  Groundedness: 0.90 (grounded in evidence)
```
All scores are high. The pipeline is working correctly.
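Once you have a healthy baseline, it is worth freezing it as a regression gate so future changes to chunking, embeddings, or prompts can't silently degrade a stage. A minimal sketch, assuming illustrative thresholds (these are not `fi` defaults; calibrate them on your own labeled data):

```python
# Illustrative regression gate: flag any metric that drops below its
# baseline threshold. Threshold values are assumptions, not library defaults.
BASELINE_THRESHOLDS = {
    "context_recall": 0.80,
    "context_precision": 0.75,
    "faithfulness": 0.85,
    "answer_relevancy": 0.80,
    "groundedness": 0.85,
}

def check_baseline(scores: dict) -> list:
    """Return the names of metrics that fell below their threshold."""
    return [
        name
        for name, minimum in BASELINE_THRESHOLDS.items()
        if scores.get(name, 0.0) < minimum
    ]

# With the scores from the run above, nothing fails:
failures = check_baseline({
    "context_recall": 0.90,
    "context_precision": 0.85,
    "faithfulness": 0.92,
    "answer_relevancy": 0.88,
    "groundedness": 0.90,
})
assert failures == []
```

Wiring this into CI means a retrieval regression and a generation regression show up as different failing metric names, which is exactly the separation the rest of this cookbook relies on.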
Case B: Good Retrieval, Bad Generation (Hallucination)
The retriever finds the right chunks, but the LLM invents facts not in the context.
```python
hallucinated_answer = (
    "Physical therapy is covered with a $15 copay for unlimited visits. "
    "No prior authorization is needed, even for out-of-network providers. "
    "Your plan also covers chiropractic care and acupuncture."
)

faith = evaluate("faithfulness", output=hallucinated_answer, context=good_chunks)
grounded = evaluate("groundedness", output=hallucinated_answer, context=good_chunks)
recall = evaluate(
    "context_recall",
    output=hallucinated_answer,
    context=good_chunks,
    expected_output=ground_truth,
)

print("Retrieval:")
print(f"  Context recall: {recall.score:.2f} (retrieval was fine)")
print("\nGeneration:")
print(f"  Faithfulness: {faith.score:.2f} (LLM made up facts!)")
print(f"  Groundedness: {grounded.score:.2f} (not grounded)")
```
Expected output:
```
Retrieval:
  Context recall: 0.85 (retrieval was fine)

Generation:
  Faithfulness: 0.15 (LLM made up facts!)
  Groundedness: 0.10 (not grounded)
```
Diagnosis: Retrieval is fine; the right documents were found. The LLM is hallucinating.
Fix: Add a faithfulness check before sending the response. Use augment=True for higher accuracy.
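That pre-send check can be sketched as a simple gate. This is a sketch, not the library's API: the 0.7 threshold and the fallback message are assumptions, and the commented `evaluate` call shows where the real score would come from.

```python
FAITHFULNESS_THRESHOLD = 0.7  # assumed cutoff; calibrate on labeled examples

def gate_response(answer: str, faithfulness_score: float) -> str:
    """Return the answer only if it passed the faithfulness check;
    otherwise fall back to a safe escalation message."""
    if faithfulness_score >= FAITHFULNESS_THRESHOLD:
        return answer
    return (
        "I couldn't verify that answer against your policy documents. "
        "Let me connect you with a human agent."
    )

# In the pipeline, the score would come from the evaluator, e.g.:
#   faith = evaluate("faithfulness", output=answer, context=chunks, augment=True)
#   reply = gate_response(answer, faith.score)
```

For an insurance bot, a refusal-plus-escalation fallback is usually safer than retrying generation, since a second attempt can hallucinate just as confidently.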
Case C: Bad Retrieval, Good Generation
The retriever pulls completely wrong documents (dental coverage instead of physical therapy), but the LLM faithfully summarizes what it was given.
```python
wrong_chunks = [
    "Silver Plan Dental Coverage: Dental cleanings twice per year. "
    "Copay: $25 for preventive, $100 for restorative procedures.",
    "Employee Assistance Program: 6 free counseling sessions per year. "
    "Available to all plan members and their dependents.",
    "Prescription Drug Formulary: Tier 1 generics $10, Tier 2 preferred "
    "brands $30, Tier 3 specialty $75.",
]

faithful_but_wrong = (
    "Based on your plan documents, dental cleanings have a $25 copay "
    "and you get 6 free counseling sessions. For prescriptions, "
    "generic drugs cost $10."
)

faith = evaluate("faithfulness", output=faithful_but_wrong, context=wrong_chunks)
relevancy = evaluate("answer_relevancy", output=faithful_but_wrong, input=question)
precision = evaluate(
    "context_precision",
    output=faithful_but_wrong,
    context=wrong_chunks,
    input=question,
)

print("Retrieval:")
print(f"  Context precision: {precision.score:.2f} (chunks are irrelevant!)")
print("\nGeneration:")
print(f"  Faithfulness: {faith.score:.2f} (faithful to wrong context)")
print(f"  Answer relevancy: {relevancy.score:.2f} (doesn't address the question)")
```
Expected output:
```
Retrieval:
  Context precision: 0.05 (chunks are irrelevant!)

Generation:
  Faithfulness: 0.85 (faithful to wrong context)
  Answer relevancy: 0.10 (doesn't address the question)
```
Diagnosis: Retrieval failure. The LLM was faithful to what it received, but the retrieved chunks had nothing to do with the question.
Fix: Improve the embedding model, add a reranking step, and check chunk boundaries.
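To see what reranking buys you, here is a deliberately toy sketch: it scores chunks by word overlap with the question instead of using a real cross-encoder. The tokenizer and scoring are assumptions for illustration; the interface (score each question/chunk pair, sort descending) is the part that carries over to production rerankers.

```python
import re

def tokens(text: str) -> set:
    """Lowercased alphanumeric tokens, stripping punctuation."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))

def rerank_by_overlap(question: str, chunks: list) -> list:
    """Toy reranker: order chunks by token overlap with the question.
    A production system would use a cross-encoder, but the shape is
    the same: score each (question, chunk) pair, then sort."""
    q_tokens = tokens(question)
    return sorted(chunks, key=lambda c: len(q_tokens & tokens(c)), reverse=True)

mixed = [
    "Silver Plan Dental Coverage: Dental cleanings twice per year.",
    "Gold Plan Coverage -- Physical Therapy: Copay: $30 per visit.",
]
ranked = rerank_by_overlap(
    "Is physical therapy covered under my plan, and what's my copay?", mixed
)
# The physical therapy chunk now outranks the dental chunk.
```

Even this crude scorer would have demoted the dental chunks in Case C; a learned reranker does the same thing with far better judgment on paraphrases and synonyms.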
High faithfulness does not mean the answer is correct. It only means the answer matches the retrieved context. If retrieval is wrong, a “faithful” answer is still wrong. Always check both sides.
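The three cases collapse into a single routing rule: check a retrieval-side score and a generation-side score together and send the failure to the right fix. A minimal sketch, assuming an illustrative 0.7 threshold for both sides:

```python
def diagnose(context_precision: float, faithfulness: float,
             threshold: float = 0.7) -> str:
    """Route a RAG trace to the right fix. The shared 0.7 threshold is
    illustrative; calibrate each metric's cutoff on your own data."""
    retrieval_ok = context_precision >= threshold
    generation_ok = faithfulness >= threshold
    if retrieval_ok and generation_ok:
        return "healthy"
    if retrieval_ok:
        return "generation failure: gate on faithfulness, constrain the LLM"
    if generation_ok:
        return "retrieval failure: improve embeddings, reranking, chunking"
    return "both stages failing: fix retrieval first, then re-check generation"

# The three cases from this cookbook:
assert diagnose(0.85, 0.92) == "healthy"                       # Case A
assert diagnose(0.85, 0.15).startswith("generation failure")   # Case B
assert diagnose(0.05, 0.85).startswith("retrieval failure")    # Case C
```

Fixing retrieval first when both sides fail matters because generation metrics are only meaningful once the model is given context worth being faithful to.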