The Problem

Your LLM judge keeps scoring paraphrased medical responses too low. You correct it, but the next time a similar case comes up, it makes the same mistake. There is no learning between evaluations. You need a feedback loop: store your corrections, and when a similar input comes up again, retrieve those corrections as few-shot examples so the judge knows how to handle it.

What You Will Learn

  • How to store evaluation corrections in ChromaDB with semantic embeddings
  • How to retrieve similar past corrections via vector search
  • How to inject feedback as few-shot examples into the LLM judge prompt
  • How to compare results with and without feedback
  • How to calibrate optimal thresholds from your feedback data

Prerequisites

pip install ai-evaluation chromadb
export GOOGLE_API_KEY=your-gemini-api-key
This cookbook requires chromadb for persistent vector storage and a GOOGLE_API_KEY for the LLM judge. For testing without ChromaDB, an InMemoryFeedbackStore is also available.

Step 1: Run Faithfulness Without Feedback

First, establish a baseline. Run the faithfulness metric on a case where the heuristic typically struggles — a paraphrased response.
from fi.evals import evaluate

MODEL = "gemini/gemini-2.5-flash"

test_output = "The patient should take ibuprofen twice daily for pain relief"
test_context = "Prescribe ibuprofen 2x per day for pain management"

result_no_feedback = evaluate(
    "faithfulness",
    output=test_output,
    context=test_context,
    model=MODEL,
    augment=True,
)
print(f"Score WITHOUT feedback: {result_no_feedback.score}")
print(f"Reason: {result_no_feedback.reason[:200]}")
The judge may or may not handle this correctly. The key question is: can we make it consistently correct by providing examples?

Step 2: Submit Feedback Corrections

Create a feedback store and submit developer corrections. Each correction maps an (input, output) pair to the score and reason you believe are correct.
import tempfile
from fi.evals.core.result import EvalResult
from fi.evals.feedback import (
    FeedbackCollector,
    ChromaFeedbackStore,
    FeedbackRetriever,
)

tmpdir = tempfile.mkdtemp(prefix="fi_feedback_")
store = ChromaFeedbackStore(persist_directory=tmpdir)
collector = FeedbackCollector(store)

corrections = [
    {
        "output": "Apply the cream twice daily",
        "context": "Use topical cream 2x per day",
        "original_score": 0.3,
        "correct_score": 0.95,
        "reason": "Semantically equivalent -- 'twice daily' == '2x per day'",
    },
    {
        "output": "Take 500mg of ibuprofen for pain",
        "context": "Prescribe 500mg ibuprofen for pain management",
        "original_score": 0.4,
        "correct_score": 0.9,
        "reason": "Faithful -- correctly states the prescription",
    },
    {
        "output": "Take this medication forever",
        "context": "Take for 7 days only",
        "original_score": 0.7,
        "correct_score": 0.1,
        "reason": "UNFAITHFUL -- hallucinated 'forever', context says 7 days",
    },
    {
        "output": "Avoid all physical activity",
        "context": "Light exercise is recommended during recovery",
        "original_score": 0.5,
        "correct_score": 0.05,
        "reason": "UNFAITHFUL -- directly contradicts context recommendation",
    },
    {
        "output": "The dosage is 200mg per day",
        "context": "Recommended daily dose: 200 milligrams",
        "original_score": 0.35,
        "correct_score": 0.95,
        "reason": "Faithful -- exact same dosage, just different wording",
    },
]

for c in corrections:
    fake_result = EvalResult(
        eval_name="faithfulness",
        score=c["original_score"],
        reason=f"Heuristic score: {c['original_score']}",
    )
    collector.submit(
        fake_result,
        inputs={"output": c["output"], "context": c["context"]},
        correct_score=c["correct_score"],
        correct_reason=c["reason"],
    )
    print(f"  {c['original_score']:.1f} -> {c['correct_score']:.2f} | {c['reason'][:55]}")

print(f"\nStored entries: {store.count('faithfulness')}")
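Each submission conceptually becomes one record that the store can later embed and search. A minimal sketch of that shape (field names here are illustrative, not the SDK's actual schema):

```python
correction = {
    "output": "Apply the cream twice daily",
    "context": "Use topical cream 2x per day",
    "original_score": 0.3,
    "correct_score": 0.95,
    "reason": "Semantically equivalent -- 'twice daily' == '2x per day'",
}

# What the store conceptually keeps for each submission:
# the inputs (which get embedded), plus the corrected verdict.
record = {
    "eval_name": "faithfulness",
    "inputs": {"output": correction["output"], "context": correction["context"]},
    "original_score": correction["original_score"],
    "correct_score": correction["correct_score"],
    "correct_reason": correction["reason"],
}
```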

Step 3: See What Gets Retrieved

Before running the evaluation, inspect which past corrections are retrieved for our test input. The retriever uses semantic similarity to find the most relevant examples.
retriever = FeedbackRetriever(store=store, max_examples=3)
examples = retriever.retrieve_few_shot_examples(
    "faithfulness",
    {"output": test_output, "context": test_context},
)
print(f"Retrieved {len(examples)} similar feedback entries")
The retriever finds corrections about paraphrased medical dosages — exactly the pattern our test case follows.
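Under the hood, this retrieval is a nearest-neighbor search over embeddings of the stored inputs. A self-contained sketch of the idea, using toy bag-of-words vectors and cosine similarity purely for illustration (ChromaDB uses learned dense embeddings):

```python
import math
from collections import Counter

def embed(text: str) -> Counter:
    # Toy embedding: word counts. Real stores use learned dense vectors.
    return Counter(text.lower().split())

def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[t] * b[t] for t in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def retrieve(query: str, entries: list[dict], k: int = 3) -> list[dict]:
    # Rank stored corrections by similarity to the new input, keep top-k.
    q = embed(query)
    ranked = sorted(entries, key=lambda e: cosine(q, embed(e["output"])), reverse=True)
    return ranked[:k]

entries = [
    {"output": "Apply the cream twice daily", "correct_score": 0.95},
    {"output": "Take this medication forever", "correct_score": 0.1},
    {"output": "The dosage is 200mg per day", "correct_score": 0.95},
]
top = retrieve("Take ibuprofen twice daily for pain", entries, k=2)
# The "twice daily" paraphrase correction ranks first
```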

Step 4: Run Faithfulness With Feedback

Now re-run the same evaluation, but pass the feedback_store. The SDK retrieves similar past corrections and injects them as few-shot examples into the LLM prompt.
result_with_feedback = evaluate(
    "faithfulness",
    output=test_output,
    context=test_context,
    model=MODEL,
    augment=True,
    feedback_store=store,
)
print(f"Score WITH feedback: {result_with_feedback.score}")
print(f"Reason: {result_with_feedback.reason[:200]}")
print(f"Feedback examples used: "
      f"{result_with_feedback.metadata.get('feedback_examples_used', 0)}")

Step 5: Compare

print(f"WITHOUT feedback: score={result_no_feedback.score}")
print(f"WITH feedback:    score={result_with_feedback.score}")
The judge has learned from your past corrections that paraphrases in medical contexts should be scored high when the meaning is preserved.

Bonus: Test an Unfaithful Case

Verify that the feedback loop does not blindly boost scores. When the output genuinely contradicts the context, the judge should still score low.
bad_output = "Stop all medications immediately"
bad_context = "Continue current medication regimen as prescribed"

result_bad = evaluate(
    "faithfulness",
    output=bad_output,
    context=bad_context,
    model=MODEL,
    augment=True,
    feedback_store=store,
)
print(f"Unfaithful case: score={result_bad.score}")
print(f"Reason: {result_bad.reason[:200]}")
The judge retrieves the contradiction examples from our feedback store and correctly scores this low.

Bonus: Calibrate Thresholds

Use your accumulated feedback data to statistically determine the optimal pass/fail threshold.
from fi.evals.feedback import InMemoryFeedbackStore, FeedbackCollector
from fi.evals.core.result import EvalResult

mem_store = InMemoryFeedbackStore()
cal_collector = FeedbackCollector(mem_store)

for c in corrections:
    fake_result = EvalResult(
        eval_name="faithfulness",
        score=c["original_score"],
        reason="",
    )
    cal_collector.submit(
        fake_result,
        inputs={"output": c["output"], "context": c["context"]},
        correct_score=c["correct_score"],
        correct_reason=c["reason"],
    )

profile = cal_collector.calibrate("faithfulness")
print(f"Optimal threshold: {profile.optimal_threshold}")
print(f"Accuracy:          {profile.accuracy_at_threshold:.0%}")
print(f"Sample size:       {profile.sample_size}")
print(f"TP={profile.true_positives} FP={profile.false_positives} "
      f"TN={profile.true_negatives} FN={profile.false_negatives}")
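Conceptually, calibration is a threshold sweep: pick the cutoff on scores that best reproduces the human pass/fail verdicts. A self-contained sketch of that idea (not the SDK's implementation) using the corrected scores from Step 2, each paired with an explicit pass/fail label:

```python
def sweep_threshold(pairs: list[tuple[float, bool]]) -> tuple[float, float]:
    """Find the score threshold that best reproduces human pass/fail labels.
    pairs: [(score, human_says_pass), ...]"""
    best_t, best_acc = 0.0, -1.0
    # Candidate thresholds: the observed scores themselves
    for t in sorted({s for s, _ in pairs}):
        preds = [s >= t for s, _ in pairs]
        acc = sum(p == label for p, (_, label) in zip(preds, pairs)) / len(pairs)
        if acc > best_acc:
            best_t, best_acc = t, acc
    return best_t, best_acc

# Corrected scores from Step 2, labeled pass (faithful) or fail (unfaithful)
pairs = [(0.95, True), (0.9, True), (0.1, False), (0.05, False), (0.95, True)]
t, acc = sweep_threshold(pairs)
# t = 0.9 separates the faithful and unfaithful cases with accuracy 1.0
```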

How It Works

Developer submits correction
        |
        v
ChromaDB stores (input, output, correct_score, reason)
with semantic embedding
        |
        v
New evaluation arrives
        |
        v
FeedbackRetriever finds similar past corrections
via vector similarity search
        |
        v
Corrections injected as few-shot examples
into the LLM judge prompt
        |
        v
LLM produces calibrated score informed by
your domain expertise
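The injection step at the heart of this diagram is plain prompt construction: retrieved corrections are rendered as worked examples ahead of the new case. A simplified sketch of what that might look like (the SDK's actual prompt template will differ):

```python
def render_few_shot(examples: list[dict]) -> str:
    """Render retrieved corrections as few-shot examples for the judge prompt."""
    blocks = []
    for ex in examples:
        blocks.append(
            f"Context: {ex['context']}\n"
            f"Output: {ex['output']}\n"
            f"Correct score: {ex['correct_score']}\n"
            f"Reason: {ex['reason']}"
        )
    return "\n\n".join(blocks)

def build_prompt(output: str, context: str, examples: list[dict]) -> str:
    # Few-shot corrections go before the case under evaluation
    return (
        "You are a faithfulness judge. Score from 0 to 1.\n\n"
        "Past corrections from a domain expert:\n\n"
        f"{render_few_shot(examples)}\n\n"
        f"Now evaluate:\nContext: {context}\nOutput: {output}\nScore:"
    )

examples = [{
    "context": "Use topical cream 2x per day",
    "output": "Apply the cream twice daily",
    "correct_score": 0.95,
    "reason": "Semantically equivalent paraphrase",
}]
prompt = build_prompt(
    "The patient should take ibuprofen twice daily for pain relief",
    "Prescribe ibuprofen 2x per day for pain management",
    examples,
)
```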

What to Try Next

You have taught the judge to handle text. Now teach it to handle images and audio.

Next: Multimodal Judge

Pass images and audio URLs to the LLM judge to verify product descriptions, check transcriptions, and more.