Feedback Loops

Submit corrections to scoring results, calibrate thresholds over time, and store feedback in ChromaDB for continuous improvement.

📝 TL;DR
  • Submit corrections when a score is wrong — the system learns from them
  • Calibrate thresholds per metric using accumulated feedback
  • In-memory store for development, ChromaDB for production

When a metric gives a wrong score, submit a correction. Corrections are stored and used in two ways: they feed into threshold calibration (tuning the pass/fail cutoff per metric), and when using LLM-as-Judge with a feedback store, past corrections are injected as few-shot examples to guide the LLM.

Note

Requires pip install ai-evaluation. For persistent storage, install ai-evaluation[feedback], which adds ChromaDB.

Quick Example

from fi.evals import evaluate
from fi.evals.feedback import FeedbackCollector, InMemoryFeedbackStore

store = InMemoryFeedbackStore()
feedback = FeedbackCollector(store)

# Run a check
result = evaluate("faithfulness", output="Paris is in Germany.", context="Paris is the capital of France.")

# The score looks wrong — submit a correction
feedback.submit(
    result,
    inputs={"output": "Paris is in Germany.", "context": "Paris is the capital of France."},
    correct_score=0.0,
    correct_reason="Output contradicts context — Paris is in France, not Germany.",
)

FeedbackCollector

Submitting corrections

from fi.evals.feedback import FeedbackCollector, InMemoryFeedbackStore

store = InMemoryFeedbackStore()
feedback = FeedbackCollector(store)

feedback.submit(
    result,                          # EvalResult from evaluate()
    inputs={"output": "...", "context": "..."},  # the inputs used
    correct_score=0.95,              # what the score should have been
    correct_passed=True,             # what passed should have been
    correct_reason="explanation",    # why the correction is right
    tags=["production", "rag"],      # optional tags for filtering
    metadata={"reviewer": "alice"},  # optional metadata
)

Threshold Calibration

After collecting corrections (minimum 5 per metric, aim for 20+ for reliable results), calibrate the pass/fail threshold. The calibrator sweeps across threshold values and finds the one that best matches your corrections.

from fi.evals.feedback import FeedbackCollector, ChromaFeedbackStore

store = ChromaFeedbackStore()
feedback = FeedbackCollector(store)

# After collecting 20+ corrections...
profile = feedback.calibrate("faithfulness")

# CalibrationProfile fields:
print(profile.optimal_threshold)      # float — recommended threshold
print(profile.accuracy_at_threshold)  # float — % of corrections that agree
print(profile.sample_size)            # int — number of corrections used
print(profile.score_mean)             # float — average corrected score
print(profile.score_std)              # float — score standard deviation
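The sweep the calibrator performs can be sketched in plain Python. This is an illustration of the idea, not the SDK's implementation: given corrected (score, passed) pairs, try candidate thresholds and keep the one that agrees with the most corrections.

```python
def sweep_threshold(corrections, candidates=None):
    """Pick the threshold that best reproduces human pass/fail labels.

    corrections: list of (score, correct_passed) pairs, where score is the
    metric's raw score and correct_passed is the human verdict.
    """
    if candidates is None:
        candidates = [i / 100 for i in range(0, 101)]  # sweep 0.00 .. 1.00
    best_threshold, best_accuracy = 0.5, -1.0
    for t in candidates:
        agree = sum((score >= t) == passed for score, passed in corrections)
        accuracy = agree / len(corrections)
        if accuracy > best_accuracy:  # keep the first threshold at each peak
            best_threshold, best_accuracy = t, accuracy
    return best_threshold, best_accuracy

corrections = [(0.92, True), (0.85, True), (0.40, False), (0.55, False), (0.78, True)]
threshold, accuracy = sweep_threshold(corrections)
print(threshold, accuracy)  # → 0.56 1.0
```

With these corrections, any cutoff between 0.55 and 0.78 separates the labels perfectly; the sweep returns the first one it finds. The real calibrator also reports the score statistics shown above.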

Feedback Retrieval

Find similar past corrections to inform current scoring.

from fi.evals.feedback import FeedbackRetriever

retriever = FeedbackRetriever(store)

similar = retriever.retrieve_similar(
    metric_name="faithfulness",
    inputs={"output": "...", "context": "..."},
    top_k=3,
)

for entry in similar:
    print(f"Score: {entry.correct_score}, Reason: {entry.correct_reason}")
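When these retrieved entries reach an LLM judge, they are rendered as few-shot examples. A rough sketch of that formatting step, using plain dicts in place of the SDK's entry objects (the field names mirror the loop above, but the dict layout is an assumption here):

```python
def format_few_shot(entries):
    """Render past corrections as a few-shot block for a judge prompt."""
    lines = []
    for i, entry in enumerate(entries, start=1):
        lines.append(f"Example {i}:")
        lines.append(f"  Output: {entry['inputs']['output']}")
        lines.append(f"  Context: {entry['inputs']['context']}")
        lines.append(f"  Correct score: {entry['correct_score']}")
        lines.append(f"  Reason: {entry['correct_reason']}")
    return "\n".join(lines)

entries = [
    {
        "inputs": {"output": "Paris is in Germany.",
                   "context": "Paris is the capital of France."},
        "correct_score": 0.0,
        "correct_reason": "Output contradicts context.",
    }
]
print(format_few_shot(entries))
```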

Storage Options

InMemoryFeedbackStore

For development and testing. Data is lost when the process exits.

from fi.evals.feedback import InMemoryFeedbackStore

store = InMemoryFeedbackStore()
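Conceptually, the in-memory store is a list plus filters. A toy stand-in (not the SDK class; `add` and `query` are hypothetical names) that mimics the submit-then-filter behavior:

```python
class TinyFeedbackStore:
    """Toy in-memory store: append corrections, filter by metric and tags."""

    def __init__(self):
        self._entries = []

    def add(self, metric_name, correct_score, tags=None, **extra):
        self._entries.append({
            "metric_name": metric_name,
            "correct_score": correct_score,
            "tags": set(tags or []),
            **extra,
        })

    def query(self, metric_name, tags=None):
        # Return entries for this metric that carry all requested tags.
        wanted = set(tags or [])
        return [e for e in self._entries
                if e["metric_name"] == metric_name and wanted <= e["tags"]]

store = TinyFeedbackStore()
store.add("faithfulness", 0.0, tags=["production"])
store.add("faithfulness", 0.95, tags=["staging"])
store.add("toxicity", 1.0)
print(len(store.query("faithfulness", tags=["production"])))  # 1
```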

ChromaFeedbackStore

For production. Persists to disk and uses vector search for semantic retrieval of similar corrections.

# pip install ai-evaluation[feedback]
from fi.evals.feedback import ChromaFeedbackStore

store = ChromaFeedbackStore()  # defaults to local disk

Integrating with evaluate()

Pass a feedback store to evaluate() when using LLM-as-Judge or augmented metrics. The SDK retrieves similar past corrections from the store and injects them as few-shot examples into the LLM prompt, steering the judge toward scores that match your corrections.

from fi.evals import evaluate
from fi.evals.feedback import ChromaFeedbackStore

store = ChromaFeedbackStore()

result = evaluate(
    "faithfulness",
    output="The Earth orbits the Sun.",
    context="The Earth revolves around the Sun in an elliptical orbit.",
    model="gemini/gemini-2.5-flash",
    augment=True,
    feedback_store=store,  # similar corrections injected as few-shot examples
)

Note

feedback_store works with augment=True and engine="llm" modes. For purely local metrics (no model), corrections don’t influence scoring directly — use calibrate() to adjust the threshold instead.
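Applying a calibrated threshold to local-metric scores is a plain comparison. A small sketch (values illustrative) showing which results flip pass/fail when you move from the default cutoff to the one calibrate() recommended:

```python
def recalibrate_passes(scores, old_threshold, new_threshold):
    """List results whose pass/fail verdict flips under the new threshold."""
    flips = []
    for score in scores:
        before = score >= old_threshold
        after = score >= new_threshold
        if before != after:
            flips.append((score, before, after))
    return flips

scores = [0.45, 0.60, 0.71, 0.80, 0.95]
# Moving from the default 0.5 to a calibrated 0.72:
print(recalibrate_passes(scores, old_threshold=0.5, new_threshold=0.72))
# both 0.60 and 0.71 flip from pass to fail
```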
