Feedback Loops
Submit corrections to scoring results, calibrate thresholds over time, and store feedback in ChromaDB for continuous improvement.
- Submit corrections when a score is wrong — the system learns from them
- Calibrate thresholds per metric using accumulated feedback
- In-memory store for development, ChromaDB for production
When a metric gives a wrong score, submit a correction. Corrections are stored and used in two ways: they feed into threshold calibration (tuning the pass/fail cutoff per metric), and when using LLM-as-Judge with a feedback store, past corrections are injected as few-shot examples to guide the LLM.
Note
Requires pip install ai-evaluation. For persistent storage, use pip install ai-evaluation[feedback] (adds ChromaDB).
Quick Example
from fi.evals import evaluate
from fi.evals.feedback import FeedbackCollector, InMemoryFeedbackStore
store = InMemoryFeedbackStore()
feedback = FeedbackCollector(store)
# Run a check
result = evaluate("faithfulness", output="Paris is in Germany.", context="Paris is the capital of France.")
# The score looks wrong — submit a correction
feedback.submit(
    result,
    inputs={"output": "Paris is in Germany.", "context": "Paris is the capital of France."},
    correct_score=0.0,
    correct_reason="Output contradicts context — Paris is in France, not Germany.",
)
FeedbackCollector
Submitting corrections
from fi.evals.feedback import FeedbackCollector, InMemoryFeedbackStore
store = InMemoryFeedbackStore()
feedback = FeedbackCollector(store)
feedback.submit(
    result,                                      # EvalResult from evaluate()
    inputs={"output": "...", "context": "..."},  # the inputs used
    correct_score=0.95,                          # what the score should have been
    correct_passed=True,                         # what passed should have been
    correct_reason="explanation",                # why the correction is right
    tags=["production", "rag"],                  # optional tags for filtering
    metadata={"reviewer": "alice"},              # optional metadata
)
Threshold Calibration
After collecting corrections (minimum 5 per metric, aim for 20+ for reliable results), calibrate the pass/fail threshold. The calibrator sweeps across threshold values and finds the one that best matches your corrections.
from fi.evals.feedback import FeedbackCollector, ChromaFeedbackStore
store = ChromaFeedbackStore()
feedback = FeedbackCollector(store)
# After collecting 20+ corrections...
profile = feedback.calibrate("faithfulness")
# CalibrationProfile fields:
print(profile.optimal_threshold) # float — recommended threshold
print(profile.accuracy_at_threshold) # float — % of corrections that agree
print(profile.sample_size) # int — number of corrections used
print(profile.score_mean) # float — average corrected score
print(profile.score_std) # float — score standard deviation
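To make the sweep concrete, here is a minimal plain-Python sketch of the idea: try candidate thresholds and keep the one whose pass/fail split agrees with the most corrections. This is an illustration only, not the calibrator's actual implementation; `corrections` is a hypothetical list of (corrected_score, corrected_passed) pairs.

```python
def sweep_threshold(corrections, candidates=None):
    """Return the candidate threshold that best reproduces the corrected pass/fail labels."""
    if candidates is None:
        candidates = [i / 100 for i in range(101)]  # 0.00, 0.01, ..., 1.00
    best_threshold, best_accuracy = 0.5, -1.0
    for t in candidates:
        # A correction "agrees" with threshold t if applying t to its
        # corrected score reproduces its corrected pass/fail label.
        agree = sum(1 for score, passed in corrections if (score >= t) == passed)
        accuracy = agree / len(corrections)
        if accuracy > best_accuracy:
            best_threshold, best_accuracy = t, accuracy
    return best_threshold, best_accuracy

# Hypothetical corrections: (what the score should have been, what passed should have been)
corrections = [(0.95, True), (0.90, True), (0.75, True), (0.40, False), (0.30, False)]
threshold, accuracy = sweep_threshold(corrections)
```

With these five corrections, any threshold between 0.40 and 0.75 separates them perfectly, which is why more corrections give a more reliable (less arbitrary) optimum.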
Feedback Retrieval
Find similar past corrections to inform current scoring.
from fi.evals.feedback import FeedbackRetriever
retriever = FeedbackRetriever(store)
similar = retriever.retrieve_similar(
    metric_name="faithfulness",
    inputs={"output": "...", "context": "..."},
    top_k=3,
)
for entry in similar:
    print(f"Score: {entry.correct_score}, Reason: {entry.correct_reason}")
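Under the hood, "similar" means nearest in an embedding space. As a rough illustration of the ranking step, here is a naive bag-of-words cosine similarity standing in for real vector search; `entries` is a hypothetical list of dicts, not the SDK's stored entry type.

```python
import math
from collections import Counter

def cosine(a: Counter, b: Counter) -> float:
    # Cosine similarity between two sparse word-count vectors.
    dot = sum(a[w] * b[w] for w in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def retrieve_similar_naive(entries, query_text, top_k=3):
    # Rank stored corrections by similarity to the current inputs.
    q = Counter(query_text.lower().split())
    scored = [(cosine(q, Counter(e["text"].lower().split())), e) for e in entries]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [e for _, e in scored[:top_k]]

entries = [
    {"text": "Paris is in Germany", "correct_score": 0.0},
    {"text": "The Earth orbits the Sun", "correct_score": 1.0},
]
top = retrieve_similar_naive(entries, "Paris is the capital of France", top_k=1)
```

A real store replaces word counts with learned embeddings, so paraphrases match even with no word overlap.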
Storage Options
InMemoryFeedbackStore
For development and testing. Data is lost when the process exits.
from fi.evals.feedback import InMemoryFeedbackStore
store = InMemoryFeedbackStore()
ChromaFeedbackStore
For production. Persists to disk, uses vector search for semantic retrieval of similar corrections.
pip install ai-evaluation[feedback]
from fi.evals.feedback import ChromaFeedbackStore
store = ChromaFeedbackStore() # defaults to local disk
Integrating with evaluate()
Pass a feedback store to evaluate() when using LLM-as-Judge or augmented metrics. The SDK retrieves similar past corrections from the store and injects them as few-shot examples into the LLM prompt, steering the judge toward scores that match your corrections.
from fi.evals import evaluate
from fi.evals.feedback import ChromaFeedbackStore
store = ChromaFeedbackStore()
result = evaluate(
    "faithfulness",
    output="The Earth orbits the Sun.",
    context="The Earth revolves around the Sun in an elliptical orbit.",
    model="gemini/gemini-2.5-flash",
    augment=True,
    feedback_store=store,  # similar corrections injected as few-shot examples
)
Note
feedback_store takes effect with augment=True and engine="llm" modes. For purely local metrics (no model), corrections don't influence scoring directly — use calibrate() to adjust the threshold instead.