Evaluation-Driven Development: Score Every Prompt Change Before Shipping
Build a local eval loop that scores every prompt change against a test suite, compares before-and-after results, and gates promotion on quality thresholds.
| Time | Difficulty | Package |
|---|---|---|
| 15 min | Intermediate | ai-evaluation, openai |
- FutureAGI account → app.futureagi.com
- API keys: `FI_API_KEY` and `FI_SECRET_KEY` (see Get your API keys)
- Python 3.9+
- OpenAI API key (`OPENAI_API_KEY`)
Install

```shell
pip install ai-evaluation openai

export FI_API_KEY="your-api-key"
export FI_SECRET_KEY="your-secret-key"
export OPENAI_API_KEY="your-openai-api-key"
```
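Before running the loop, it can help to fail fast on missing credentials rather than partway through an API call. A minimal sketch (`missing_keys` is a hypothetical helper, not part of either package):

```python
import os

# Credential names from the Install step above.
REQUIRED_KEYS = ["FI_API_KEY", "FI_SECRET_KEY", "OPENAI_API_KEY"]

def missing_keys(env=None):
    """Return the required credential names that are unset or empty."""
    env = os.environ if env is None else env
    return [k for k in REQUIRED_KEYS if not env.get(k)]

# Demo against an explicit mapping, so this runs anywhere:
print(missing_keys({"FI_API_KEY": "x"}))  # → ['FI_SECRET_KEY', 'OPENAI_API_KEY']
```

Call `missing_keys()` with no argument to check the real environment and abort early if the list is non-empty.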
Define the eval suite
Each test case has a user input and a context the agent should draw from.
```python
TEST_CASES = [
    {
        "input": "What is your return window for electronics?",
        "context": (
            "Electronics may be returned within 30 days of purchase with original "
            "packaging and proof of purchase. Items must be in unused condition."
        ),
    },
    {
        "input": "My order arrived damaged. What should I do?",
        "context": (
            "Customers who receive damaged items should photograph the damage and "
            "contact support within 48 hours. A replacement or full refund will be "
            "issued after review."
        ),
    },
    {
        "input": "Can I return a sale item for a full refund?",
        "context": (
            "Sale items are eligible for exchange only. Full refunds are not available "
            "on sale purchases. Store credit may be offered at management discretion."
        ),
    },
    {
        "input": "How long does standard shipping take?",
        "context": (
            "Standard shipping takes 5-7 business days within the continental US. "
            "Expedited options (2-day and overnight) are available at checkout."
        ),
    },
    {
        "input": "Do you price-match competitors?",
        "context": (
            "We offer a price-match guarantee for identical items sold by authorized "
            "retailers. The match must be requested at the time of purchase. "
            "Marketplace sellers and auction sites are excluded."
        ),
    },
]
```

Write the scoring function
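Before spending API calls, a quick structural check of the suite can catch malformed cases (a sketch; `validate_suite` is a hypothetical helper, not part of the `ai-evaluation` package):

```python
def validate_suite(test_cases):
    """Flag cases missing the non-empty 'input'/'context' fields the scorer reads."""
    problems = []
    for i, case in enumerate(test_cases):
        for field in ("input", "context"):
            if not case.get(field, "").strip():
                problems.append(f"case {i}: missing or empty '{field}'")
    return problems

# Demo with one well-formed and one broken case:
print(validate_suite([
    {"input": "Hi?", "context": "Greeting policy."},
    {"input": "", "context": "Policy."},
]))  # → ["case 1: missing or empty 'input'"]
```

An empty return value means every case is ready for scoring.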
`score_prompt()` calls OpenAI for each test case, runs two evals on every response, and returns per-metric pass rates.

| Metric | Engine | `model=` required? |
|---|---|---|
| `faithfulness` | Local NLI | No; do NOT pass `model=` |
| `toxicity` | FutureAGI Turing | Yes; pass `model="turing_small"` |
```python
import os
from openai import OpenAI
from fi.evals import evaluate

openai_client = OpenAI()

def score_prompt(prompt_template: str, test_cases: list) -> dict:
    faithfulness_passes = 0
    toxicity_passes = 0
    per_case = []

    for case in test_cases:
        system_prompt = prompt_template.format(context=case["context"])
        response = openai_client.chat.completions.create(
            model="gpt-4o-mini",
            messages=[
                {"role": "system", "content": system_prompt},
                {"role": "user", "content": case["input"]},
            ],
        )
        output = response.choices[0].message.content

        # faithfulness: local NLI — no model= argument
        faith_result = evaluate(
            "faithfulness",
            output=output,
            context=case["context"],
        )
        # toxicity: Turing metric — model= is required
        tox_result = evaluate(
            "toxicity",
            output=output,
            model="turing_small",
        )

        if faith_result.passed:
            faithfulness_passes += 1
        if tox_result.passed:
            toxicity_passes += 1
        per_case.append({
            "input": case["input"],
            "output": output,
            "faithfulness_score": faith_result.score,
            "faithfulness_pass": faith_result.passed,
            "faithfulness_reason": faith_result.reason,
            "toxicity_score": tox_result.score,
            "toxicity_pass": tox_result.passed,
            "toxicity_reason": tox_result.reason,
        })

    n = len(test_cases)
    faith_rate = faithfulness_passes / n
    tox_rate = toxicity_passes / n
    return {
        "faithfulness": faith_rate,
        "toxicity": tox_rate,
        "composite": (faith_rate + tox_rate) / 2,
        "per_case": per_case,
    }
```

Score baseline, revise, and compare
Start with a thin prompt, score it, then revise and re-score.
```python
BASELINE_PROMPT = """\
You are a customer support agent.
Answer the customer's question using the information below.

Context:
{context}
"""

REVISED_PROMPT = """\
You are a friendly and professional customer support agent for an e-commerce retailer.

INSTRUCTIONS:
1. Answer ONLY using the information provided in the Context section below.
2. Do NOT add policies, timeframes, or details that are not stated in the Context.
3. If the Context does not contain enough information to fully answer the question,
   say so clearly and offer to escalate to the support team.
4. Keep your response concise (2-4 sentences), empathetic, and solution-focused.

Context:
{context}
"""

def print_results(label: str, results: dict):
    print(f"\n{'='*40}")
    print(f" {label}")
    print(f"{'='*40}")
    print(f"{'Metric':<16} {'Pass rate':>10}")
    print("-" * 28)
    print(f"{'faithfulness':<16} {results['faithfulness']:>9.0%}")
    print(f"{'toxicity':<16} {results['toxicity']:>9.0%}")
    print(f"{'composite':<16} {results['composite']:>9.0%}")
    for i, case in enumerate(results["per_case"], 1):
        faith = "PASS" if case["faithfulness_pass"] else "FAIL"
        tox = "PASS" if case["toxicity_pass"] else "FAIL"
        print(f"  [{i}] {case['input'][:50]:<52} faith={faith} tox={tox}")

# Run both
baseline = score_prompt(BASELINE_PROMPT, TEST_CASES)
revised = score_prompt(REVISED_PROMPT, TEST_CASES)
print_results("BASELINE", baseline)
print_results("REVISED", revised)

# Show delta
print(f"\n--- Improvement ---")
print(f"faithfulness: {baseline['faithfulness']:.0%} → {revised['faithfulness']:.0%}")
print(f"toxicity:     {baseline['toxicity']:.0%} → {revised['toxicity']:.0%}")
print(f"composite:    {baseline['composite']:.0%} → {revised['composite']:.0%}")
```

Expected output:
```
========================================
 BASELINE
========================================
Metric           Pass rate
----------------------------
faithfulness           60%
toxicity               80%
composite              70%
  [1] What is your return window for electronics?          faith=PASS tox=PASS
  [2] My order arrived damaged. What should I do?          faith=FAIL tox=PASS
  [3] Can I return a sale item for a full refund?          faith=PASS tox=PASS
  [4] How long does standard shipping take?                faith=PASS tox=PASS
  [5] Do you price-match competitors?                      faith=FAIL tox=FAIL

========================================
 REVISED
========================================
Metric           Pass rate
----------------------------
faithfulness           80%
toxicity              100%
composite              90%
  [1] What is your return window for electronics?          faith=PASS tox=PASS
  [2] My order arrived damaged. What should I do?          faith=PASS tox=PASS
  [3] Can I return a sale item for a full refund?          faith=PASS tox=PASS
  [4] How long does standard shipping take?                faith=PASS tox=PASS
  [5] Do you price-match competitors?                      faith=FAIL tox=PASS

--- Improvement ---
faithfulness: 60% → 80%
toxicity:     80% → 100%
composite:    70% → 90%
```

If any case still fails, inspect the `reason` to understand what the NLI model flagged:
```python
for case in revised["per_case"]:
    if not case["faithfulness_pass"]:
        print(f"Input:  {case['input']}")
        print(f"Output: {case['output']}")
        print(f"Score:  {case['faithfulness_score']:.2f}")
        print(f"Reason: {case['faithfulness_reason']}")
```

Gate promotion on eval thresholds
Block promotion if any metric falls below your quality bar. The non-zero exit code integrates with Makefiles, pre-commit hooks, and CI scripts.
```python
import sys

FAITHFULNESS_THRESHOLD = 0.75
TOXICITY_THRESHOLD = 0.80

results = score_prompt(REVISED_PROMPT, TEST_CASES)
print(f"faithfulness: {results['faithfulness']:.0%} (threshold: {FAITHFULNESS_THRESHOLD:.0%})")
print(f"toxicity:     {results['toxicity']:.0%} (threshold: {TOXICITY_THRESHOLD:.0%})")

try:
    assert results["faithfulness"] >= FAITHFULNESS_THRESHOLD, (
        f"Faithfulness too low: {results['faithfulness']:.0%} < {FAITHFULNESS_THRESHOLD:.0%}"
    )
    assert results["toxicity"] >= TOXICITY_THRESHOLD, (
        f"Toxicity too low: {results['toxicity']:.0%} < {TOXICITY_THRESHOLD:.0%}"
    )
    print("\nPrompt approved for production push.")
    sys.exit(0)
except AssertionError as e:
    print(f"\nGATE FAILED: {e}")
    print("Fix the prompt and re-run before promoting.")
    sys.exit(1)
```

Tip
Once the local gate passes, automate the same checks on every pull request. See Automated Eval in CI/CD for the full GitHub Actions setup with PR comments and branch protection.
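Absolute thresholds catch low quality but not backsliding from a previously better prompt. A complementary check, sketched below under the assumption that `baseline` and `revised` are the result dicts returned by `score_prompt()` (`assert_no_regression` is a hypothetical helper, not part of the package):

```python
def assert_no_regression(baseline: dict, revised: dict,
                         metrics=("faithfulness", "toxicity", "composite")):
    """Raise AssertionError if any metric's pass rate dropped vs. the baseline."""
    drops = [
        f"{m}: {baseline[m]:.0%} -> {revised[m]:.0%}"
        for m in metrics
        if revised[m] < baseline[m]
    ]
    assert not drops, "Regressions: " + "; ".join(drops)

# Demo with the pass rates from the walkthrough above:
assert_no_regression(
    {"faithfulness": 0.60, "toxicity": 0.80, "composite": 0.70},
    {"faithfulness": 0.80, "toxicity": 1.00, "composite": 0.90},
)  # passes silently
```

Persisting the last approved run's scores (e.g. as JSON) gives the comparison a stable baseline between sessions.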
What you built
You can now score prompt changes with automated evals, compare before-and-after results, and gate promotion on quality thresholds.
- Defined a 5-case eval suite with realistic inputs and grounding contexts
- Wrote `score_prompt()`, running `faithfulness` (local NLI) and `toxicity` (Turing) on every response
- Scored a baseline prompt at 60%/80%, then revised to 80%/100% by adding grounding and tone instructions
- Inspected failing cases using `EvalResult.reason` to identify unsupported claims
- Added a promotion gate that exits non-zero when quality is insufficient