# Running Your First Eval

Score LLM outputs for hallucination, toxicity, and custom quality criteria — from local metrics to LLM-as-Judge.

Score LLM responses three ways: fast local metrics (zero credentials), FutureAGI Turing evaluation models, and custom LLM-as-Judge criteria — all through a single `evaluate()` function.
| Time | Difficulty | Package |
|---|---|---|
| 10 min | Beginner | ai-evaluation |
- A FutureAGI account → app.futureagi.com
- API keys: `FI_API_KEY` and `FI_SECRET_KEY` (see Get your API keys)
- Python 3.9+
## Install

```shell
pip install 'ai-evaluation[nli]'
```
The `[nli]` extra installs the local NLI model used by `faithfulness` and `contradiction_detection`. Without it, these metrics fall back to a less accurate word-overlap heuristic.
## Tutorial

### Run a local metric (no API key)
Local metrics run entirely on your machine — no network call, no API key, instant.
```python
from fi.evals import evaluate

result = evaluate("contains", output="Your order has shipped!", keyword="shipped")
print(result.score)   # 1.0
print(result.passed)  # True
print(result.reason)  # "Keyword 'shipped' found"
```

Try a few more:
```python
from fi.evals import evaluate

evaluate("equals", output="Paris", expected_output="Paris").passed           # True
evaluate("is_json", output='{"status": "ok"}').passed                        # True
evaluate("length_less_than", output="Short reply.", max_length=100).passed   # True

result = evaluate("levenshtein_similarity", output="colour", expected_output="color")
print(result.score)  # similarity score between 0 and 1
```

Tip
Use local metrics in unit tests and CI pipelines. Full metric reference: future-agi/ai-evaluation.
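If you are curious what a metric like `levenshtein_similarity` computes, here is a rough stdlib-only sketch (an illustration of normalized edit distance, not the library's actual implementation): the edit distance is divided by the longer string's length so the score lands in [0, 1].

```python
def levenshtein(a: str, b: str) -> int:
    # Classic dynamic-programming edit distance: the minimum number of
    # insertions, deletions, and substitutions to turn a into b.
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(
                prev[j] + 1,                # deletion
                curr[j - 1] + 1,            # insertion
                prev[j - 1] + (ca != cb),   # substitution
            ))
        prev = curr
    return prev[-1]

def similarity(a: str, b: str) -> float:
    # Normalize by the longer string so identical strings score 1.0.
    if not a and not b:
        return 1.0
    return 1 - levenshtein(a, b) / max(len(a), len(b))

print(similarity("colour", "color"))  # 0.833…: one edit over six characters
```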
### Detect contradictions with local NLI
The NLI model runs locally — no API key required.
```python
from fi.evals import evaluate

# Supported response
result = evaluate(
    "contradiction_detection",
    output="The Eiffel Tower is located in Paris, France.",
    context="The Eiffel Tower is a wrought-iron lattice tower located in Paris.",
)
print(f"Score: {result.score:.2f}")
print(f"Passed: {result.passed}")

# Contradictory response
result = evaluate(
    "contradiction_detection",
    output="The Eiffel Tower is located in London, England.",
    context="The Eiffel Tower is a wrought-iron lattice tower located in Paris.",
)
print(f"Score: {result.score:.2f}")
print(f"Passed: {result.passed}")
print(f"Why: {result.reason}")
```

Note
For highest accuracy, install the NLI extra: `pip install 'ai-evaluation[nli]'`. Without it, a simpler fallback runs.
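To see why the NLI model is worth installing, here is a minimal sketch of what a word-overlap fallback looks like (an illustration of the general technique, not the library's actual code). Because it only scores lexical overlap, a sentence that swaps Paris for London still shares most of its words with the context, so the contradiction barely moves the score:

```python
def word_overlap(output: str, context: str) -> float:
    # Fraction of the output's words that also appear in the context.
    out_words = set(output.lower().split())
    ctx_words = set(context.lower().split())
    if not out_words:
        return 0.0
    return len(out_words & ctx_words) / len(out_words)

context = "the eiffel tower is a wrought-iron lattice tower located in paris"
supported = "the eiffel tower is located in paris"
contradictory = "the eiffel tower is located in london"

print(word_overlap(supported, context))       # 1.0: every word appears in the context
print(word_overlap(contradictory, context))   # still high: overlap cannot see the swapped city
```

An NLI model, by contrast, classifies the entailment relationship between the two sentences and catches the swapped entity.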
### Score with FutureAGI Turing models
For quality, tone, safety, and semantic evaluations, use FutureAGI’s purpose-built Turing evaluation models.
```shell
export FI_API_KEY="your-api-key"
export FI_SECRET_KEY="your-secret-key"
```

```python
from fi.evals import evaluate

# Toxicity check
result = evaluate(
    "toxicity",
    output="You're amazing, keep it up!",
    model="turing_small",
)
print(f"Toxicity score: {result.score}")
print(f"Passed: {result.passed}")

# Try a problematic response
result = evaluate(
    "toxicity",
    output="I hate you and everything you stand for.",
    model="turing_small",
)
print(f"Score: {result.score}")
print(f"Why: {result.reason}")
```
| Model | Latency / accuracy | Modalities | Best for |
|---|---|---|---|
| `turing_flash` | Lowest latency | Text, Image | High-volume pipelines |
| `turing_small` | Balanced | Text, Image | Recommended default |
| `turing_large` | Highest accuracy | Text, Image, Audio, PDF | Multi-modal evaluation |
Explore all 72+ built-in eval metrics: `tone`, `context_adherence`, `completeness`, `groundedness`, `data_privacy`, `bias_detection`, `instruction_adherence`, and more.
### Run multiple metrics at once

Pass a list of metric names to run several evals in one call. The call returns a `BatchResult` you can iterate.
```python
from fi.evals import evaluate

results = evaluate(
    ["toxicity", "groundedness"],
    output="The Eiffel Tower is located in Paris, France.",
    context="The Eiffel Tower is a wrought-iron lattice tower on the Champ de Mars in Paris.",
    input="Where is the Eiffel Tower?",
    model="turing_small",
)

for result in results:
    status = "PASS" if result.passed else "FAIL"
    print(f"{result.eval_name:<20} score={result.score} {status}")
    print(f"  Reason: {result.reason}\n")
```

Note
Different metrics require different input keys — `toxicity` only needs `output`, while `groundedness` needs `output` + `context`. When you pass all keys together, each metric picks what it needs and ignores the rest. See the built-in metrics reference for required keys per metric.
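The "picks what it needs" behavior can be pictured as simple keyword filtering. A minimal sketch (the `REQUIRED_KEYS` table and `select_inputs` helper below are hypothetical illustrations, not the library's internals):

```python
# Hypothetical table of required keys per metric, for illustration only.
REQUIRED_KEYS = {
    "toxicity": {"output"},
    "groundedness": {"output", "context"},
}

def select_inputs(metric: str, **kwargs):
    # Keep only the keys this metric declares and ignore the rest,
    # so one set of kwargs can serve several metrics at once.
    wanted = REQUIRED_KEYS[metric]
    return {k: v for k, v in kwargs.items() if k in wanted}

inputs = {
    "output": "The Eiffel Tower is located in Paris, France.",
    "context": "The Eiffel Tower is a wrought-iron lattice tower in Paris.",
    "input": "Where is the Eiffel Tower?",
}
print(select_inputs("toxicity", **inputs))      # only 'output' survives
print(select_inputs("groundedness", **inputs))  # 'output' and 'context'
```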
### Write your own evaluation criteria (LLM-as-Judge)
When no built-in metric fits, describe your quality bar in plain English and use any LLM as the judge.
```shell
export GOOGLE_API_KEY="your-google-api-key"
# or: OPENAI_API_KEY, ANTHROPIC_API_KEY (any LiteLLM-supported provider)
```

```python
from fi.evals import evaluate

result = evaluate(
    prompt="""You are evaluating a customer support response.
Score 1.0 if the response:
- Acknowledges the customer's issue clearly
- Offers a concrete next step or resolution
- Stays professional and empathetic
Score 0.5 if it's polite but vague (no clear next step).
Score 0.0 if it's dismissive, rude, or unhelpful.""",
    output="I understand your frustration with the delayed shipment. I've escalated this to our logistics team and you'll receive a status update within 2 hours.",
    input="My order is 3 weeks late and nobody is responding to my emails.",
    engine="llm",
    model="gemini/gemini-2.5-flash",
)
print(f"Score: {result.score}")
print(f"Why: {result.reason}")
```

Any LiteLLM model string works: `gpt-4o`, `claude-sonnet-4-20250514`, `ollama/llama3.2:3b`.
### Run evaluations on a dataset from the dashboard
- Go to app.futureagi.com → Dataset
- Use Add Dataset (quick path: upload a CSV)
- Click Evaluate → select a metric → Add & Run
- Scores appear as a new column alongside your data
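No CSV handy? A small one can be generated locally and uploaded in step 2. A minimal sketch using the stdlib (the column names here are just an example — use whatever fields your evals need):

```python
import csv

# Two example rows with the input/output/context fields used earlier in this tutorial.
rows = [
    {"input": "Where is the Eiffel Tower?",
     "output": "The Eiffel Tower is located in Paris, France.",
     "context": "The Eiffel Tower is a wrought-iron lattice tower in Paris."},
    {"input": "What is the capital of Japan?",
     "output": "The capital of Japan is Tokyo.",
     "context": "Tokyo is the capital and largest city of Japan."},
]

with open("sample_dataset.csv", "w", newline="") as f:
    writer = csv.DictWriter(f, fieldnames=["input", "output", "context"])
    writer.writeheader()
    writer.writerows(rows)

print("Wrote sample_dataset.csv with", len(rows), "rows")
```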
Tip
No sample data? Create rows quickly with Generate Synthetic Data.
## What you built
You can now score any LLM output using local metrics, Turing models, batch evaluation, custom LLM-as-Judge criteria, and the FutureAGI dashboard.
- Local string and similarity metrics in under 1ms with zero credentials
- Contradiction detection using a local NLI model
- Content quality and safety scoring with Turing evaluation models
- Multi-metric batch evaluation with `evaluate([...])`
- Custom evaluation criteria in plain English using LLM-as-Judge
- Dashboard-based evaluation on datasets