Running Your First Eval

Score LLM outputs for hallucination, toxicity, and custom quality criteria — from local metrics to LLM-as-Judge.

📝 TL;DR

Score LLM responses three ways: fast local metrics (zero credentials), FutureAGI Turing evaluation models, and custom LLM-as-Judge criteria — all using a single evaluate() function.

Time: 10 min · Difficulty: Beginner · Package: ai-evaluation
Prerequisites

Install

pip install 'ai-evaluation[nli]'

The [nli] extra installs the local NLI model used by faithfulness and contradiction_detection. Without it, these metrics fall back to a less accurate word-overlap heuristic.

Tutorial

Run a local metric (no API key)

Local metrics run entirely on your machine — no network call, no API key, instant.

from fi.evals import evaluate

result = evaluate("contains", output="Your order has shipped!", keyword="shipped")

print(result.score)   # 1.0
print(result.passed)  # True
print(result.reason)  # "Keyword 'shipped' found"

Try a few more:

from fi.evals import evaluate

evaluate("equals", output="Paris", expected_output="Paris").passed           # True
evaluate("is_json", output='{"status": "ok"}').passed                        # True
evaluate("length_less_than", output="Short reply.", max_length=100).passed   # True

result = evaluate("levenshtein_similarity", output="colour", expected_output="color")
print(result.score)  # similarity score between 0 and 1
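Similarity scores like this are typically a normalized edit distance. Here is a minimal self-contained sketch, assuming the common 1 - distance / max(len) normalization (the library's exact formula may differ):

```python
def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance between two strings."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution
        prev = curr
    return prev[-1]

def similarity(a: str, b: str) -> float:
    """Normalize edit distance into a 0-1 score (1.0 = identical)."""
    if not a and not b:
        return 1.0
    return 1 - levenshtein(a, b) / max(len(a), len(b))

print(round(similarity("colour", "color"), 3))  # 0.833
```

"colour" and "color" differ by one edit over a longest length of 6, so the score lands around 0.83 under this normalization.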

Tip

Use local metrics in unit tests and CI pipelines. Full metric reference: future-agi/ai-evaluation.

Detect contradictions with local NLI

The NLI model runs locally — no API key required.

from fi.evals import evaluate

# Supported response
result = evaluate(
    "contradiction_detection",
    output="The Eiffel Tower is located in Paris, France.",
    context="The Eiffel Tower is a wrought-iron lattice tower located in Paris.",
)
print(f"Score: {result.score:.2f}")
print(f"Passed: {result.passed}")

# Contradictory response
result = evaluate(
    "contradiction_detection",
    output="The Eiffel Tower is located in London, England.",
    context="The Eiffel Tower is a wrought-iron lattice tower located in Paris.",
)
print(f"Score: {result.score:.2f}")
print(f"Passed: {result.passed}")
print(f"Why: {result.reason}")

Note

For highest accuracy, install the NLI extra: pip install 'ai-evaluation[nli]'. Without it, a simpler fallback runs.
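To see why the NLI extra matters, consider what a pure word-overlap heuristic can measure. This sketch is illustrative only (it is not the library's actual fallback): both the supported and the contradictory responses overlap heavily with the context, so bag-of-words similarity cannot catch the Paris/London contradiction.

```python
import re

def word_overlap(output: str, context: str) -> float:
    """Fraction of the output's words that also appear in the context."""
    out_words = set(re.findall(r"[a-z]+", output.lower()))
    ctx_words = set(re.findall(r"[a-z]+", context.lower()))
    if not out_words:
        return 0.0
    return len(out_words & ctx_words) / len(out_words)

context = "The Eiffel Tower is a wrought-iron lattice tower located in Paris."

# Supported and contradictory responses score almost identically:
print(word_overlap("The Eiffel Tower is located in Paris, France.", context))    # 0.875
print(word_overlap("The Eiffel Tower is located in London, England.", context))  # 0.75
```

A 0.75 overlap for a flatly contradictory sentence is why the fallback is described as less accurate: an NLI model classifies the entailment relation between the two sentences instead of counting shared words.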

Score with FutureAGI Turing models

For quality, tone, safety, and semantic evaluations, use FutureAGI’s purpose-built Turing evaluation models.

export FI_API_KEY="your-api-key"
export FI_SECRET_KEY="your-secret-key"

from fi.evals import evaluate

# Toxicity check
result = evaluate(
    "toxicity",
    output="You're amazing, keep it up!",
    model="turing_small",
)
print(f"Toxicity score: {result.score}")
print(f"Passed: {result.passed}")

# Try a problematic response
result = evaluate(
    "toxicity",
    output="I hate you and everything you stand for.",
    model="turing_small",
)
print(f"Score: {result.score}")
print(f"Why: {result.reason}")

Model          Latency           Modalities                Best for
turing_flash   Lowest            Text, Image               High-volume pipelines
turing_small   Balanced          Text, Image               Recommended default
turing_large   Highest accuracy  Text, Image, Audio, PDF   Multi-modal evaluation

Explore all 72+ built-in eval metrics: tone, context_adherence, completeness, groundedness, data_privacy, bias_detection, instruction_adherence, and more.

Run multiple metrics at once

Pass a list of metric names to run several evals in one call. Returns a BatchResult you can iterate.

from fi.evals import evaluate

results = evaluate(
    ["toxicity", "groundedness"],
    output="The Eiffel Tower is located in Paris, France.",
    context="The Eiffel Tower is a wrought-iron lattice tower on the Champ de Mars in Paris.",
    input="Where is the Eiffel Tower?",
    model="turing_small",
)

for result in results:
    status = "PASS" if result.passed else "FAIL"
    print(f"{result.eval_name:<20} score={result.score}  {status}")
    print(f"  Reason: {result.reason}\n")

Note

Different metrics require different input keys — toxicity only needs output, while groundedness needs output + context. When you pass all keys together, each metric picks what it needs and ignores the rest. See the built-in metrics reference for required keys per metric.
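The key-picking behavior described above can be sketched as a simple filter. The required-key table below is hypothetical; it mirrors the metrics used in this tutorial, not the library's internals:

```python
# Hypothetical required-key table, mirroring the note above:
# toxicity needs only output, groundedness needs output + context.
REQUIRED_KEYS = {
    "toxicity": {"output"},
    "groundedness": {"output", "context"},
}

def inputs_for(metric: str, **kwargs) -> dict:
    """Keep only the keys a given metric declares; ignore the rest."""
    wanted = REQUIRED_KEYS[metric]
    return {k: v for k, v in kwargs.items() if k in wanted}

shared = dict(
    output="The Eiffel Tower is located in Paris, France.",
    context="The Eiffel Tower is a wrought-iron lattice tower in Paris.",
    input="Where is the Eiffel Tower?",
)
print(sorted(inputs_for("toxicity", **shared)))      # ['output']
print(sorted(inputs_for("groundedness", **shared)))  # ['context', 'output']
```

This is why the batch call above can pass output, context, and input together without toxicity complaining about keys it never uses.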

Write your own evaluation criteria (LLM-as-Judge)

When no built-in metric fits, describe your quality bar in plain English and use any LLM as the judge.

export GOOGLE_API_KEY="your-google-api-key"
# or: OPENAI_API_KEY, ANTHROPIC_API_KEY (any LiteLLM-supported provider)

from fi.evals import evaluate

result = evaluate(
    prompt="""You are evaluating a customer support response.

    Score 1.0 if the response:
    - Acknowledges the customer's issue clearly
    - Offers a concrete next step or resolution
    - Stays professional and empathetic

    Score 0.5 if it's polite but vague (no clear next step).
    Score 0.0 if it's dismissive, rude, or unhelpful.""",
    output="I understand your frustration with the delayed shipment. I've escalated this to our logistics team and you'll receive a status update within 2 hours.",
    input="My order is 3 weeks late and nobody is responding to my emails.",
    engine="llm",
    model="gemini/gemini-2.5-flash",
)

print(f"Score: {result.score}")
print(f"Why: {result.reason}")

Any LiteLLM model string works: gpt-4o, claude-sonnet-4-20250514, ollama/llama3.2:3b.

Run evaluations on a dataset from the dashboard

  1. Go to app.futureagi.com → Dataset
  2. Use Add Dataset (quick path: upload a CSV)
  3. Click Evaluate → select a metric → Add & Run
  4. Scores appear as a new column alongside your data

Tip

No sample data? Create rows quickly with Generate Synthetic Data.

What you built

You can now score any LLM output using local metrics, Turing models, batch evaluation, custom LLM-as-Judge criteria, and the FutureAGI dashboard.

  • Local string and similarity metrics in under 1ms with zero credentials
  • Contradiction detection using a local NLI model
  • Content quality and safety scoring with Turing evaluation models
  • Multi-metric batch evaluation with evaluate([...])
  • Custom evaluation criteria in plain English using LLM-as-Judge
  • Dashboard-based evaluation on datasets
