Running Your First Eval

Score LLM outputs for hallucination, toxicity, and custom quality criteria — from local metrics to LLM-as-Judge.

📝 TL;DR

Score LLM responses three ways: fast local metrics (zero credentials), FutureAGI Turing evaluation models, and custom LLM-as-Judge criteria — all using a single evaluate() function.

Time: 10 min · Difficulty: Beginner · Package: ai-evaluation
Prerequisites

Install

pip install 'ai-evaluation[nli]'

The [nli] extra installs the local NLI model used by faithfulness and contradiction_detection. Without it, these metrics fall back to a less accurate word-overlap heuristic.

Tutorial

Run a local metric (no API key)

Local metrics run entirely on your machine — no network call, no API key, instant.

from fi.evals import evaluate

result = evaluate("contains", output="Your order has shipped!", keyword="shipped")

print(result.score)   # 1.0
print(result.passed)  # True
print(result.reason)  # "Keyword 'shipped' found"

Try a few more:

from fi.evals import evaluate

evaluate("equals", output="Paris", expected_output="Paris").passed           # True
evaluate("is_json", output='{"status": "ok"}').passed                        # True
evaluate("length_less_than", output="Short reply.", max_length=100).passed   # True

result = evaluate("levenshtein_similarity", output="colour", expected_output="color")
print(result.score)  # similarity score between 0 and 1
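Similarity scores like this are typically a normalized edit distance. Here is a minimal self-contained sketch, assuming the common 1 - distance / max(len) normalization (the library's exact formula may differ):

```python
def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance between two strings."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution
        prev = curr
    return prev[-1]

def similarity(a: str, b: str) -> float:
    """Normalize edit distance into a 0-1 score (1.0 = identical)."""
    if not a and not b:
        return 1.0
    return 1 - levenshtein(a, b) / max(len(a), len(b))

print(round(similarity("colour", "color"), 3))  # 0.833
```

"colour" and "color" differ by one edit over a longest length of 6, so the score lands around 0.83 under this normalization.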

Tip

Use local metrics in unit tests and CI pipelines. Full metric reference: future-agi/ai-evaluation.

Detect contradictions with local NLI

The NLI model runs locally — no API key required.

from fi.evals import evaluate

# Supported response
result = evaluate(
    "contradiction_detection",
    output="The Eiffel Tower is located in Paris, France.",
    context="The Eiffel Tower is a wrought-iron lattice tower located in Paris.",
)
print(f"Score: {result.score:.2f}")
print(f"Passed: {result.passed}")

# Contradictory response
result = evaluate(
    "contradiction_detection",
    output="The Eiffel Tower is located in London, England.",
    context="The Eiffel Tower is a wrought-iron lattice tower located in Paris.",
)
print(f"Score: {result.score:.2f}")
print(f"Passed: {result.passed}")
print(f"Why: {result.reason}")

Note

For highest accuracy, install the NLI extra: pip install 'ai-evaluation[nli]'. Without it, a simpler fallback runs.
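To see why the NLI extra matters, consider what a pure word-overlap heuristic can measure. This sketch is illustrative only (it is not the library's actual fallback): both the supported and the contradictory responses overlap heavily with the context, so bag-of-words similarity cannot catch the Paris/London contradiction.

```python
import re

def word_overlap(output: str, context: str) -> float:
    """Fraction of the output's words that also appear in the context."""
    out_words = set(re.findall(r"[a-z]+", output.lower()))
    ctx_words = set(re.findall(r"[a-z]+", context.lower()))
    if not out_words:
        return 0.0
    return len(out_words & ctx_words) / len(out_words)

context = "The Eiffel Tower is a wrought-iron lattice tower located in Paris."

# Supported and contradictory responses score almost identically:
print(word_overlap("The Eiffel Tower is located in Paris, France.", context))    # 0.875
print(word_overlap("The Eiffel Tower is located in London, England.", context))  # 0.75
```

A 0.75 overlap for a flatly contradictory sentence is why the fallback is described as less accurate: an NLI model classifies the entailment relation between the two sentences instead of counting shared words.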

Score with FutureAGI Turing models

For quality, tone, safety, and semantic evaluations, use FutureAGI’s purpose-built Turing evaluation models.

export FI_API_KEY="your-api-key"
export FI_SECRET_KEY="your-secret-key"

from fi.evals import evaluate

# Toxicity check
result = evaluate(
    "toxicity",
    output="You're amazing, keep it up!",
    model="turing_small",
)
print(f"Toxicity score: {result.score}")
print(f"Passed: {result.passed}")

# Try a problematic response
result = evaluate(
    "toxicity",
    output="I hate you and everything you stand for.",
    model="turing_small",
)
print(f"Score: {result.score}")
print(f"Why: {result.reason}")

Model          Latency           Modalities                Best for
turing_flash   Lowest            Text, Image               High-volume pipelines
turing_small   Balanced          Text, Image               Recommended default
turing_large   Highest accuracy  Text, Image, Audio, PDF   Multi-modal evaluation

Explore all 72+ built-in eval metrics: tone, context_adherence, completeness, groundedness, data_privacy, bias_detection, instruction_adherence, and more.

Run multiple metrics at once

Pass a list of metric names to run several evals in one call. Returns a BatchResult you can iterate.

from fi.evals import evaluate

results = evaluate(
    ["toxicity", "groundedness"],
    output="The Eiffel Tower is located in Paris, France.",
    context="The Eiffel Tower is a wrought-iron lattice tower on the Champ de Mars in Paris.",
    input="Where is the Eiffel Tower?",
    model="turing_small",
)

for result in results:
    status = "PASS" if result.passed else "FAIL"
    print(f"{result.eval_name:<20} score={result.score}  {status}")
    print(f"  Reason: {result.reason}\n")

Note

Different metrics require different input keys — toxicity only needs output, while groundedness needs output + context. When you pass all keys together, each metric picks what it needs and ignores the rest. See the built-in metrics reference for required keys per metric.
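The key-picking behavior described above can be sketched as a simple filter. The required-key table below is hypothetical; it mirrors the metrics used in this tutorial, not the library's internals:

```python
# Hypothetical required-key table, mirroring the note above:
# toxicity needs only output, groundedness needs output + context.
REQUIRED_KEYS = {
    "toxicity": {"output"},
    "groundedness": {"output", "context"},
}

def inputs_for(metric: str, **kwargs) -> dict:
    """Keep only the keys a given metric declares; ignore the rest."""
    wanted = REQUIRED_KEYS[metric]
    return {k: v for k, v in kwargs.items() if k in wanted}

shared = dict(
    output="The Eiffel Tower is located in Paris, France.",
    context="The Eiffel Tower is a wrought-iron lattice tower in Paris.",
    input="Where is the Eiffel Tower?",
)
print(sorted(inputs_for("toxicity", **shared)))      # ['output']
print(sorted(inputs_for("groundedness", **shared)))  # ['context', 'output']
```

This is why the batch call above can pass output, context, and input together without toxicity complaining about keys it never uses.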

Write your own evaluation criteria (LLM-as-Judge)

When no built-in metric fits, describe your quality bar in plain English and use any LLM as the judge.

export GOOGLE_API_KEY="your-google-api-key"
# or: OPENAI_API_KEY, ANTHROPIC_API_KEY (any LiteLLM-supported provider)

from fi.evals import evaluate

result = evaluate(
    prompt="""You are evaluating a customer support response.

    Score 1.0 if the response:
    - Acknowledges the customer's issue clearly
    - Offers a concrete next step or resolution
    - Stays professional and empathetic

    Score 0.5 if it's polite but vague (no clear next step).
    Score 0.0 if it's dismissive, rude, or unhelpful.""",
    output="I understand your frustration with the delayed shipment. I've escalated this to our logistics team and you'll receive a status update within 2 hours.",
    input="My order is 3 weeks late and nobody is responding to my emails.",
    engine="llm",
    model="gemini/gemini-2.5-flash",
)

print(f"Score: {result.score}")
print(f"Why: {result.reason}")

Any LiteLLM model string works: gpt-4o, claude-sonnet-4-20250514, ollama/llama3.2:3b.

Run evaluations on a dataset from the dashboard

  1. Go to app.futureagi.com → Dataset
  2. Use Add Dataset (quick path: upload a CSV)
  3. Click Evaluate → select a metric → Add & Run
  4. Scores appear as a new column alongside your data

Tip

No sample data? Create rows quickly with Generate Synthetic Data.

What you built

You can now score any LLM output using local metrics, Turing models, batch evaluation, custom LLM-as-Judge criteria, and the FutureAGI dashboard.

  • Local string and similarity metrics in under 1ms with zero credentials
  • Contradiction detection using a local NLI model
  • Content quality and safety scoring with Turing evaluation models
  • Multi-metric batch evaluation with evaluate([...])
  • Custom evaluation criteria in plain English using LLM-as-Judge
  • Dashboard-based evaluation on datasets
