Evaluations

Evaluate LLM outputs with 76+ local metrics, cloud Turing models, or custom LLM-as-Judge criteria. Part of the ai-evaluation Python package.

📝 TL;DR
  • One function, three engines: local heuristics (<1ms), cloud Turing (~1-3s), or LLM-as-Judge (~2-5s)
  • pip install ai-evaluation — 76+ local metrics work without an API key
  • Cloud evals and LLM judges need FI_API_KEY + a model parameter

For the full platform guide on evaluations, see Evaluation docs. The ai-evaluation package gives you a single evaluate() function that routes to the right engine based on the metric you pick and whether you pass a model. Local metrics run in under a millisecond with no API key. Cloud and LLM-as-Judge evals need network access but handle subjective quality judgments that heuristics can’t.

```python
from fi.evals import evaluate

# Local metric — runs instantly, no API key needed
result = evaluate("contains", output="Hello world", keyword="Hello")
print(result.score)    # 1.0
print(result.passed)   # True
print(result.reason)   # "Keyword 'Hello' found"

# Cloud metric — needs a model parameter
result = evaluate("toxicity", output="You're awesome!", model="turing_flash")
print(result.score)    # 1.0
print(result.passed)   # True

# LLM-as-Judge — custom criteria, any LiteLLM model
result = evaluate(
    prompt="Rate helpfulness from 0 to 1",
    output="Here are 3 steps to fix that...",
    engine="llm",
    model="gemini/gemini-2.5-flash",
)
print(result.score)   # 0.9
```
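Every call above returns a result object exposing `score`, `passed`, and `reason`. A common downstream use is gating a CI run on those fields; the sketch below uses a hypothetical `Result` stand-in (the field names match the package's result object, but the class itself is illustrative):

```python
from dataclasses import dataclass


# Hypothetical stand-in for the result object returned by evaluate();
# only the .score / .passed / .reason fields shown above are assumed.
@dataclass
class Result:
    score: float
    passed: bool
    reason: str


def gate(results, threshold=0.8):
    """Fail the run if any eval failed or the mean score dips below threshold."""
    if not all(r.passed for r in results):
        return False
    return sum(r.score for r in results) / len(results) >= threshold


results = [
    Result(1.0, True, "Keyword 'Hello' found"),
    Result(0.9, True, "Helpful and concise"),
]
print(gate(results))  # True
```

The same pattern works regardless of which engine produced the scores, since all three engines return the same result shape.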

How Engine Routing Works

The evaluate() function picks an engine automatically:

| You pass | Engine used | Speed | API key needed? |
| --- | --- | --- | --- |
| Metric name only | Local heuristic | <1ms | No |
| Metric + `model="turing_flash"` | Cloud (Turing) | ~1-3s | Yes |
| `prompt=` + `engine="llm"` + model | LLM-as-Judge | ~2-5s | Model provider key |
| Metric + `model=` + `augment=True` | Local + LLM refinement | ~2-5s | Model provider key |

You can force an engine with engine="local", engine="turing", or engine="llm".
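The routing rules in the table can be sketched as a small dispatcher. This is a hypothetical illustration of the decision order, not the package's actual implementation:

```python
# Hypothetical sketch of the routing rules described above.
def pick_engine(metric=None, prompt=None, model=None, engine=None, augment=False):
    if engine:                      # explicit engine= override wins
        return engine
    if prompt is not None and model:
        return "llm"                # LLM-as-Judge with custom criteria
    if metric and model and augment:
        return "local+llm"          # local score refined by an LLM
    if metric and model:
        return "turing"             # cloud Turing model
    return "local"                  # fast heuristic, no API key


print(pick_engine(metric="contains"))                        # local
print(pick_engine(metric="toxicity", model="turing_flash"))  # turing
print(pick_engine(prompt="Rate helpfulness from 0 to 1",
                  model="gemini/gemini-2.5-flash"))          # llm
```

The key design point is precedence: an explicit `engine=` always wins, and a `prompt` with a `model` is treated as a judge request before any metric-based routing applies.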

Choosing the Right Approach

| You want to… | Use |
| --- | --- |
| Check if output contains a keyword | Local metric: `evaluate("contains", ...)` |
| Detect hallucinations in RAG output | Local metric: `evaluate("faithfulness", ...)` |
| Score tone or toxicity with a pretrained model | Cloud eval: `evaluate("toxicity", model="turing_flash")` |
| Evaluate with your own criteria | LLM-as-Judge: `evaluate(prompt="...", engine="llm")` |
| Evaluate tokens as they stream in | Streaming eval |
| Improve accuracy over time with corrections | Feedback loops |
| Run evals at scale across workers | Distributed evaluator |
| Auto-pick metrics for your app type | AutoEval |
| Block unsafe LLM inputs/outputs | Guardrails |
| Run evals offline, no API key | Local & Hybrid |
| Trace evals with OpenTelemetry | OpenTelemetry |
| Scan AI-generated code for vulnerabilities | Code Security |