Evaluations
Evaluate LLM outputs with 76+ local metrics, cloud Turing models, or custom LLM-as-Judge criteria. Part of the ai-evaluation Python package.
- One function, three engines: local heuristics (<1ms), cloud Turing (~1-3s), or LLM-as-Judge (~2-5s)
- `pip install ai-evaluation`; 76+ local metrics work without an API key
- Cloud evals and LLM judges need `FI_API_KEY` plus a `model` parameter
For the full platform guide on evaluations, see Evaluation docs. The `ai-evaluation` package gives you a single `evaluate()` function that routes to the right engine based on the metric you pick and whether you pass a model. Local metrics run in under a millisecond with no API key. Cloud and LLM-as-Judge evals need network access but handle subjective quality judgments that heuristics can't.
```python
from fi.evals import evaluate

# Local metric — runs instantly, no API key needed
result = evaluate("contains", output="Hello world", keyword="Hello")
print(result.score)   # 1.0
print(result.passed)  # True
print(result.reason)  # "Keyword 'Hello' found"

# Cloud metric — needs a model parameter
result = evaluate("toxicity", output="You're awesome!", model="turing_flash")
print(result.score)   # 1.0
print(result.passed)  # True

# LLM-as-Judge — custom criteria, any LiteLLM model
result = evaluate(
    prompt="Rate helpfulness from 0 to 1",
    output="Here are 3 steps to fix that...",
    engine="llm",
    model="gemini/gemini-2.5-flash",
)
print(result.score)  # 0.9
```
How Engine Routing Works
The `evaluate()` function picks an engine automatically:
| You pass | Engine used | Speed | API key needed? |
|---|---|---|---|
| Metric name only | Local heuristic | <1ms | No |
| Metric + `model="turing_flash"` | Cloud (Turing) | ~1-3s | Yes |
| `prompt=` + `engine="llm"` + `model` | LLM-as-Judge | ~2-5s | Model provider key |
| Metric + `model=` + `augment=True` | Local + LLM refinement | ~2-5s | Model provider key |
You can force an engine with `engine="local"`, `engine="turing"`, or `engine="llm"`.
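The routing rules above can be sketched in plain Python. This is an illustration of the dispatch order only, not the package's actual implementation; the function name `pick_engine` and the `"local+llm"` label are inventions for the sketch:

```python
def pick_engine(metric=None, prompt=None, model=None, engine=None, augment=False):
    """Illustrative sketch of the engine-routing table above (not real library code)."""
    if engine is not None:
        return engine                 # explicit engine= always wins
    if prompt is not None and model is not None:
        return "llm"                  # custom criteria judged by an LLM
    if metric is not None and model is not None:
        # augment=True layers LLM refinement on top of the local heuristic
        return "local+llm" if augment else "turing"
    return "local"                    # metric name only: local heuristic

print(pick_engine(metric="contains"))                        # local
print(pick_engine(metric="toxicity", model="turing_flash"))  # turing
```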
What’s Available
- **Running Evaluations**: full API reference for the core function, covering parameters, return types, engine routing, and batch eval.
- **Metrics Reference**: browse all 76+ local metrics by category: string, JSON, similarity, hallucination, RAG, agents, guardrails.
- **Cloud Evals**: 100+ pre-built Turing templates for tone, toxicity, bias, factual accuracy, and more.
- **LLM-as-Judge**: define custom evaluation criteria and run them with any LiteLLM-supported model.
- **Streaming Eval**: evaluate LLM output token-by-token in real time with early stopping.
- **Feedback Loops**: submit corrections, calibrate thresholds, store feedback in ChromaDB.
- **Distributed Evaluator**: run evals at scale with ThreadPool, Celery, Ray, or Temporal backends.
- **AutoEval**: describe your app, get a tailored eval pipeline. 7 pre-built templates.
- **Guardrails**: 14 guard models, 14 scanners, gateway routing, and session management.
- **Local & Hybrid**: run 72+ metrics locally with zero API calls; Ollama for offline LLM scoring.
- **OpenTelemetry**: trace LLM calls, track costs, attach eval scores to spans.
- **Code Security**: AST-based vulnerability detection for AI-generated code, with 15 detectors and 4 eval modes.
Choosing the Right Approach
| You want to… | Use |
|---|---|
| Check if output contains a keyword | Local metric: `evaluate("contains", ...)` |
| Detect hallucinations in RAG output | Local metric: `evaluate("faithfulness", ...)` |
| Score tone or toxicity with a pretrained model | Cloud eval: `evaluate("toxicity", model="turing_flash")` |
| Evaluate with your own criteria | LLM-as-Judge: `evaluate(prompt="...", engine="llm")` |
| Evaluate tokens as they stream in | Streaming eval |
| Improve accuracy over time with corrections | Feedback loops |
| Run evals at scale across workers | Distributed evaluator |
| Auto-pick metrics for your app type | AutoEval |
| Block unsafe LLM inputs/outputs | Guardrails |
| Run evals offline, no API key | Local & Hybrid |
| Trace evals with OpenTelemetry | OpenTelemetry |
| Scan AI-generated code for vulnerabilities | Code Security |
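As a mental model for what a local heuristic metric computes, here is a minimal, self-contained sketch of a contains-style check that returns the `score`/`passed`/`reason` shape shown in the quickstart. This is a toy illustration under assumed semantics, not the package's implementation; `EvalResult` and `contains_metric` are names invented for the sketch:

```python
from dataclasses import dataclass


@dataclass
class EvalResult:
    score: float
    passed: bool
    reason: str


def contains_metric(output: str, keyword: str, case_sensitive: bool = True) -> EvalResult:
    """Toy 'contains' heuristic: score 1.0 if keyword appears in output, else 0.0."""
    haystack = output if case_sensitive else output.lower()
    needle = keyword if case_sensitive else keyword.lower()
    hit = needle in haystack
    return EvalResult(
        score=1.0 if hit else 0.0,
        passed=hit,
        reason=f"Keyword {keyword!r} {'found' if hit else 'not found'}",
    )


result = contains_metric("Hello world", "Hello")
print(result.score, result.passed)  # 1.0 True
```

A plain substring check like this is why local metrics run in under a millisecond: no network round-trip, no model inference.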