Evaluations

Evaluate LLM outputs with 76+ local metrics, cloud Turing models, or custom LLM-as-Judge criteria. Part of the ai-evaluation Python package.

📝 TL;DR
  • One function, three engines: local heuristics (<1ms), cloud Turing (~1-3s), or LLM-as-Judge (~2-5s)
  • pip install ai-evaluation — 76+ local metrics work without an API key
  • Cloud evals and LLM judges need FI_API_KEY + a model parameter

For the full platform guide on evaluations, see Evaluation docs. The ai-evaluation package gives you a single evaluate() function that routes to the right engine based on the metric you pick and whether you pass a model. Local metrics run in under a millisecond with no API key. Cloud and LLM-as-Judge evals need network access but handle subjective quality judgments that heuristics can’t.

```python
from fi.evals import evaluate

# Local metric — runs instantly, no API key needed
result = evaluate("contains", output="Hello world", keyword="Hello")
print(result.score)    # 1.0
print(result.passed)   # True
print(result.reason)   # "Keyword 'Hello' found"

# Cloud metric — needs a model parameter
result = evaluate("toxicity", output="You're awesome!", model="turing_flash")
print(result.score)    # 1.0
print(result.passed)   # True

# LLM-as-Judge — custom criteria, any LiteLLM model
result = evaluate(
    prompt="Rate helpfulness from 0 to 1",
    output="Here are 3 steps to fix that...",
    engine="llm",
    model="gemini/gemini-2.5-flash",
)
print(result.score)   # 0.9
```
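Every call above returns a result object exposing `score`, `passed`, and `reason`. A common downstream use is gating a CI run on those fields; the sketch below uses a hypothetical `Result` stand-in (the field names match the package's result object, but the class itself is illustrative):

```python
from dataclasses import dataclass


# Hypothetical stand-in for the result object returned by evaluate();
# only the .score / .passed / .reason fields shown above are assumed.
@dataclass
class Result:
    score: float
    passed: bool
    reason: str


def gate(results, threshold=0.8):
    """Fail the run if any eval failed or the mean score dips below threshold."""
    if not all(r.passed for r in results):
        return False
    return sum(r.score for r in results) / len(results) >= threshold


results = [
    Result(1.0, True, "Keyword 'Hello' found"),
    Result(0.9, True, "Helpful and concise"),
]
print(gate(results))  # True
```

The same pattern works regardless of which engine produced the scores, since all three engines return the same result shape.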

How Engine Routing Works

The evaluate() function picks an engine automatically:

| You pass | Engine used | Speed | API key needed? |
| --- | --- | --- | --- |
| Metric name only | Local heuristic | <1ms | No |
| Metric + `model="turing_flash"` | Cloud (Turing) | ~1-3s | Yes |
| `prompt=` + `engine="llm"` + model | LLM-as-Judge | ~2-5s | Model provider key |
| Metric + `model=` + `augment=True` | Local + LLM refinement | ~2-5s | Model provider key |

You can force an engine with engine="local", engine="turing", or engine="llm".
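The routing rules in the table can be sketched as a small dispatcher. This is a hypothetical illustration of the decision order, not the package's actual implementation:

```python
# Hypothetical sketch of the routing rules described above.
def pick_engine(metric=None, prompt=None, model=None, engine=None, augment=False):
    if engine:                      # explicit engine= override wins
        return engine
    if prompt is not None and model:
        return "llm"                # LLM-as-Judge with custom criteria
    if metric and model and augment:
        return "local+llm"          # local score refined by an LLM
    if metric and model:
        return "turing"             # cloud Turing model
    return "local"                  # fast heuristic, no API key


print(pick_engine(metric="contains"))                        # local
print(pick_engine(metric="toxicity", model="turing_flash"))  # turing
print(pick_engine(prompt="Rate helpfulness from 0 to 1",
                  model="gemini/gemini-2.5-flash"))          # llm
```

The key design point is precedence: an explicit `engine=` always wins, and a `prompt` with a `model` is treated as a judge request before any metric-based routing applies.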

Choosing the Right Approach

| You want to… | Use |
| --- | --- |
| Check if output contains a keyword | Local metric: `evaluate("contains", ...)` |
| Detect hallucinations in RAG output | Local metric: `evaluate("faithfulness", ...)` |
| Score tone or toxicity with a pretrained model | Cloud eval: `evaluate("toxicity", model="turing_flash")` |
| Evaluate with your own criteria | LLM-as-Judge: `evaluate(prompt="...", engine="llm")` |
| Evaluate tokens as they stream in | Streaming eval |
| Improve accuracy over time with corrections | Feedback loops |
| Run evals at scale across workers | Distributed evaluator |
| Auto-pick metrics for your app type | AutoEval |
| Block unsafe LLM inputs/outputs | Guardrails |
| Run evals offline, no API key | Local & Hybrid |
| Trace evals with OpenTelemetry | OpenTelemetry |
| Scan AI-generated code for vulnerabilities | Code Security |