Running Evaluations
Run evaluations with the evaluate() function — local heuristics, Turing cloud models, or LLM-as-Judge, with the engine auto-selected based on your inputs.
- `from fi.evals import evaluate` — one function for all eval types
- Returns `EvalResult` with score, passed, reason, and latency
- Pass a list of eval names to batch multiple evals in one call
The evaluate() function is the main entry point for running evaluations. It accepts a metric name (or list), your inputs as keyword arguments, and optionally a model. The engine is selected automatically based on what you pass.
Note
Requires pip install ai-evaluation. Local metrics work without an API key. Cloud and LLM evals need FI_API_KEY and FI_SECRET_KEY.
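As a sketch, setup might look like the following (the key values are placeholders you replace with your own credentials):

```shell
pip install ai-evaluation

# Only needed for cloud (Turing) and LLM-as-Judge evals
export FI_API_KEY="your-api-key"
export FI_SECRET_KEY="your-secret-key"
```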
Quick Examples
Local metric (no API key needed)
from fi.evals import evaluate
result = evaluate("contains", output="Hello world", keyword="Hello")
print(result.eval_name) # "contains"
print(result.score) # 1.0
print(result.passed) # True
print(result.reason) # "Keyword 'Hello' found"
print(result.latency_ms) # 0.73
Cloud eval (needs API key + model)
from fi.evals import evaluate
result = evaluate(
    "toxicity",
    output="You're doing a great job!",
    model="turing_flash",
)
print(result.score) # 1.0
print(result.passed) # True
print(result.reason) # "This evaluation is given as the content fully follows..."
LLM-as-Judge (custom criteria)
from fi.evals import evaluate
result = evaluate(
    prompt="Rate how helpful this response is from 0 to 1. A helpful response directly answers the question with actionable steps.",
    output="Here are 3 steps to fix the issue: 1. Check your config...",
    query="How do I fix the login error?",
    engine="llm",
    model="gemini/gemini-2.5-flash",
)
print(result.score) # 0.9
print(result.reason) # '{"score": 0.9, "reason": "Provides structured steps..."}'
Batch evaluation
from fi.evals import evaluate
results = evaluate(
    ["contains", "one_line", "is_json"],
    output="Hello world",
    keyword="Hello",
)
for r in results:
    print(f"{r.eval_name}: score={r.score}, passed={r.passed}")
# contains: score=1.0, passed=True
# one_line: score=1.0, passed=True
# is_json: score=0.0, passed=False
Warning
Don’t mix local and cloud metrics in the same batch call. If you pass model="turing_flash", only cloud metrics will return results — local metrics will return score=None. Run them separately instead.
Function Signature
def evaluate(
    eval_name: str | list[str] | None = None,
    *,
    prompt: str | None = None,
    engine: str | None = None,
    model: str | None = None,
    augment: bool | None = None,
    config: dict | None = None,
    generate_prompt: bool = False,
    feedback_store: Any | None = None,
    fi_api_key: str | None = None,
    fi_secret_key: str | None = None,
    fi_base_url: str | None = None,
    **inputs,
) -> EvalResult | BatchResult
Parameters
- `eval_name`: Metric name or list of metric names. Use a string for local/cloud metrics (e.g. "toxicity", "contains"). Pass a list for batch evaluation. Set to None when using LLM-as-Judge with a custom prompt.
- `prompt`: Custom evaluation criteria for LLM-as-Judge mode. Use {field_name} placeholders to reference input fields. Requires engine="llm" and a model.
- `engine`: Force a specific engine. Options: "local", "turing", "llm". If omitted, the engine is selected automatically based on the metric and model.
- `model`: Model to use for cloud or LLM evals. For Turing: "turing_flash", "turing_small", "turing_large". For LLM-as-Judge: any LiteLLM model string like "gemini/gemini-2.5-flash", "gpt-4o", "claude-sonnet-4-20250514", "ollama/llama3.2:3b".
- `augment`: Run the local heuristic first, then refine with an LLM. Requires model to be set. Supported on: faithfulness, hallucination_score, task_completion, action_safety, reasoning_quality, claim_support, factual_consistency.
- `config`: Metric-specific configuration. For example, {"rouge_type": "rougeL"} for ROUGE score or {"similarity_method": "cosine"} for embedding similarity.
- `generate_prompt`: Auto-generate grading criteria from a plain-English description. When True, the prompt parameter is treated as a description and a detailed rubric is generated from it. Generated criteria are cached per session.
- `feedback_store`: A feedback store instance for recording corrections and calibrating thresholds. See Feedback Loops.
- `fi_api_key`: Override the FI_API_KEY environment variable for this call.
- `fi_secret_key`: Override the FI_SECRET_KEY environment variable for this call.
- `fi_base_url`: Override the FI_BASE_URL environment variable for this call.
- `**inputs`: The data to evaluate. Common fields:
| Field | Used by |
|---|---|
| `output` | Almost all metrics — the LLM output being evaluated |
| `query` / `input` | Metrics that need the original user query |
| `context` / `contexts` | RAG metrics — the retrieved context (string or list) |
| `expected_output` / `ground_truth` | Similarity and correctness metrics |
| `keyword` | String matching metrics (`contains`, `contains_all`, etc.) |
| `image_url` | Multimodal image evaluation |
| `audio_url` | Audio evaluation |
| `messages` | Conversation evaluation |
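To illustrate how {field_name} placeholders in a judge prompt line up with the input fields, here is a minimal sketch using Python's built-in str.format. The render_prompt helper is hypothetical and only shows the substitution idea; the library's internals may differ.

```python
def render_prompt(prompt: str, **inputs) -> str:
    """Substitute {field_name} placeholders with the matching input fields.

    Hypothetical helper for illustration; not part of the library.
    """
    return prompt.format(**inputs)

rendered = render_prompt(
    "Given the question {query}, rate the answer {output} from 0 to 1.",
    query="How do I fix the login error?",
    output="Check your config file.",
)
print(rendered)
```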
Return Types
EvalResult (single eval)
- `eval_name`: Name of the metric that was run.
- `score`: Score between 0.0 and 1.0. Some metrics return binary 0 or 1.
- `passed`: Whether the evaluation passed based on the metric’s threshold.
- `reason`: Human-readable explanation of the score.
- `latency_ms`: Execution time in milliseconds.
- `status`: "completed" or "error".
- `error`: Error message if status is "error".
- `metadata`: Additional info. Contains output_type (e.g. "score", "Pass/Fail") and engine when augmentation is used (e.g. "local+llm").
BatchResult (multiple evals)
Returned when eval_name is a list. Iterable collection of EvalResult objects.
results = evaluate(["toxicity", "faithfulness"], output="...", model="turing_flash")

# Iterate
for r in results:
    print(r.eval_name, r.score)

# Access by name
toxicity = results.get("toxicity")

# Check overall pass rate
print(results.success_rate)  # 0.0 to 1.0

# Count
print(len(results))  # 2
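The interface used above can be sketched as a small collection class (hypothetical, mirroring the documented behavior rather than the library's actual code):

```python
from collections import namedtuple

class BatchResult:
    """Hypothetical sketch of an iterable collection of eval results."""

    def __init__(self, results):
        self._results = list(results)

    def __iter__(self):
        return iter(self._results)

    def __len__(self):
        return len(self._results)

    def get(self, name):
        """Return the first result whose eval_name matches, else None."""
        return next((r for r in self._results if r.eval_name == name), None)

    @property
    def success_rate(self):
        """Fraction of results that passed, from 0.0 to 1.0."""
        if not self._results:
            return 0.0
        return sum(1 for r in self._results if r.passed) / len(self._results)

# Demo with stand-in result records
R = namedtuple("R", "eval_name score passed")
batch = BatchResult([R("contains", 1.0, True), R("is_json", 0.0, False)])
print(batch.success_rate)  # 0.5
print(len(batch))          # 2
```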
Engine Routing
If you don’t set engine explicitly, the function picks one:
- No model passed → local engine (heuristic metrics, <1ms)
- model="turing_flash" / "turing_small" / "turing_large" → Turing cloud engine
- Any other model string → LLM-as-Judge engine
- augment=True → local first, then LLM refinement
You can check which engine ran via result.metadata["engine"].
Common Patterns
Error handling
result = evaluate("toxicity", output="test", model="turing_flash")
if result.status == "error":
    print(f"Eval failed: {result.error}")
else:
    print(f"Score: {result.score}")
Augmented evaluation (local + LLM)
result = evaluate(
    "faithfulness",
    output="Paris is the capital of France.",
    context="France is a European country with Paris as its capital.",
    model="gemini/gemini-2.5-flash",
    augment=True,
)
print(result.metadata["engine"]) # "local+llm"
Auto-generated grading criteria
result = evaluate(
    prompt="Check if the response is empathetic and acknowledges the customer's frustration",
    output="I understand this is frustrating. Let me help fix that right away.",
    engine="llm",
    model="gpt-4o",
    generate_prompt=True,
)
# The prompt is expanded into a detailed rubric automatically
Environment Variables
| Variable | Required for | Default |
|---|---|---|
| `FI_API_KEY` | Cloud (Turing) evals | — |
| `FI_SECRET_KEY` | Cloud (Turing) evals | — |
| `FI_BASE_URL` | Custom API endpoint | https://api.futureagi.com |
| `GOOGLE_API_KEY` | Gemini models (LLM judge) | — |
| `OPENAI_API_KEY` | OpenAI models (LLM judge) | — |
| `ANTHROPIC_API_KEY` | Claude models (LLM judge) | — |