Running Evaluations

Run evaluations with the evaluate() function — local heuristics, cloud Turing models, or LLM-as-Judge, auto-routed based on your inputs.

📝 TL;DR
  • from fi.evals import evaluate — one function for all eval types
  • Returns EvalResult with score, passed, reason, and latency
  • Pass a list of eval names to batch multiple evals in one call

The evaluate() function is the main entry point for running evaluations. It accepts a metric name (or list), your inputs as keyword arguments, and optionally a model. The engine is selected automatically based on what you pass.

Note

Requires pip install ai-evaluation. Local metrics work without an API key. Cloud and LLM evals need FI_API_KEY and FI_SECRET_KEY.

Quick Examples

Local metric (no API key needed)

from fi.evals import evaluate

result = evaluate("contains", output="Hello world", keyword="Hello")

print(result.eval_name)   # "contains"
print(result.score)       # 1.0
print(result.passed)      # True
print(result.reason)      # "Keyword 'Hello' found"
print(result.latency_ms)  # 0.73

Cloud eval (needs API key + model)

from fi.evals import evaluate

result = evaluate(
    "toxicity",
    output="You're doing a great job!",
    model="turing_flash",
)

print(result.score)   # 1.0
print(result.passed)  # True
print(result.reason)  # "This evaluation is given as the content fully follows..."

LLM-as-Judge (custom criteria)

from fi.evals import evaluate

result = evaluate(
    prompt="Rate how helpful this response is from 0 to 1. A helpful response directly answers the question with actionable steps.",
    output="Here are 3 steps to fix the issue: 1. Check your config...",
    query="How do I fix the login error?",
    engine="llm",
    model="gemini/gemini-2.5-flash",
)

print(result.score)   # 0.9
print(result.reason)  # '{"score": 0.9, "reason": "Provides structured steps..."}'

Batch evaluation

from fi.evals import evaluate

results = evaluate(
    ["contains", "one_line", "is_json"],
    output="Hello world",
    keyword="Hello",
)

for r in results:
    print(f"{r.eval_name}: score={r.score}, passed={r.passed}")
# contains: score=1.0, passed=True
# one_line: score=1.0, passed=True
# is_json: score=0.0, passed=False

Warning

Don’t mix local and cloud metrics in the same batch call. If you pass model="turing_flash", only cloud metrics will return results — local metrics will return score=None. Run them separately instead.
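If you need both kinds, one option is to partition the metric list yourself and call evaluate() twice. A minimal sketch of the partitioning step, with the local-metric set assumed from the examples on this page (check the metrics reference for the full list):

```python
# Illustrative helper: split a metric list so local and cloud evals
# can run in separate evaluate() calls. The LOCAL_METRICS set below
# is an assumption based on the examples in this page.
LOCAL_METRICS = {"contains", "contains_all", "one_line", "is_json"}

def split_batch(names):
    """Return (local_metrics, cloud_metrics) from a mixed list."""
    local = [n for n in names if n in LOCAL_METRICS]
    cloud = [n for n in names if n not in LOCAL_METRICS]
    return local, cloud
```

You would then pass the first list to evaluate() without a model, and the second with model="turing_flash".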

Function Signature

def evaluate(
    eval_name: str | list[str] | None = None,
    *,
    prompt: str | None = None,
    engine: str | None = None,
    model: str | None = None,
    augment: bool | None = None,
    config: dict | None = None,
    generate_prompt: bool = False,
    feedback_store: Any | None = None,
    fi_api_key: str | None = None,
    fi_secret_key: str | None = None,
    fi_base_url: str | None = None,
    **inputs,
) -> EvalResult | BatchResult

Parameters

eval_name str | list[str] | None Optional

Metric name or list of metric names. Use a string for local/cloud metrics (e.g. "toxicity", "contains"). Pass a list for batch evaluation. Set to None when using LLM-as-Judge with a custom prompt.

prompt str | None Optional

Custom evaluation criteria for LLM-as-Judge mode. Use {field_name} placeholders to reference input fields. Requires engine="llm" and a model.
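The placeholders presumably resolve against your keyword inputs the same way Python's str.format does. A minimal illustration of the substitution itself (not an SDK call):

```python
# Sketch of how {field_name} placeholders map onto the **inputs you
# pass to evaluate(). The SDK performs this internally; this is only
# an illustration of the templating behavior.
prompt = "Is the answer '{output}' relevant to the question '{query}'?"
inputs = {"output": "Paris", "query": "What is the capital of France?"}

rendered = prompt.format(**inputs)
```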

engine str | None Optional

Force a specific engine. Options: "local", "turing", "llm". If omitted, the engine is selected automatically based on the metric and model.

model str | None Optional

Model to use for cloud or LLM evals. For Turing: "turing_flash", "turing_small", "turing_large". For LLM-as-Judge: any LiteLLM model string like "gemini/gemini-2.5-flash", "gpt-4o", "claude-sonnet-4-20250514", "ollama/llama3.2:3b".

augment bool | None Optional

Run local heuristic first, then refine with an LLM. Requires model to be set. Supported on: faithfulness, hallucination_score, task_completion, action_safety, reasoning_quality, claim_support, factual_consistency.

config dict | None Optional

Metric-specific configuration. For example, {"rouge_type": "rougeL"} for ROUGE score or {"similarity_method": "cosine"} for embedding similarity.

generate_prompt bool Optional Defaults to False

Auto-generate grading criteria from a plain English description. When True, the prompt parameter is treated as a description and a detailed rubric is generated from it. Generated criteria are cached per session.

feedback_store Any | None Optional

A feedback store instance for recording corrections and calibrating thresholds. See Feedback Loops.

fi_api_key str | None Optional

Override the FI_API_KEY environment variable for this call.

fi_secret_key str | None Optional

Override the FI_SECRET_KEY environment variable for this call.

fi_base_url str | None Optional

Override the FI_BASE_URL environment variable for this call.

**inputs keyword arguments Optional

The data to evaluate. Common fields:

| Field | Used by |
| --- | --- |
| output | Almost all metrics — the LLM output being evaluated |
| query / input | Metrics that need the original user query |
| context / contexts | RAG metrics — the retrieved context (string or list) |
| expected_output / ground_truth | Similarity and correctness metrics |
| keyword | String matching metrics (contains, contains_all, etc.) |
| image_url | Multimodal image evaluation |
| audio_url | Audio evaluation |
| messages | Conversation evaluation |

Return Types

EvalResult (single eval)

eval_name str Required

Name of the metric that was run.

score float | None

Score between 0.0 and 1.0. Some metrics return binary 0 or 1.

passed bool | None

Whether the evaluation passed based on the metric’s threshold.

reason str

Human-readable explanation of the score.

latency_ms float

Execution time in milliseconds.

status str

"completed" or "error".

error str | None

Error message if status is "error".

metadata dict

Additional info. Contains output_type (e.g. "score", "Pass/Fail") and engine when augmentation is used (e.g. "local+llm").

BatchResult (multiple evals)

Returned when eval_name is a list. Iterable collection of EvalResult objects.

results = evaluate(["toxicity", "faithfulness"], output="...", model="turing_flash")

# Iterate
for r in results:
    print(r.eval_name, r.score)

# Access by name
toxicity = results.get("toxicity")

# Check overall pass rate
print(results.success_rate)  # 0.0 to 1.0

# Count
print(len(results))  # 2

Engine Routing

If you don’t set engine explicitly, the function picks one:

  1. No model passed → local engine (heuristic metrics, <1ms)
  2. model="turing_flash" / "turing_small" / "turing_large" → Turing cloud engine
  3. Any other model string → LLM-as-Judge engine
  4. augment=True → local first, then LLM refinement

You can check which engine ran via result.metadata["engine"].
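The routing rules above can be sketched as a small pure function. Illustrative only; the real selection happens inside evaluate():

```python
# Sketch of the documented engine-routing rules. The actual logic
# lives inside evaluate(); this mirrors the four rules above.
TURING_MODELS = {"turing_flash", "turing_small", "turing_large"}

def select_engine(model=None, augment=False):
    if augment and model:
        return "local+llm"   # rule 4: local heuristic, then LLM refinement
    if model is None:
        return "local"       # rule 1: no model -> local heuristics
    if model in TURING_MODELS:
        return "turing"      # rule 2: Turing model names -> cloud engine
    return "llm"             # rule 3: any other model string -> LLM judge
```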

Common Patterns

Error handling

result = evaluate("toxicity", output="test", model="turing_flash")

if result.status == "error":
    print(f"Eval failed: {result.error}")
else:
    print(f"Score: {result.score}")

Augmented evaluation (local + LLM)

result = evaluate(
    "faithfulness",
    output="Paris is the capital of France.",
    context="France is a European country with Paris as its capital.",
    model="gemini/gemini-2.5-flash",
    augment=True,
)

print(result.metadata["engine"])  # "local+llm"

Auto-generated grading criteria

result = evaluate(
    prompt="Check if the response is empathetic and acknowledges the customer's frustration",
    output="I understand this is frustrating. Let me help fix that right away.",
    engine="llm",
    model="gpt-4o",
    generate_prompt=True,
)
# The prompt is expanded into a detailed rubric automatically

Environment Variables

| Variable | Required for | Default |
| --- | --- | --- |
| FI_API_KEY | Cloud (Turing) evals | |
| FI_SECRET_KEY | Cloud (Turing) evals | |
| FI_BASE_URL | Custom API endpoint | https://api.futureagi.com |
| GOOGLE_API_KEY | Gemini models (LLM judge) | |
| OPENAI_API_KEY | OpenAI models (LLM judge) | |
| ANTHROPIC_API_KEY | Claude models (LLM judge) | |
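
A typical shell setup exports the cloud credentials before running evals (placeholder values shown; substitute your own keys):

```shell
# Placeholder values -- replace with your actual credentials
export FI_API_KEY="your-api-key"
export FI_SECRET_KEY="your-secret-key"
# Optional: only needed for a custom API endpoint
export FI_BASE_URL="https://api.futureagi.com"
```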