The evaluate() function is the single entrypoint for all evaluations in the ai-evaluation SDK. It automatically routes to the right engine based on what you pass.
Installation
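The base package installs from PyPI; the package name below is taken from the extras referenced later in this guide (such as ai-evaluation[nli] and ai-evaluation[feedback]):

```shell
pip install ai-evaluation
```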
Quick Start
How Engine Routing Works
You don't need to think about engines; evaluate() figures it out:
| What You Pass | Engine Used | Speed |
|---|---|---|
| Just a metric name ("contains", "faithfulness") | Local | <1ms |
| Metric name + model="turing_flash" | Cloud (Turing) | ~1-3s |
| prompt= + engine="llm" + any model | LLM Judge | ~2-5s |
| Metric + model= + augment=True | Local + LLM | ~2-5s |

You can also force a specific engine with engine="local", engine="turing", or engine="llm".
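As a sketch of the routing rules in the table, the decision logic can be written out roughly as follows (this is inferred from the table above, not the SDK's actual code; route_engine is a hypothetical name):

```python
# Hypothetical routing logic inferred from the routing table; the real
# evaluate() implementation may differ.
def route_engine(metric=None, model=None, engine=None, prompt=None, augment=False):
    if engine:                    # an explicit engine= always wins
        return engine
    if metric and model and augment:
        return "local+llm"        # local heuristic refined by an LLM
    if metric and model:
        return "turing"           # cloud evaluation models
    return "local"                # pure local metric
```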
Function Signature
EvalResult
Every evaluation returns an EvalResult:
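Based on how results are described throughout this guide, an EvalResult plausibly carries a score plus supporting detail. The sketch below is illustrative only; the field names are assumptions, not the SDK's exact API:

```python
from dataclasses import dataclass, field

# Hypothetical shape of an evaluation result, inferred from this guide;
# field names are illustrative only.
@dataclass
class EvalResult:
    score: float                # 0.0-1.0; binary metrics return exactly 0 or 1
    reason: str = ""            # explanation, when the engine provides one
    metadata: dict = field(default_factory=dict)  # e.g. sub-metric breakdowns
```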
BatchResult
When you pass a list of eval names, you get a BatchResult:
Local Metrics (No API Key)
These run instantly on your machine. No network calls, no API keys.

String Checks
Similarity Metrics
Scores are continuous 0.0–1.0:

Hallucination Detection
Uses NLI model when available (pip install ai-evaluation[nli]), falls back to heuristics:
RAG Metrics
For evaluating Retrieval-Augmented Generation pipelines:

Guardrails (Security)
Block attacks in <10ms:

Complete Metrics Reference
Every metric below works through evaluate(). Scores range from 0.0 to 1.0 unless noted otherwise. Binary metrics return exactly 0 or 1.
String Checks (16 metrics)
Deterministic, regex-based checks. All scores are binary (0 or 1). No API key or model required.
Examples:
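For illustration, the deterministic logic behind two of these checks can be sketched in plain Python (this mirrors the table's descriptions, not the SDK's actual source):

```python
import re

# Illustrative re-implementations of two string checks described below;
# the SDK's own implementations may differ in detail.
def contains(output: str, keyword: str) -> int:
    # Case-insensitive substring check, scored as binary 0/1.
    return int(keyword.lower() in output.lower())

def regex(output: str, pattern: str) -> int:
    # Binary check: does the output match the given regex pattern?
    return int(re.search(pattern, output) is not None)

print(contains("Paris is the capital of France.", "paris"))  # 1
print(regex("Order #12345 confirmed", r"#\d+"))              # 1
```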
| Metric | What It Checks | Required Inputs |
|---|---|---|
contains | Output contains a keyword (case-insensitive) | output, keyword |
contains_all | Output contains ALL keywords | output, keywords (list) |
contains_any | Output contains ANY keyword | output, keywords (list) |
contains_none | Output contains NONE of the keywords | output, keywords (list) |
contains_email | Output contains an email address | output |
contains_link | Output contains a URL | output |
contains_valid_link | Output contains a reachable URL (makes HTTP request) | output |
is_email | Entire output is a valid email address | output |
one_line | Output is a single line (no newlines) | output |
equals | Exact string match | output, expected_output |
starts_with | Output starts with keyword | output, keyword |
ends_with | Output ends with keyword | output, keyword |
regex | Output matches a regex pattern | output, keyword (pattern) |
length_less_than | Output length < N characters | output, keyword (N as string) |
length_greater_than | Output length > N characters | output, keyword (N as string) |
length_between | Output length within a range | output, config={"min": N, "max": M} |
JSON Metrics (5 metrics)
Validate JSON structure, schema compliance, and syntax. All scores are binary.
Examples:
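As a sketch, the binary is_json check amounts to attempting a parse (an illustration of the behavior described below, not the SDK's source):

```python
import json

# Illustrative sketch of the binary is_json check: 1 if the entire
# output parses as valid JSON, else 0.
def is_json(output: str) -> int:
    try:
        json.loads(output)
        return 1
    except json.JSONDecodeError:
        return 0

print(is_json('{"status": "ok"}'))  # 1
print(is_json("status: ok"))        # 0
```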
| Metric | What It Checks | Required Inputs |
|---|---|---|
contains_json | Output contains a JSON object anywhere | output |
is_json | Entire output is valid JSON | output |
json_schema | Output matches a JSON schema | output, expected_output (schema string) |
json_validation | JSON is valid and well-formed with detailed error reporting | output |
json_syntax | JSON syntax correctness (brackets, quotes, commas) | output |
Similarity Metrics (7 metrics)
Continuous scores from 0.0 to 1.0 measuring textual or semantic similarity. No API key required.
Examples:
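The edit-distance metric follows the standard Levenshtein definition; a minimal sketch of the normalization to a 0-1 similarity (the SDK's implementation may differ):

```python
def levenshtein_similarity(a: str, b: str) -> float:
    # Classic dynamic-programming edit distance, normalized so that
    # identical strings score 1.0 and fully different strings score 0.0.
    if not a and not b:
        return 1.0
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,          # deletion
                           cur[j - 1] + 1,       # insertion
                           prev[j - 1] + (ca != cb)))  # substitution
        prev = cur
    return 1.0 - prev[-1] / max(len(a), len(b))
```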
| Metric | What It Measures | Required Inputs |
|---|---|---|
bleu_score | BLEU n-gram overlap (standard machine translation metric) | output, expected_output |
rouge_score | ROUGE recall-oriented overlap (for summarization) | output, expected_output |
recall_score | Token-level recall of expected tokens | output, expected_output |
levenshtein_similarity | Edit distance normalized to 0-1 similarity | output, expected_output
numeric_similarity | Closeness of numeric values | output, expected_output |
embedding_similarity | Cosine similarity of sentence embeddings | output, expected_output |
semantic_list_contains | Whether output semantically matches any item in a list | output, expected_output (list) |
embedding_similarity and semantic_list_contains use sentence-transformers. Install with pip install ai-evaluation[nli].

Hallucination Detection (5 metrics)
Detect fabricated, unsupported, or contradictory claims. Uses a DeBERTa NLI model when available, falling back to heuristics. All support augment=True for LLM refinement.

Examples:

| Metric | What It Measures | Required Inputs |
|---|---|---|
faithfulness | Whether the response is faithful to context (high = faithful) | output, context |
claim_support | Fraction of claims in output supported by context | output, context |
factual_consistency | Factual consistency between output and context | output, context |
contradiction_detection | Detects contradictions (high = no contradictions) | output, context |
hallucination_score | Overall hallucination detection (high = not hallucinated) | output, context |
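As a rough illustration of what a keyword-based fallback computes, claim-support-style scoring can be approximated by lexical overlap between output sentences and the context. This is a crude stand-in, not the SDK's actual heuristic or NLI model:

```python
# Crude lexical approximation of claim support: fraction of output
# sentences whose words mostly appear in the context. Illustration only.
def claim_support_heuristic(output: str, context: str) -> float:
    ctx = set(context.lower().split())
    claims = [s.strip() for s in output.split(".") if s.strip()]
    if not claims:
        return 0.0
    supported = 0
    for claim in claims:
        words = claim.lower().split()
        overlap = sum(w in ctx for w in words) / len(words)
        supported += overlap >= 0.5   # arbitrary illustrative threshold
    return supported / len(claims)
```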
For best accuracy, install the NLI model: pip install ai-evaluation[nli]. Without it, a keyword-based heuristic is used with a warning.

Function Calling (4 metrics)
Evaluate LLM function/tool calling accuracy. Output and expected_output should be JSON strings representing function calls.
Examples:
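As a sketch of the binary name check, comparing the "name" field of two JSON-encoded function calls looks like this (the payload shape shown is an assumption; the SDK only requires JSON strings):

```python
import json

# Illustrative sketch of function_name_match: 1 if both JSON-encoded
# calls name the same function, else 0. Payload shape is an assumption.
def function_name_match(output: str, expected_output: str) -> int:
    try:
        return int(json.loads(output).get("name")
                   == json.loads(expected_output).get("name"))
    except (json.JSONDecodeError, AttributeError):
        return 0
```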
| Metric | What It Measures | Required Inputs |
|---|---|---|
function_name_match | Did the LLM call the correct function? (binary) | output, expected_output |
parameter_validation | Are the function parameters correct? | output, expected_output |
function_call_accuracy | Overall accuracy of the function call (name + params) | output, expected_output |
function_call_exact_match | Exact match of the entire function call JSON | output, expected_output |
Agent Trajectory (7 metrics)
Evaluate autonomous agent behavior across multi-step trajectories. Output should be a trajectory JSON (list of steps), and expected_output should describe the goal/task.
Examples:
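A hypothetical trajectory payload might look like the following; the step field names (thought, action, input) are illustrative, since the guide only specifies a JSON list of steps:

```python
import json

# Hypothetical trajectory shape: a JSON list of steps. Field names
# inside each step are illustrative, not a documented schema.
trajectory = json.dumps([
    {"thought": "Need current weather", "action": "search",
     "input": "weather in Paris"},
    {"thought": "Found forecast", "action": "respond",
     "input": "It is 18C and sunny."},
])
task = "Tell the user the current weather in Paris"

steps = json.loads(trajectory)
print(len(steps))  # 2
```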
| Metric | What It Measures | Required Inputs |
|---|---|---|
task_completion | Was the task completed successfully? | output (trajectory), expected_output (task) |
step_efficiency | Were the steps efficient (no unnecessary actions)? | output (trajectory), expected_output (task) |
tool_selection_accuracy | Did the agent choose the right tools? | output (trajectory), expected_output (task) |
trajectory_score | Overall trajectory quality | output (trajectory), expected_output (trajectory) |
goal_progress | How much progress was made toward the goal? | output (trajectory), expected_output (task) |
action_safety | Are the agent’s actions safe? Supports augment=True | output (trajectory) |
reasoning_quality | Quality of the agent’s reasoning. Supports augment=True | output (reasoning text) |
RAG Retrieval (8 metrics)
Evaluate the retrieval step of RAG pipelines. Measures how well retrieved context matches the query and ground truth.
Examples:
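The ranking metrics follow their standard information-retrieval definitions. As a sketch over a binary relevance list (1 = relevant chunk, ordered by retrieval rank):

```python
# Standard IR definitions, sketched over a binary relevance list.
def mrr(ranked_relevance: list) -> float:
    # Reciprocal rank of the first relevant result (0.0 if none).
    for i, rel in enumerate(ranked_relevance, start=1):
        if rel:
            return 1.0 / i
    return 0.0

def precision_at_k(ranked_relevance: list, k: int) -> float:
    # Fraction of the top-K retrieved chunks that are relevant.
    return sum(ranked_relevance[:k]) / k
```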
| Metric | What It Measures | Required Inputs |
|---|---|---|
context_recall | How much ground truth is covered by retrieved context | output, context (list), input (query), expected_output |
context_precision | Precision of retrieved context relative to query | output, context (list), input (query) |
context_entity_recall | Entity-level recall from context vs. ground truth | output, context (list), expected_output |
noise_sensitivity | How sensitive is the model to irrelevant context? | output, context (list), input (query) |
ndcg | Normalized Discounted Cumulative Gain (ranking quality) | output, context (list), expected_output |
mrr | Mean Reciprocal Rank (position of first relevant result) | output, context (list), expected_output |
precision_at_k | Precision at top-K retrieved chunks | output, context (list), expected_output, config={"k": N} |
recall_at_k | Recall at top-K retrieved chunks | output, context (list), expected_output, config={"k": N} |
RAG Generation (6 metrics)
Evaluate the generation step of RAG pipelines. Measures how well the LLM answer uses and stays faithful to the retrieved context.
Examples:
| Metric | What It Measures | Required Inputs |
|---|---|---|
answer_relevancy | Relevance of the answer to the original query | output, context (list), input (query) |
context_utilization | How effectively retrieved context was used | output, context (list) |
context_relevance_to_response | How relevant the context is to the generated response | output, context (list) |
rag_faithfulness | Faithfulness of the answer to retrieved context | output, context (list), input (query) |
rag_faithfulness_with_reference | Faithfulness checked against both context and reference | output, context (list), input (query), expected_output |
groundedness | Whether output is grounded in (supported by) context | output, context (list) |
RAG Advanced (3 metrics)
Advanced RAG capabilities: multi-hop reasoning, source attribution, and citation checking.
Examples:
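As a sketch, a citation-presence check reduces to pattern matching; the patterns below (bracketed numbers, "(Source:" markers) are illustrative examples, not the SDK's actual rule set:

```python
import re

# Illustrative citation-presence check: 1 if the output contains any
# citation-like marker. Patterns here are examples only.
def citation_presence(output: str) -> int:
    return int(bool(re.search(r"\[\d+\]|\(Source:", output)))
```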
| Metric | What It Measures | Required Inputs |
|---|---|---|
multi_hop_reasoning | Quality of reasoning across multiple context chunks | output, context (list), input (query) |
source_attribution | Correctness of source citations in the response | output, context (list) |
citation_presence | Whether the response includes citations at all | output, context (list) |
RAG Composite (2 metrics)
Aggregate scores that combine multiple RAG sub-metrics into a single number.
Examples:
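Conceptually, the composite is an aggregation of the three sub-metrics named below. The sketch assumes equal weighting, which is an assumption; the SDK's actual weights are not documented here:

```python
# Illustrative composite: unweighted mean of the three sub-metrics.
# Equal weighting is an assumption, not the SDK's documented behavior.
def rag_score(faithfulness: float, relevancy: float, groundedness: float) -> float:
    return (faithfulness + relevancy + groundedness) / 3
```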
| Metric | What It Measures | Required Inputs |
|---|---|---|
rag_score | Composite RAG quality (faithfulness + relevancy + groundedness) | output, context (list), input (query) |
rag_score_detailed | Same as rag_score but returns per-sub-metric breakdown in metadata | output, context (list), input (query) |
Structured Output (9 metrics)
Evaluate JSON, YAML, and other structured outputs for correctness, completeness, and schema compliance.
Examples:
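As a sketch of the completeness idea, the fraction of expected top-level fields present in a parsed output object can be computed like this (a simplification of the metric described below):

```python
# Illustrative sketch of field_completeness over parsed objects:
# fraction of expected top-level keys present in the output.
def field_completeness(output_obj: dict, expected_obj: dict) -> float:
    expected = set(expected_obj)
    if not expected:
        return 1.0
    return len(expected & set(output_obj)) / len(expected)
```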
| Metric | What It Measures | Required Inputs |
|---|---|---|
schema_compliance | Full JSON schema validation | output, expected_output (schema dict or string) |
type_compliance | Correct data types for all fields | output, expected_output |
field_completeness | Fraction of expected fields present | output, expected_output |
required_fields | Whether all required fields are present (binary) | output, expected_output (schema with required) |
field_coverage | Field-level coverage (present/total) | output, expected_output |
hierarchy_score | Structural hierarchy similarity (nested objects) | output, expected_output |
tree_edit_distance | Tree edit distance between two structured outputs | output, expected_output |
structured_output_score | Composite score for structured output quality | output, expected_output |
quick_structured_check | Fast binary pass/fail for structure validity | output, expected_output |
Guardrails (4+ metrics via evaluate)

Fast security scanners accessible through evaluate(). All run locally in <10ms. For the full scanner pipeline API, see the Guardrails Guide.

Examples:

For advanced guardrails (scanner pipelines, model-based screening, ensembles), use the scanner API directly:

| Metric | What It Detects | Required Inputs |
|---|---|---|
prompt_injection | Prompt manipulation, DAN attacks, role-play exploits | output |
pii_detection | SSNs, credit cards, emails, phone numbers, and other PII | output |
secret_detection | API keys, passwords, private keys, JWTs, connection strings | output |
sql_injection | SQL injection patterns (DROP, UNION SELECT, etc.) | output |
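As an illustration of the pattern-based approach, a SQL-injection check can be sketched with a few regexes; the patterns below are examples only, not the SDK's actual rule set:

```python
import re

# Illustrative patterns only; the SDK's scanner uses its own rule set.
SQL_PATTERNS = [
    r"(?i)\bunion\s+select\b",
    r"(?i)\bdrop\s+table\b",
    r"(?i)'\s*or\s*'1'\s*=\s*'1",
]

def sql_injection(output: str) -> int:
    # Binary: 1 if any SQL-injection pattern matches the output.
    return int(any(re.search(p, output) for p in SQL_PATTERNS))
```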
Cloud Evaluations (Turing Models)
Use Future AGI's purpose-built evaluation models for production-grade accuracy. Requires FI_API_KEY and FI_SECRET_KEY. Get them from Admin Settings.

Available models: turing_flash, turing_small.
LLM-as-Judge (Custom Criteria)
Write your own evaluation criteria and use any LLM as the judge. Works with Gemini, GPT-4, Claude, Llama, or any model supported by LiteLLM.

Supported Models

Any LiteLLM model string works; set the corresponding API key in your environment (GOOGLE_API_KEY, OPENAI_API_KEY, etc.).
Using Placeholders
Your prompt can reference any input field with {field_name}:
Multimodal Evaluation (Images & Audio)
Pass image or audio URLs directly to the LLM judge. The model sees the actual media, not just the URL text. Requires a vision/audio-capable model like gemini/gemini-2.5-flash or gpt-4o.

Image Evaluation
Comparing Two Images
Audio Evaluation
Supported Media Keys
| Key | Type | Description |
|---|---|---|
image_url | Image | Single image to evaluate |
input_image_url | Image | Reference/input image |
output_image_url | Image | Generated/output image |
audio_url | Audio | Audio file to evaluate |
input_audio_url | Audio | Reference audio |
Auto-Generate Grading Criteria
Don't want to write a detailed rubric? Describe what you want to evaluate in plain English, and the LLM generates the criteria for you.

LLM Augmentation
Run a fast local heuristic first, then have an LLM refine the judgment. Best of both worlds: the speed of local metrics plus the accuracy of LLM judges.

How It Works
- Local metric runs first (faithfulness heuristic) → produces initial score + reasoning
- LLM receives the original inputs + the heuristic’s analysis
- LLM produces a refined score with better nuance
Which Metrics Support Augmentation?
Metrics with supports_llm_judge = True:

- faithfulness
- hallucination_score
- task_completion
- action_safety
- reasoning_quality
- claim_support
- factual_consistency
Feedback Loop
Submit corrections when the judge gets it wrong. Future evaluations use your corrections as few-shot examples, improving accuracy over time.

Feedback Stores

Two stores are available:

| Store | Use Case | Persistence | Search |
|---|---|---|---|
InMemoryFeedbackStore | Unit tests, quick experiments | In-memory only (lost on exit) | Recency-based |
ChromaFeedbackStore | Production, CI pipelines | Persistent (ChromaDB on disk) | Semantic vector search |
Correcting Results
ChromaFeedbackStore requires the feedback extra: pip install ai-evaluation[feedback]. InMemoryFeedbackStore works with the base install.

OpenTelemetry Tracing
Wire evaluation scores into your observability stack (Jaeger, Datadog, Grafana).

Streaming Evaluation
Monitor LLM output token-by-token in real time. Detect toxicity, PII, jailbreaks, and quality degradation mid-generation with configurable early stopping.

Basic Usage
Factory Methods
Use pre-configured evaluators for common scenarios:

Early Stop Policies
Control when generation should be halted:

Built-in Scorers
| Scorer | Direction | Description |
|---|---|---|
toxicity_scorer | Lower is better | Keyword-based toxicity detection |
safety_scorer | Higher is better | Inverse of toxicity |
pii_scorer | Lower is better | PII pattern detection |
jailbreak_scorer | Lower is better | Jailbreak pattern detection |
coherence_scorer | Higher is better | Text coherence heuristic |
quality_scorer | Higher is better | Combined quality heuristic |
safety_composite_scorer | Higher is better | Weighted: toxicity + PII + jailbreak |
quality_composite_scorer | Higher is better | Weighted: coherence + quality |
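As a sketch of a keyword-based scorer in the spirit of toxicity_scorer (lower is better, per the table), the blocklist below is a stand-in, not the SDK's actual word list:

```python
# Illustrative keyword-based toxicity scorer; the blocklist is a
# stand-in, not the SDK's actual word list. Lower is better.
BLOCKLIST = {"hate", "stupid"}

def toxicity_scorer(text: str) -> float:
    # Fraction of tokens that hit the blocklist.
    tokens = [t.strip(".,!?").lower() for t in text.split()]
    if not tokens:
        return 0.0
    return sum(t in BLOCKLIST for t in tokens) / len(tokens)
```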
Async Streaming (e.g., OpenAI)
StreamingConfig Options
Key configuration fields:

| Option | Default | Description |
|---|---|---|
min_chunk_size | 1 | Minimum characters to trigger assessment |
max_chunk_size | 100 | Maximum characters per chunk |
eval_interval_ms | 100 | Minimum milliseconds between assessments |
max_tokens | None | Stop after N tokens |
max_chars | None | Stop after N characters |
enable_early_stop | True | Enable early stopping |
stop_on_first_failure | False | Immediate stop on any failure |
eval_on_sentence_end | True | Also assess at sentence boundaries |
Environment Variables
| Variable | Required For | Description |
|---|---|---|
FI_API_KEY | Cloud evals | Future AGI API key |
FI_SECRET_KEY | Cloud evals | Future AGI secret key |
FI_BASE_URL | Cloud evals | API base URL (default: production) |
GOOGLE_API_KEY | Gemini models | Google AI Studio key |
OPENAI_API_KEY | OpenAI models | OpenAI key |
ANTHROPIC_API_KEY | Claude models | Anthropic key |