AutoEval

Auto-generate evaluation pipelines from app descriptions. Pre-built templates for customer support, RAG, code assistants, healthcare, and more.

📝 TL;DR
  • Describe your app in plain English, get a tailored evaluation pipeline
  • 7 pre-built templates: customer_support, rag_system, code_assistant, content_moderation, agent_workflow, healthcare, financial
  • Export configs to YAML/JSON for CI/CD

AutoEval analyzes your app description and recommends the right combination of evaluations and security scanners. It picks metrics based on your app category, risk level, and domain sensitivity — so you don’t have to manually figure out which of the 76+ metrics to use.

Note

Requires pip install ai-evaluation. LLM-powered analysis uses gpt-4o-mini by default (needs OPENAI_API_KEY). Falls back to rule-based analysis if no LLM is available.
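The rule-based fallback is, in spirit, keyword matching over the description. A toy sketch of that idea in plain Python (illustrative only — the keyword lists and priority order here are assumptions, not the library's actual rules):

```python
# Toy rule-based classifier, illustrating the fallback idea only.
# Category names mirror the AppCategory values shown later on this page.
KEYWORDS = {
    "RAG_SYSTEM": ("retriev", "rag", "document"),
    "CODE_ASSISTANT": ("code", "python", "bug"),
    "CUSTOMER_SUPPORT": ("support", "chatbot", "customer"),
}

def rule_based_category(description):
    text = description.lower()
    # First category whose keywords appear wins; dict order encodes priority.
    for category, words in KEYWORDS.items():
        if any(w in text for w in words):
            return category
    return "QUESTION_ANSWERING"  # generic default when nothing matches

category = rule_based_category("A RAG chatbot that retrieves product docs")
```

No API key is needed for this path, which is what makes the fallback useful in offline or CI environments.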

Quick Example

from fi.evals.autoeval import AutoEvalPipeline

# Describe your app — AutoEval picks the right metrics and scanners
pipeline = AutoEvalPipeline.from_description(
    "A RAG-based customer support chatbot that retrieves product docs and answers user questions."
)

# See what it chose
print(pipeline.explain())

# Run it
result = pipeline.evaluate({
    "query": "How do I reset my password?",
    "response": "Go to Settings > Security > Reset Password and follow the prompts.",
    "context": "Password reset is available under Settings > Security.",
})

print(f"Passed: {result.passed}")
print(f"Latency: {result.total_latency_ms:.0f}ms")

Creating Pipelines

From a description

The LLM analyzer detects your app category, risk level, and domain. It then selects appropriate metrics and scanners.

from fi.evals.autoeval import AutoEvalPipeline

pipeline = AutoEvalPipeline.from_description(
    "A healthcare chatbot that answers patient questions about medications and appointments. "
    "It retrieves from medical records and must comply with HIPAA."
)

print(pipeline.explain())
# Shows: category=CUSTOMER_SUPPORT, risk=HIGH, domain=HEALTHCARE
# Evals: faithfulness (threshold 0.85), answer_relevancy (0.8)
# Scanners: PIIScanner, SecretsScanner, ToxicityScanner, JailbreakScanner

From a template

Skip the analysis and use a pre-built configuration.

pipeline = AutoEvalPipeline.from_template("rag_system")

From YAML/JSON

Load a previously exported config.

pipeline = AutoEvalPipeline.from_yaml("eval_config.yaml")

Templates

Template            Evals                                          Scanners                                      Risk
customer_support    answer_relevancy                               Jailbreak, Toxicity, PII                      Medium
rag_system          faithfulness, groundedness, answer_relevancy   Jailbreak                                     Medium
code_assistant      answer_relevancy                               CodeInjection, Secrets, Jailbreak             Medium
content_moderation  —                                              Toxicity, Bias, InvisibleChar, MaliciousURL   High
agent_workflow      action_safety, reasoning_quality               Jailbreak, CodeInjection                      High
healthcare          faithfulness, answer_relevancy                 PII, Secrets, Toxicity, Jailbreak             High
financial           factual_consistency, answer_relevancy          PII, Secrets, Jailbreak                       High

from fi.evals.autoeval import list_templates, get_template

# See all templates
for name, description in list_templates().items():
    print(f"{name}: {description}")

# Get a template config
config = get_template("healthcare")

Customizing a Pipeline

Add, remove, or adjust metrics after creation.

from fi.evals.autoeval import AutoEvalPipeline, EvalConfig, ScannerConfig

pipeline = AutoEvalPipeline.from_template("rag_system")

# Add a metric
pipeline.add(EvalConfig(name="toxicity", threshold=0.8, weight=1.5))

# Add a scanner
pipeline.add(ScannerConfig(name="PIIScanner", action="redact"))

# Adjust thresholds
pipeline.set_threshold("faithfulness", 0.9)

# Disable a metric temporarily
pipeline.disable("groundedness")

# Remove a metric
pipeline.remove("answer_relevancy")
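The weight argument above presumably scales how much each metric counts toward the overall verdict. A generic weighted-threshold sketch of that idea (assumed semantics, not the library's actual aggregation formula):

```python
# Illustrative only: one plausible way per-metric thresholds and weights combine.
def weighted_verdict(scores, configs):
    """scores: {metric: float}; configs: {metric: (threshold, weight)}.

    Returns the weighted fraction of metrics that clear their thresholds.
    """
    total = sum(w for _, w in configs.values())
    passed_weight = sum(
        w for name, (thr, w) in configs.items() if scores[name] >= thr
    )
    return passed_weight / total

ratio = weighted_verdict(
    {"faithfulness": 0.92, "toxicity": 0.70},
    {"faithfulness": (0.9, 1.0), "toxicity": (0.8, 1.5)},
)
# faithfulness clears its threshold; toxicity misses, so its 1.5 weight is lost
```

Giving toxicity a weight of 1.5, as in the `pipeline.add` call above, makes a toxicity miss cost more than a faithfulness miss.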

Running Evaluations

result = pipeline.evaluate({
    "query": "What are the side effects?",
    "response": "Common side effects include headache and nausea.",
    "context": "Side effects: headache, nausea, dizziness.",
})

print(result.passed)            # bool — all checks passed?
print(result.scan_result)       # scanner results (blocking, run first)
print(result.eval_result)       # evaluation results
print(result.metric_results)    # per-metric breakdown
print(result.total_latency_ms)  # total time

Scanners always run first. If any scanner flags a violation (e.g. PII detected), the pipeline can block before the evaluations run.
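That gating order can be sketched in plain Python. The scanner and eval callables below are hypothetical stand-ins, not the library's internals:

```python
# Minimal sketch of scanner-first gating (illustrative, not the library's code).
def run_pipeline(payload, scanners, evals):
    # Scanners run first and can block the whole evaluation.
    for scan in scanners:
        if not scan(payload):  # a failing scanner blocks immediately
            return {"passed": False, "blocked_by": scan.__name__}
    # Only when every scanner passes do the (slower) LLM evaluations run.
    scores = {ev.__name__: ev(payload) for ev in evals}
    return {"passed": all(s >= 0.5 for s in scores.values()), "scores": scores}

# Hypothetical checks standing in for real scanners/metrics
def no_pii(payload):
    return "ssn" not in payload["response"].lower()

def relevancy(payload):
    return 0.9  # stand-in for an LLM-scored metric

result = run_pipeline({"response": "Go to Settings."}, [no_pii], [relevancy])
```

Running scanners before evaluations keeps cheap, deterministic safety checks ahead of expensive LLM calls, so blocked inputs never spend evaluation budget.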

Exporting Configs

Save pipeline configs for version control or CI/CD.

# Export
pipeline.export_yaml("eval_config.yaml")
pipeline.export_json("eval_config.json")

# Import
from fi.evals.autoeval import load_yaml, load_json
config = load_yaml("eval_config.yaml")
pipeline = AutoEvalPipeline.from_config(config)

App Analysis

Under the hood, from_description() uses an AppAnalyzer that classifies your app.

from fi.evals.autoeval import AppAnalyzer

analyzer = AppAnalyzer(model="gpt-4o-mini")
analysis = analyzer.analyze("A code review bot that suggests fixes for Python code")

print(analysis.category)            # AppCategory.CODE_ASSISTANT
print(analysis.risk_level)          # RiskLevel.MEDIUM
print(analysis.domain_sensitivity)  # DomainSensitivity.GENERAL
print(analysis.confidence)          # 0.85
print(analysis.detected_features)   # ["code_generation", "code_review"]

Categories: QUESTION_ANSWERING, RAG_SYSTEM, CUSTOMER_SUPPORT, CODE_ASSISTANT, CONTENT_MODERATION, AGENT_WORKFLOW, and more.
