AutoEval

Auto-generate evaluation pipelines from app descriptions. Pre-built templates for customer support, RAG, code assistants, healthcare, and more.

📝 TL;DR
  • Describe your app in plain English, get a tailored evaluation pipeline
  • 7 pre-built templates: customer_support, rag_system, code_assistant, content_moderation, agent_workflow, healthcare, financial
  • Export configs to YAML/JSON for CI/CD

AutoEval analyzes your app description and recommends the right combination of evaluations and security scanners. It picks metrics based on your app category, risk level, and domain sensitivity — so you don’t have to manually figure out which of the 76+ metrics to use.

Note

Requires pip install ai-evaluation. LLM-powered analysis uses gpt-4o-mini by default (needs OPENAI_API_KEY). Falls back to rule-based analysis if no LLM is available.
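The rule-based fallback is, in spirit, keyword matching over the description. A toy sketch of that idea in plain Python (illustrative only — the keyword lists and priority order here are assumptions, not the library's actual rules):

```python
# Toy rule-based classifier, illustrating the fallback idea only.
# Category names mirror the AppCategory values shown later on this page.
KEYWORDS = {
    "RAG_SYSTEM": ("retriev", "rag", "document"),
    "CODE_ASSISTANT": ("code", "python", "bug"),
    "CUSTOMER_SUPPORT": ("support", "chatbot", "customer"),
}

def rule_based_category(description):
    text = description.lower()
    # First category whose keywords appear wins; dict order encodes priority.
    for category, words in KEYWORDS.items():
        if any(w in text for w in words):
            return category
    return "QUESTION_ANSWERING"  # generic default when nothing matches

category = rule_based_category("A RAG chatbot that retrieves product docs")
```

No API key is needed for this path, which is what makes the fallback useful in offline or CI environments.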

Quick Example

from fi.evals.autoeval import AutoEvalPipeline

# Describe your app — AutoEval picks the right metrics and scanners
pipeline = AutoEvalPipeline.from_description(
    "A RAG-based customer support chatbot that retrieves product docs and answers user questions."
)

# See what it chose
print(pipeline.explain())

# Run it
result = pipeline.evaluate({
    "query": "How do I reset my password?",
    "response": "Go to Settings > Security > Reset Password and follow the prompts.",
    "context": "Password reset is available under Settings > Security.",
})

print(f"Passed: {result.passed}")
print(f"Latency: {result.total_latency_ms:.0f}ms")

Creating Pipelines

From a description

The LLM analyzer detects your app category, risk level, and domain. It then selects appropriate metrics and scanners.

from fi.evals.autoeval import AutoEvalPipeline

pipeline = AutoEvalPipeline.from_description(
    "A healthcare chatbot that answers patient questions about medications and appointments. "
    "It retrieves from medical records and must comply with HIPAA."
)

print(pipeline.explain())
# Shows: category=CUSTOMER_SUPPORT, risk=HIGH, domain=HEALTHCARE
# Evals: faithfulness (threshold 0.85), answer_relevancy (0.8)
# Scanners: PIIScanner, SecretsScanner, ToxicityScanner, JailbreakScanner

From a template

Skip the analysis and use a pre-built configuration.

pipeline = AutoEvalPipeline.from_template("rag_system")

From YAML/JSON

Load a previously exported config.

pipeline = AutoEvalPipeline.from_yaml("eval_config.yaml")

Templates

Template            Evals                                          Scanners                                      Risk
customer_support    answer_relevancy                               Jailbreak, Toxicity, PII                      Medium
rag_system          faithfulness, groundedness, answer_relevancy   Jailbreak                                     Medium
code_assistant      answer_relevancy                               CodeInjection, Secrets, Jailbreak             Medium
content_moderation  —                                              Toxicity, Bias, InvisibleChar, MaliciousURL   High
agent_workflow      action_safety, reasoning_quality               Jailbreak, CodeInjection                      High
healthcare          faithfulness, answer_relevancy                 PII, Secrets, Toxicity, Jailbreak             High
financial           factual_consistency, answer_relevancy          PII, Secrets, Jailbreak                       High

from fi.evals.autoeval import list_templates, get_template

# See all templates
for name, description in list_templates().items():
    print(f"{name}: {description}")

# Get a template config
config = get_template("healthcare")

Customizing a Pipeline

Add, remove, or adjust metrics after creation.

from fi.evals.autoeval import AutoEvalPipeline, EvalConfig, ScannerConfig

pipeline = AutoEvalPipeline.from_template("rag_system")

# Add a metric
pipeline.add(EvalConfig(name="toxicity", threshold=0.8, weight=1.5))

# Add a scanner
pipeline.add(ScannerConfig(name="PIIScanner", action="redact"))

# Adjust thresholds
pipeline.set_threshold("faithfulness", 0.9)

# Disable a metric temporarily
pipeline.disable("groundedness")

# Remove a metric
pipeline.remove("answer_relevancy")
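The weight argument above presumably scales how much each metric counts toward the overall verdict. A generic weighted-threshold sketch of that idea (assumed semantics, not the library's actual aggregation formula):

```python
# Illustrative only: one plausible way per-metric thresholds and weights combine.
def weighted_verdict(scores, configs):
    """scores: {metric: float}; configs: {metric: (threshold, weight)}.

    Returns the weighted fraction of metrics that clear their thresholds.
    """
    total = sum(w for _, w in configs.values())
    passed_weight = sum(
        w for name, (thr, w) in configs.items() if scores[name] >= thr
    )
    return passed_weight / total

ratio = weighted_verdict(
    {"faithfulness": 0.92, "toxicity": 0.70},
    {"faithfulness": (0.9, 1.0), "toxicity": (0.8, 1.5)},
)
# faithfulness clears its threshold; toxicity misses, so its 1.5 weight is lost
```

Giving toxicity a weight of 1.5, as in the `pipeline.add` call above, makes a toxicity miss cost more than a faithfulness miss.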

Running Evaluations

result = pipeline.evaluate({
    "query": "What are the side effects?",
    "response": "Common side effects include headache and nausea.",
    "context": "Side effects: headache, nausea, dizziness.",
})

print(result.passed)            # bool — all checks passed?
print(result.scan_result)       # scanner results (blocking, run first)
print(result.eval_result)       # evaluation results
print(result.metric_results)    # per-metric breakdown
print(result.total_latency_ms)  # total time

Scanners always run first. If any scanner flags a violation (e.g. PII detected), the pipeline can block before the evaluations run.
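That gating order can be sketched in plain Python. The scanner and eval callables below are hypothetical stand-ins, not the library's internals:

```python
# Minimal sketch of scanner-first gating (illustrative, not the library's code).
def run_pipeline(payload, scanners, evals):
    # Scanners run first and can block the whole evaluation.
    for scan in scanners:
        if not scan(payload):  # a failing scanner blocks immediately
            return {"passed": False, "blocked_by": scan.__name__}
    # Only when every scanner passes do the (slower) LLM evaluations run.
    scores = {ev.__name__: ev(payload) for ev in evals}
    return {"passed": all(s >= 0.5 for s in scores.values()), "scores": scores}

# Hypothetical checks standing in for real scanners/metrics
def no_pii(payload):
    return "ssn" not in payload["response"].lower()

def relevancy(payload):
    return 0.9  # stand-in for an LLM-scored metric

result = run_pipeline({"response": "Go to Settings."}, [no_pii], [relevancy])
```

Running scanners before evaluations keeps cheap, deterministic safety checks ahead of expensive LLM calls, so blocked inputs never spend evaluation budget.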

Exporting Configs

Save pipeline configs for version control or CI/CD.

# Export
pipeline.export_yaml("eval_config.yaml")
pipeline.export_json("eval_config.json")

# Import
from fi.evals.autoeval import load_yaml, load_json
config = load_yaml("eval_config.yaml")
pipeline = AutoEvalPipeline.from_config(config)

App Analysis

Under the hood, from_description() uses an AppAnalyzer that classifies your app.

from fi.evals.autoeval import AppAnalyzer

analyzer = AppAnalyzer(model="gpt-4o-mini")
analysis = analyzer.analyze("A code review bot that suggests fixes for Python code")

print(analysis.category)            # AppCategory.CODE_ASSISTANT
print(analysis.risk_level)          # RiskLevel.MEDIUM
print(analysis.domain_sensitivity)  # DomainSensitivity.GENERAL
print(analysis.confidence)          # 0.85
print(analysis.detected_features)   # ["code_generation", "code_review"]

Categories: QUESTION_ANSWERING, RAG_SYSTEM, CUSTOMER_SUPPORT, CODE_ASSISTANT, CONTENT_MODERATION, AGENT_WORKFLOW, and more.
