# AutoEval
Auto-generate evaluation pipelines from app descriptions. Pre-built templates for customer support, RAG, code assistants, healthcare, and more.
- Describe your app in plain English, get a tailored evaluation pipeline
- 7 pre-built templates: customer_support, rag_system, code_assistant, content_moderation, agent_workflow, healthcare, financial
- Export configs to YAML/JSON for CI/CD
AutoEval analyzes your app description and recommends the right combination of evaluations and security scanners. It picks metrics based on your app category, risk level, and domain sensitivity — so you don’t have to manually figure out which of the 76+ metrics to use.
> **Note:** Requires `pip install ai-evaluation`. LLM-powered analysis uses `gpt-4o-mini` by default (needs `OPENAI_API_KEY`). Falls back to rule-based analysis if no LLM is available.
## Quick Example

```python
from fi.evals.autoeval import AutoEvalPipeline

# Describe your app — AutoEval picks the right metrics and scanners
pipeline = AutoEvalPipeline.from_description(
    "A RAG-based customer support chatbot that retrieves product docs and answers user questions."
)

# See what it chose
print(pipeline.explain())

# Run it
result = pipeline.evaluate({
    "query": "How do I reset my password?",
    "response": "Go to Settings > Security > Reset Password and follow the prompts.",
    "context": "Password reset is available under Settings > Security.",
})
print(f"Passed: {result.passed}")
print(f"Latency: {result.total_latency_ms:.0f}ms")
```
## Creating Pipelines

### From a description

The LLM analyzer detects your app category, risk level, and domain, then selects appropriate metrics and scanners.

```python
from fi.evals.autoeval import AutoEvalPipeline

pipeline = AutoEvalPipeline.from_description(
    "A healthcare chatbot that answers patient questions about medications and appointments. "
    "It retrieves from medical records and must comply with HIPAA."
)

print(pipeline.explain())
# Shows: category=CUSTOMER_SUPPORT, risk=HIGH, domain=HEALTHCARE
# Evals: faithfulness (threshold 0.85), answer_relevancy (0.8)
# Scanners: PIIScanner, SecretsScanner, ToxicityScanner, JailbreakScanner
```
### From a template

Skip the analysis and use a pre-built configuration.

```python
pipeline = AutoEvalPipeline.from_template("rag_system")
```

### From YAML/JSON

Load a previously exported config.

```python
pipeline = AutoEvalPipeline.from_yaml("eval_config.yaml")
```
## Templates

| Template | Evals | Scanners | Risk |
|---|---|---|---|
| customer_support | answer_relevancy | Jailbreak, Toxicity, PII | Medium |
| rag_system | faithfulness, groundedness, answer_relevancy | Jailbreak | Medium |
| code_assistant | answer_relevancy | CodeInjection, Secrets, Jailbreak | Medium |
| content_moderation | — | Toxicity, Bias, InvisibleChar, MaliciousURL | High |
| agent_workflow | action_safety, reasoning_quality | Jailbreak, CodeInjection | High |
| healthcare | faithfulness, answer_relevancy | PII, Secrets, Toxicity, Jailbreak | High |
| financial | factual_consistency, answer_relevancy | PII, Secrets, Jailbreak | High |

```python
from fi.evals.autoeval import list_templates, get_template

# See all templates
for name, description in list_templates().items():
    print(f"{name}: {description}")

# Get a template config
config = get_template("healthcare")
```
## Customizing a Pipeline

Add, remove, or adjust metrics after creation.

```python
from fi.evals.autoeval import AutoEvalPipeline, EvalConfig, ScannerConfig

pipeline = AutoEvalPipeline.from_template("rag_system")

# Add a metric
pipeline.add(EvalConfig(name="toxicity", threshold=0.8, weight=1.5))

# Add a scanner
pipeline.add(ScannerConfig(name="PIIScanner", action="redact"))

# Adjust thresholds
pipeline.set_threshold("faithfulness", 0.9)

# Disable a metric temporarily
pipeline.disable("groundedness")

# Remove a metric
pipeline.remove("answer_relevancy")
```
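The `weight` argument suggests that metrics contribute unevenly to the overall verdict. As a conceptual sketch only (the library's actual aggregation logic is not documented here), a weighted average over per-metric scores could look like this:

```python
# Conceptual sketch of weighted metric aggregation; the library's real
# scoring may differ. metric_results maps name -> (score, weight).
def weighted_score(metric_results: dict) -> float:
    total_weight = sum(w for _, w in metric_results.values())
    return sum(s * w for s, w in metric_results.values()) / total_weight

overall = weighted_score({
    "faithfulness": (0.9, 1.0),  # default weight
    "toxicity": (0.8, 1.5),      # weighted heavier, as in the example above
})
# (0.9*1.0 + 0.8*1.5) / (1.0 + 1.5) = 2.1 / 2.5 = 0.84
```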
## Running Evaluations

```python
result = pipeline.evaluate({
    "query": "What are the side effects?",
    "response": "Common side effects include headache and nausea.",
    "context": "Side effects: headache, nausea, dizziness.",
})

print(result.passed)            # bool — all checks passed?
print(result.scan_result)       # scanner results (blocking, run first)
print(result.eval_result)       # evaluation results
print(result.metric_results)    # per-metric breakdown
print(result.total_latency_ms)  # total time
```

Scanners run first. If any scanner fails (e.g. PII detected), the pipeline can block before evaluations run.
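The two-phase order can be sketched in plain Python. This is a conceptual illustration of the blocking behavior, not the library's implementation; the result fields, the sample scanner, and the 0.8 pass threshold are all assumptions for the sketch:

```python
from dataclasses import dataclass, field

@dataclass
class PipelineResult:
    passed: bool
    scan_passed: bool
    metric_results: dict = field(default_factory=dict)

def run_pipeline(sample, scanners, evals):
    # Phase 1: scanners. Any failure blocks the pipeline immediately.
    for scan in scanners:
        if not scan(sample):
            return PipelineResult(passed=False, scan_passed=False)
    # Phase 2: evaluations, reached only when every scanner passes.
    metrics = {name: fn(sample) for name, fn in evals.items()}
    return PipelineResult(
        passed=all(score >= 0.8 for score in metrics.values()),
        scan_passed=True,
        metric_results=metrics,
    )

# Toy scanner: "detects PII" when an email-like token appears
no_pii = lambda s: "@" not in s["response"]
evals = {"answer_relevancy": lambda s: 0.9}

blocked = run_pipeline({"response": "email me at a@b.co"}, [no_pii], evals)
ok = run_pipeline({"response": "Go to Settings."}, [no_pii], evals)
```

Note that the blocked result carries no metric scores at all: the evaluations were never run.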
## Exporting Configs

Save pipeline configs for version control or CI/CD.

```python
# Export
pipeline.export_yaml("eval_config.yaml")
pipeline.export_json("eval_config.json")

# Import
from fi.evals.autoeval import load_yaml, load_json
config = load_yaml("eval_config.yaml")
pipeline = AutoEvalPipeline.from_config(config)
```
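The exact schema of an exported file is whatever `export_yaml()` produces. Purely as a rough illustration, a config might carry fields mirroring `EvalConfig` and `ScannerConfig` (the field names below are assumptions, not the documented format):

```yaml
# Hypothetical exported config — field names mirror EvalConfig/ScannerConfig
evals:
  - name: faithfulness
    threshold: 0.85
    weight: 1.0
scanners:
  - name: JailbreakScanner
    action: block
```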
## App Analysis

Under the hood, `from_description()` uses an `AppAnalyzer` that classifies your app.

```python
from fi.evals.autoeval import AppAnalyzer

analyzer = AppAnalyzer(model="gpt-4o-mini")
analysis = analyzer.analyze("A code review bot that suggests fixes for Python code")

print(analysis.category)            # AppCategory.CODE_ASSISTANT
print(analysis.risk_level)          # RiskLevel.MEDIUM
print(analysis.domain_sensitivity)  # DomainSensitivity.GENERAL
print(analysis.confidence)          # 0.85
print(analysis.detected_features)   # ["code_generation", "code_review"]
```

Categories: `QUESTION_ANSWERING`, `RAG_SYSTEM`, `CUSTOMER_SUPPORT`, `CODE_ASSISTANT`, `CONTENT_MODERATION`, `AGENT_WORKFLOW`, and more.
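If you want to drive template selection from the analyzer yourself, the routing can be as simple as a lookup. This mapping is illustrative (the library may already do something like it internally) and uses the category names above as plain strings rather than `AppCategory` members:

```python
# Illustrative routing from a detected category to a pre-built template.
# Keys are category names from the docs, as plain strings; the fallback
# template choice is an assumption for this sketch.
CATEGORY_TO_TEMPLATE = {
    "RAG_SYSTEM": "rag_system",
    "CUSTOMER_SUPPORT": "customer_support",
    "CODE_ASSISTANT": "code_assistant",
    "CONTENT_MODERATION": "content_moderation",
    "AGENT_WORKFLOW": "agent_workflow",
}

def pick_template(category: str, default: str = "customer_support") -> str:
    return CATEGORY_TO_TEMPLATE.get(category, default)
```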