The Problem
You are launching a new AI product — a RAG-powered healthcare chatbot. Your PM asks: “What should we test?” You do not want to manually pick from 50+ metrics and figure out thresholds. Instead, describe your app and let AutoEval build the right pipeline for you.What You Will Learn
- How to generate a pipeline from a plain-English description
- How to use pre-built templates for common application categories
- How to run the pipeline against real inputs
- How to customize thresholds and add/remove metrics
- How to export the config as YAML or JSON for CI/CD integration
Prerequisites
Step 1: Describe Your App, Get a Test Plan
Pass a natural language description of your application toAutoEvalPipeline.from_description(). It analyzes the description and selects appropriate metrics, scanners, and thresholds.
Step 2: Run the Pipeline
Build a simpler pipeline for testing and run it against real inputs. Thepipeline.evaluate() method runs all configured metrics and scanners in one call.
- Test 1 passes all checks
- Test 2 is blocked by the JailbreakScanner before metrics even run
- Test 3 fails faithfulness because the response contradicts the context
Step 3: Use Pre-Built Templates
For common application types, use templates that come with sensible defaults:Step 4: Customize the Pipeline
Start from a template and iterate based on team feedback:Step 5: Export for CI/CD
Export the pipeline configuration as YAML or JSON, commit it to your repository, and load it in your CI/CD pipeline.Workflow Summary
| Step | Action | Method |
|---|---|---|
| 1 | Describe your app | AutoEvalPipeline.from_description(...) |
| 2 | Or use a template | AutoEvalPipeline.from_template("rag_system") |
| 3 | Run against test cases | pipeline.evaluate(inputs={...}) |
| 4 | Customize thresholds | pipeline.set_threshold(...), pipeline.add(...) |
| 5 | Export config | pipeline.export_yaml("pipeline.yaml") |
| 6 | Load in CI/CD | AutoEvalPipeline.from_yaml("pipeline.yaml") |
What to Try Next
AutoEval gives you the pipeline. But what if your LLM judge keeps getting the same cases wrong? Teach it from past mistakes using a feedback loop.Next: Feedback Loop
Store developer corrections in ChromaDB and teach your LLM judge to stop repeating mistakes.