How It Differs from Prompt Optimization
Prompt Optimization
- Targets agent system prompts
- Uses simulation (conversational) evaluation
- Best for improving agent behavior
- Input: Agent configuration
Dataset Optimization
- Targets dataset column prompts
- Uses direct (input/output) evaluation
- Best for improving training & eval data
- Input: Dataset column + evaluation templates
Key Concepts
Optimization Run
An optimization run connects the following components (a configuration sketch follows the list):

- Column – The dataset column containing the prompt template to optimize.
- Optimizer Algorithm – The strategy used to find better prompts (e.g., Bayesian Search, ProTeGi).
- Evaluation Templates – The evaluations used to score how well each prompt variation performs.
- Teacher Model – The LLM used for optimization decisions (generating new prompt candidates).
- Inference Model – The LLM used to execute prompts and generate outputs during each trial.
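To make these parts concrete, here is a minimal sketch of how such a run configuration could be expressed in Python. Every name here (`OptimizationRunConfig`, its fields, and the model identifiers) is a hypothetical illustration, not this platform's actual API:

```python
from dataclasses import dataclass, field

# Hypothetical illustration of the pieces an optimization run ties together.
# None of these names come from the platform's SDK.
@dataclass
class OptimizationRunConfig:
    dataset_column: str        # column holding the prompt template to optimize
    optimizer: str             # e.g. "bayesian_search", "protegi", "gepa"
    teacher_model: str         # LLM that proposes new prompt candidates
    inference_model: str       # LLM that executes prompts during each trial
    evaluation_templates: list[str] = field(default_factory=list)

config = OptimizationRunConfig(
    dataset_column="summary_prompt",
    optimizer="bayesian_search",
    teacher_model="gpt-4o",            # assumption: a capable reasoning model
    inference_model="gpt-4o-mini",     # assumption: a cheaper model for bulk trials
    evaluation_templates=["summary_quality", "tone"],
)
```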
Trials
Each optimization run consists of multiple trials:

- Baseline Trial – The original prompt is evaluated first to establish a baseline score.
- Variation Trials – New prompt variations generated by the optimizer algorithm.
- Each trial receives an average score based on the configured evaluation templates (see the scoring sketch after this list).
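The per-trial score is simply a mean over rows and templates. A minimal sketch of that aggregation, with hypothetical function names:

```python
# Minimal sketch of trial scoring: average each evaluation template's per-row
# scores, then average across templates. All names are hypothetical.
def score_trial(outputs, references, evaluators):
    template_means = []
    for evaluate in evaluators:            # each evaluator scores one row in [0, 1]
        row_scores = [evaluate(out, ref) for out, ref in zip(outputs, references)]
        template_means.append(sum(row_scores) / len(row_scores))
    return sum(template_means) / len(template_means)   # the trial's average score
```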
Evaluation Templates
Evaluation templates define how each prompt variation is scored across dataset rows. You can use:

- Built-in templates – Pre-configured evaluations like `summary_quality`, `context_adherence`, `tone`, and more.
- Custom templates – Define your own evaluation criteria tailored to your specific use case (a sketch follows this list).
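A custom template ultimately reduces to a scoring rule applied per row. The sketch below shows one plausible shape as a plain Python function; the structure and keywords are illustrative assumptions, not the platform's actual template format:

```python
# Hypothetical custom evaluation template: rewards outputs that stay within a
# word budget and mention required domain keywords. Returns a score in [0, 1].
def concise_and_on_topic(output: str, reference: str) -> float:
    within_budget = len(output.split()) <= 100
    required = {"refund", "order"}                  # assumption: domain keywords
    on_topic = required.issubset(output.lower().split())
    return (within_budget + on_topic) / 2
```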
How It Works
Create an Optimization Run
Navigate to your dataset and click the Optimize button in the top action bar. Select the column containing the prompt template you want to optimize, choose an optimizer algorithm, configure the teacher and inference models, and select your evaluation templates.
Baseline Evaluation
The system evaluates your original prompt against the dataset to establish a baseline score. This score serves as the benchmark for measuring improvements.
Optimization Loop
The optimizer generates new prompt variations, runs them against dataset rows (up to 50) using the inference model, and scores them using your evaluation templates. Each variation is saved as a trial with its results.
Optimizations can be paused and resumed – the optimizer state is persisted after each trial, so you won't lose progress if a run is interrupted.
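In outline, the loop looks something like the Python sketch below. `ask_teacher` and `evaluate_prompt` are placeholder stand-ins for the teacher-model call and the inference-plus-evaluation step (later sketches reuse them); the checkpoint-after-every-trial pattern is the behavior described above:

```python
import json
import random

# Placeholder stand-ins for the real components.
def ask_teacher(instruction: str) -> str:
    return instruction.splitlines()[-1] + " Be concise."   # toy stand-in only

def evaluate_prompt(prompt: str, rows: list) -> float:
    return random.random()       # a real trial runs inference + eval templates

def optimize(prompt, rows, n_trials, state_path="optimizer_state.json"):
    rows = rows[:50]                                 # at most 50 rows per trial
    best, best_score = prompt, evaluate_prompt(prompt, rows)   # baseline trial
    history = [{"prompt": prompt, "score": best_score}]
    for _ in range(n_trials):
        candidate = ask_teacher(f"Improve this prompt:\n{best}")
        score = evaluate_prompt(candidate, rows)
        history.append({"prompt": candidate, "score": score})
        if score > best_score:
            best, best_score = candidate, score
        with open(state_path, "w") as f:   # persist state after every trial so
            json.dump(history, f)          # an interrupted run can be resumed
    return best, best_score
```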
Supported Optimizers
All six optimization algorithms are available for dataset optimization:

| Optimizer | Speed | Quality | Best For |
|---|---|---|---|
| Random Search | ⚡⚡⚡ | ⭐⭐ | Quick baselines |
| Bayesian Search | ⚡⚡ | ⭐⭐⭐⭐ | Few-shot learning tasks |
| Meta-Prompt | ⚡⚡ | ⭐⭐⭐⭐ | Complex reasoning tasks |
| ProTeGi | ⚡ | ⭐⭐⭐⭐ | Fixing specific error patterns |
| PromptWizard | ⚡ | ⭐⭐⭐⭐ | Creative/open-ended exploration |
| GEPA | ⚡ | ⭐⭐⭐⭐⭐ | Production deployments |

Speed: ⚡ = Slow, ⚡⚡ = Medium, ⚡⚡⚡ = Fast
Quality: ⭐ = Basic, ⭐⭐⭐⭐⭐ = Excellent
When to Use Each Optimizer
Random Search – Quick Baseline
Best for: Getting a fast baseline to compare against.

Generates random prompt variations using a teacher model. No learning from previous attempts, but very fast.
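As a sketch, random search is little more than independent draws from the teacher model, keeping the best scorer. It reuses the hypothetical `ask_teacher` and `evaluate_prompt` helpers from the loop sketch above:

```python
def random_search(base_prompt, rows, n_trials):
    # Every candidate is generated from the original prompt alone; no feedback
    # from earlier trials is used, which is what makes this fast but basic.
    candidates = [ask_teacher(f"Rewrite this prompt a different way:\n{base_prompt}")
                  for _ in range(n_trials)]
    return max(candidates, key=lambda c: evaluate_prompt(c, rows))
```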
Bayesian Search – Few-Shot Learning
Best for: Tasks that benefit from few-shot examples in the prompt.

Uses Bayesian optimization to intelligently select the best combination and number of few-shot examples.
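The search space here is which examples to include and how many. The sketch below explores that space with random sampling for simplicity; a real Bayesian optimizer would fit a surrogate model over past trials to pick the next configuration. Helpers as above:

```python
import random

def few_shot_search(base_prompt, examples, rows, n_trials):
    # examples: a list of pre-formatted few-shot example strings.
    # Each trial picks a count k and a specific subset; a true Bayesian
    # optimizer replaces random.sample with a surrogate-guided choice.
    best, best_score = base_prompt, evaluate_prompt(base_prompt, rows)
    for _ in range(n_trials):
        k = random.randint(1, min(5, len(examples)))
        candidate = base_prompt + "\n\nExamples:\n" + "\n".join(random.sample(examples, k))
        score = evaluate_prompt(candidate, rows)
        if score > best_score:
            best, best_score = candidate, score
    return best
```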
Meta-Prompt – Deep Reasoning
Best for: Complex tasks where the prompt needs holistic redesign.

Analyzes failed examples, formulates hypotheses about what went wrong, and rewrites the entire prompt.
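The distinctive step is building a meta-prompt that shows the teacher model the failures and asks for a hypothesis plus a full rewrite. A hedged sketch, with illustrative meta-prompt wording and the same hypothetical helpers:

```python
def meta_prompt_step(prompt, failures):
    # failures: (input, bad_output) pairs gathered from low-scoring rows.
    report = "\n".join(f"Input: {i}\nOutput: {o}" for i, o in failures)
    meta = (
        "The prompt below produced the failures shown after it.\n"
        f"Prompt:\n{prompt}\n\nFailures:\n{report}\n\n"
        "Hypothesize what went wrong, then rewrite the entire prompt to fix it."
    )
    return ask_teacher(meta)   # returns a holistically redesigned prompt
```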
ProTeGi – Error-Driven Fixes
Best for: When you can identify clear failure patterns in outputs.

Generates critiques of failures and applies targeted improvements using beam search.
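One round of the critique-then-edit cycle might look like this sketch, where a small beam of prompts is expanded with targeted fixes and only the top scorers survive (helpers as above):

```python
def protegi_round(beam, rows, beam_width=3):
    # Expand each prompt in the beam with a critique-driven edit, then keep
    # only the top-scoring prompts: beam search over the space of edits.
    expanded = list(beam)
    for prompt in beam:
        critique = ask_teacher(f"Critique this prompt's failures:\n{prompt}")
        expanded.append(ask_teacher(f"Apply this critique:\n{critique}\n\nto:\n{prompt}"))
    expanded.sort(key=lambda p: evaluate_prompt(p, rows), reverse=True)
    return expanded[:beam_width]
```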
PromptWizard – Creative Exploration
Best for: Open-ended tasks where you want diverse prompt variations.

Combines mutation with different "thinking styles", then critiques and refines top performers.
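In sketch form, the mutation step pairs the prompt with different thinking styles to diversify candidates, and the best ones are then critiqued and refined. The style list is an illustrative assumption (helpers as above):

```python
THINKING_STYLES = [              # illustrative examples of "thinking styles"
    "think step by step",
    "consider edge cases first",
    "reason by analogy",
]

def promptwizard_round(prompt, rows, top_k=2):
    # Mutate with each style, keep the top performers, then critique-and-refine.
    mutants = [ask_teacher(f"Rewrite this prompt to make the model {style}:\n{prompt}")
               for style in THINKING_STYLES]
    mutants.sort(key=lambda p: evaluate_prompt(p, rows), reverse=True)
    return [ask_teacher(f"Critique and refine this prompt:\n{p}") for p in mutants[:top_k]]
```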
GEPA – Production-Grade
Best for: Production deployments requiring the highest quality results.

Uses evolutionary algorithms with reflective learning and mutation strategies. State-of-the-art performance.
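At a very high level, an evolutionary round maintains a population of prompts, selects parents by score, and produces offspring through reflective mutation. This sketch compresses GEPA's actual machinery considerably and should be read as a shape, not a specification (helpers as above):

```python
import random

def gepa_generation(population, rows, size=6):
    # Score the population, keep the top half as parents, and fill the rest
    # with reflective mutations (the teacher reasons about results first).
    scored = sorted(population, key=lambda p: evaluate_prompt(p, rows), reverse=True)
    parents = scored[: max(2, len(scored) // 2)]
    children = [ask_teacher(f"Reflect on this prompt's performance, then mutate it:\n"
                            f"{random.choice(parents)}")
                for _ in range(size - len(parents))]
    return parents + children            # survivors plus mutated offspring
```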
Best Practices
Dataset Size
- Optimal range: 15–50 rows for optimization.
- The system evaluates up to 50 rows per trial for efficiency.
- Smaller datasets run faster but may produce less reliable scores.
Evaluation Templates
- Use 1–3 evaluation templates that directly measure what matters for your task.
- Avoid conflicting evaluations that may confuse the optimizer.
- Use clear pass/fail criteria where possible (see the sketch after this list).
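A pass/fail criterion can be expressed as a binary score, which gives the optimizer an unambiguous signal to climb. A hypothetical sketch:

```python
# Hypothetical pass/fail template: 1.0 if the output meets all hard
# requirements, 0.0 otherwise. Binary signals are easy to optimize against.
def follows_format(output: str, reference: str) -> float:
    passes = output.strip().startswith("Summary:") and len(output.split()) <= 120
    return 1.0 if passes else 0.0
```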
Choosing an Optimizer

Start with Random Search to establish a quick baseline, then move up the quality column of the table above once your evaluation templates are stable: Bayesian Search or Meta-Prompt for a balance of speed and quality, ProTeGi for known error patterns, and GEPA for production deployments.
Troubleshooting
No improvement after optimization
Cause: Dataset may be too small or not diverse enough.

Solution: Use more diverse examples (15–50 rows recommended) and ensure your evaluation templates clearly distinguish good from bad outputs.
High score variance between trials
Cause: Inconsistent or conflicting evaluation templates.

Solution: Simplify your evaluations – use 1–2 clear templates instead of many overlapping ones.
Optimization running too slowly
Cause: Too many dataset rows or a slow optimizer.

Solution: Reduce dataset size, or switch to a faster optimizer like Random Search or Bayesian Search.
Run failed mid-way
Cause: API errors or timeout.

Solution: Create a new run with the same configuration – the system can resume from where it left off using persisted optimizer state.