Dataset Optimization is the process of automatically improving prompt templates stored within your datasets. It uses the same optimization algorithms as Prompt Optimization but applies them to dataset columns rather than agent system prompts. This is useful when you want to systematically enhance the quality of prompts in your training or evaluation data using a structured, data-driven approach.

How It Differs from Prompt Optimization

🎯 Prompt Optimization

  • Targets agent system prompts
  • Uses simulation (conversational) evaluation
  • Best for improving agent behavior
  • Input: Agent configuration

📊 Dataset Optimization

  • Targets dataset column prompts
  • Uses direct (input/output) evaluation
  • Best for improving training & eval data
  • Input: Dataset column + evaluation templates

Key Concepts

Optimization Run

An optimization run connects the following components (a configuration sketch follows the list):
  • 📋 Column – The dataset column containing the prompt template to optimize.
  • ⚙️ Optimizer Algorithm – The strategy used to find better prompts (e.g., Bayesian Search, ProTeGi).
  • 📝 Evaluation Templates – The evaluations used to score how well each prompt variation performs.
  • 🧠 Teacher Model – The LLM used for optimization decisions (generating new prompt candidates).
  • 🚀 Inference Model – The LLM used to execute prompts and generate outputs during each trial.
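The components above can be pictured as a single run configuration. The sketch below is illustrative only: the optimizer_algorithm and optimizer_config fields appear elsewhere on this page, while the remaining field names (dataset_id, column, evaluation_templates, teacher_model, inference_model) and values are assumptions used to show how the pieces relate, not the platform's actual schema.

# Hypothetical run configuration showing how the components fit together.
# Only "optimizer_algorithm" and "optimizer_config" are taken from this page;
# every other field name and value is an assumption for illustration.
optimization_run = {
    "dataset_id": "ds_123",                      # assumed identifier
    "column": "prompt_template",                 # dataset column to optimize
    "optimizer_algorithm": "bayesian",           # see "Supported Optimizers" below
    "optimizer_config": {"n_trials": 20, "min_examples": 2, "max_examples": 8},
    "evaluation_templates": ["summary_quality", "context_adherence"],
    "teacher_model": "teacher-llm",              # assumed model name
    "inference_model": "inference-llm",          # assumed model name
}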

Trials

Each optimization run consists of multiple trials:
  • 📌 Baseline Trial – The original prompt is evaluated first to establish a baseline score.
  • 🔄 Variation Trials – New prompt variations generated by the optimizer algorithm.
  • Each trial receives an average score based on the configured evaluation templates (see the sketch below).
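As a rough illustration of how a trial's score is derived, the snippet below averages per-template scores into a single trial score. The record layout, prompts, and score values are hypothetical; they only show the relationship between evaluation templates, trial scores, and the baseline.

# Hypothetical trial records; field names, prompts, and scores are illustrative only.
baseline_trial = {
    "prompt": "Summarize the following text:",
    "scores": {"summary_quality": 0.62, "context_adherence": 0.70},
}
variation_trial = {
    "prompt": "Summarize the text below in 2-3 sentences, using only facts stated in it:",
    "scores": {"summary_quality": 0.81, "context_adherence": 0.88},
}

def average_score(trial):
    # A trial's score is the mean of its evaluation template scores.
    scores = trial["scores"].values()
    return sum(scores) / len(scores)

print(average_score(baseline_trial))   # 0.66
print(average_score(variation_trial))  # 0.845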

Evaluation Templates

Evaluation templates define how each prompt variation is scored across dataset rows. You can use:
  • ✅ Built-in templates – Pre-configured evaluations like summary_quality, context_adherence, tone, and more.
  • 🛠️ Custom templates – Define your own evaluation criteria tailored to your specific use case.
You can browse and create evaluation templates from the Evaluations section of the platform.
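As a loose sketch, a run's evaluation setup might mix built-in templates (the names summary_quality, context_adherence, and tone come from this page) with a custom one. The custom-template structure shown here is an assumption, not the platform's actual template schema.

# Built-in template names are from this page; the custom template structure
# below is hypothetical and only illustrates "define your own criteria".
evaluation_templates = [
    "summary_quality",      # built-in
    "context_adherence",    # built-in
    {
        "name": "brand_tone_check",  # assumed custom template
        "criteria": "Pass if the output uses a friendly, non-promotional tone; fail otherwise.",
        "scoring": "pass_fail",      # clear pass/fail criteria, per Best Practices below
    },
]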

How It Works

1. Create an Optimization Run

Navigate to your dataset and click the Optimize button in the top action bar. Select the column containing the prompt template you want to optimize, choose an optimizer algorithm, configure the teacher and inference models, and select your evaluation templates.
2. Baseline Evaluation

The system evaluates your original prompt against the dataset to establish a baseline score. This score serves as the benchmark for measuring improvements.
3. Optimization Loop

The optimizer generates new prompt variations, runs them against dataset rows (up to 50) using the inference model, and scores them using your evaluation templates. Each variation is saved as a trial with its results.
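Conceptually, one pass of this loop looks like the sketch below. This is not the platform's code: the stubbed helpers merely stand in for the teacher model, the inference model, and the evaluation templates.

# Conceptual sketch of the optimization loop; all helpers are placeholders.
import random

def teacher_generate_variation(prompt):
    # Placeholder: a real teacher model would rewrite the prompt.
    return prompt + " Be concise and factual."

def evaluate(prompt, rows):
    # Placeholder: a real trial runs the inference model on each row and
    # scores the outputs with the configured evaluation templates.
    return {"prompt": prompt, "score": random.random()}

def run_optimization(prompt, rows, num_trials=5):
    rows = rows[:50]                               # at most 50 rows per trial
    trials = [evaluate(prompt, rows)]              # baseline trial comes first
    for _ in range(num_trials):
        candidate = teacher_generate_variation(prompt)
        trials.append(evaluate(candidate, rows))   # each variation is saved as a trial
    return max(trials, key=lambda t: t["score"])   # best-performing prompt

best = run_optimization("Summarize the following text:", rows=[{"input": "example row"}])
print(best["prompt"])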
4. Review & Deploy

Compare the baseline against all optimized variations. The system highlights the best-performing prompt with measurable score improvements. 🎉
Optimizations can be paused and resumed – the optimizer state is persisted after each trial, so you won't lose progress if a run is interrupted.
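For the review step, the improvement is simply the best trial's score relative to the baseline. The helper below is a hypothetical illustration of that comparison, not a platform API.

# Hypothetical comparison of the baseline against the best variation.
def summarize_improvement(baseline_score, trial_scores):
    best = max(trial_scores)
    gain = best - baseline_score
    return {"baseline": baseline_score, "best": best, "absolute_gain": round(gain, 3)}

print(summarize_improvement(0.66, [0.71, 0.845, 0.79]))
# {'baseline': 0.66, 'best': 0.845, 'absolute_gain': 0.185}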

Supported Optimizers

All six optimization algorithms are available for dataset optimization:
Optimizer | Speed | Quality | Best For
Random Search | ⚡⚡⚡ | ⭐⭐ | Quick baselines
Bayesian Search | ⚡⚡ | ⭐⭐⭐⭐ | Few-shot learning tasks
Meta-Prompt | ⚡⚡ | ⭐⭐⭐⭐ | Complex reasoning tasks
ProTeGi | ⚡ | ⭐⭐⭐⭐ | Fixing specific error patterns
PromptWizard | ⚡ | ⭐⭐⭐⭐ | Creative/open-ended exploration
GEPA | ⚡ | ⭐⭐⭐⭐⭐ | Production deployments
Speed: ⚡ = Slow, ⚡⚡ = Medium, ⚡⚡⚡ = Fast. Quality: ⭐ = Basic, ⭐⭐⭐⭐⭐ = Excellent.
Start with Random Search for a quick baseline, then use a more advanced optimizer like ProTeGi or GEPA to push for higher quality. 🚀
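If you follow that two-stage approach, the two runs might differ only in their optimizer fields, as in the hedged example below. The optimizer fields come from this page; the shared base_run fields are illustrative, not the platform's actual schema.

# Two-stage approach: a quick Random Search baseline, then a GEPA run.
# Optimizer fields are from this page; the base_run dict is illustrative.
base_run = {"column": "prompt_template", "evaluation_templates": ["summary_quality"]}

baseline_run = {**base_run,
                "optimizer_algorithm": "random_search",
                "optimizer_config": {"num_variations": 5}}

quality_run = {**base_run,
               "optimizer_algorithm": "gepa",
               "optimizer_config": {"max_metric_calls": 150}}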
For detailed information on each algorithm, see the Optimization Algorithms page.

When to Use Each Optimizer

Random Search
Best for: Getting a fast baseline to compare against. Generates random prompt variations using a teacher model. No learning from previous attempts, but very fast.
"optimizer_algorithm": "random_search"
"optimizer_config": { "num_variations": 5 }

Bayesian Search
Best for: Tasks that benefit from few-shot examples in the prompt. Uses Bayesian optimization to intelligently select the best combination and number of few-shot examples.
"optimizer_algorithm": "bayesian"
"optimizer_config": { "n_trials": 20, "min_examples": 2, "max_examples": 8 }

Meta-Prompt
Best for: Complex tasks where the prompt needs holistic redesign. Analyzes failed examples, formulates hypotheses about what went wrong, and rewrites the entire prompt.
"optimizer_algorithm": "metaprompt"
"optimizer_config": { "num_rounds": 5 }

ProTeGi
Best for: When you can identify clear failure patterns in outputs. Generates critiques of failures and applies targeted improvements using beam search.
"optimizer_algorithm": "protegi"
"optimizer_config": { "num_rounds": 3, "beam_size": 3, "num_gradients": 4 }

PromptWizard
Best for: Open-ended tasks where you want diverse prompt variations. Combines mutation with different "thinking styles", then critiques and refines top performers.
"optimizer_algorithm": "promptwizard"
"optimizer_config": { "refine_iterations": 2, "mutate_rounds": 3 }

GEPA
Best for: Production deployments requiring the highest quality results. Uses evolutionary algorithms with reflective learning and mutation strategies. State-of-the-art performance.
"optimizer_algorithm": "gepa"
"optimizer_config": { "max_metric_calls": 150 }

Best Practices

📦 Dataset Size

  • Optimal range: 15–50 rows for optimization.
  • The system evaluates up to 50 rows per trial for efficiency.
  • Smaller datasets run faster but may produce less reliable scores.

πŸ“ Evaluation Templates

  • Use 1–3 evaluation templates that directly measure what matters for your task.
  • Avoid conflicting evaluations that may confuse the optimizer.
  • Use clear pass/fail criteria where possible.

🧭 Choosing an Optimizer

Do you need production-grade results?
├─ Yes → Use GEPA 🧬
└─ No
   │
   Do you have few-shot examples?
   ├─ Yes → Use Bayesian Search 📈
   └─ No
      │
      Is your task reasoning-heavy?
      ├─ Yes → Use Meta-Prompt 🧠
      └─ No
         │
         Do you have clear failure patterns?
         ├─ Yes → Use ProTeGi 🔬
         └─ No → Use Random Search 🎲 (baseline)
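The same decision tree can be written as straightforward conditional logic. This is just a restatement of the guidance above, not a platform API.

def choose_optimizer(production_grade, has_few_shot_examples,
                     reasoning_heavy, clear_failure_patterns):
    # Direct translation of the decision tree above.
    if production_grade:
        return "gepa"
    if has_few_shot_examples:
        return "bayesian"
    if reasoning_heavy:
        return "metaprompt"
    if clear_failure_patterns:
        return "protegi"
    return "random_search"  # quick baseline

print(choose_optimizer(False, True, False, False))  # "bayesian"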

Troubleshooting

  • Cause: Dataset may be too small or not diverse enough. Solution: Use more diverse examples (15–50 rows recommended) and ensure your evaluation templates clearly distinguish good from bad outputs.
  • Cause: Inconsistent or conflicting evaluation templates. Solution: Simplify your evaluations – use 1–2 clear templates instead of many overlapping ones.
  • Cause: Too many dataset rows or a slow optimizer. Solution: Reduce dataset size, or switch to a faster optimizer like Random Search or Bayesian Search.
  • Cause: API errors or timeout. Solution: Create a new run with the same configuration – the system can resume from where it left off using persisted optimizer state.

Next Steps