Experiments in Dataset
Test, validate, and compare different prompt and model configurations
What it is
Experiments give you a structured way to answer questions like: Which prompt performs better? Which model gives the best results for my use case? You test different prompt and model combinations on the same dataset, score the outputs with evals, and compare results side by side — so you can make data-driven decisions instead of guessing.
Use cases
- Compare prompts – Run different prompt templates on the same rows and see which produces better answers or scores.
- Compare models – Run the same prompt with multiple models (or custom models) and compare quality, speed, or cost.
- Validate before rollout – Test prompt and model changes on a dataset before using them in production.
- Optimize with evals – Add built-in or custom evals and use scores to rank prompt–model combinations and pick a winner.
How to
You pick a base column (the generated responses you want to compare against), add one or more prompt templates (each with one or more models), attach evals, and run. The system generates responses for each prompt–model pair, runs the evals, and surfaces scores and comparisons so you can choose the best setup.
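The setup above boils down to a small amount of structure: a baseline column, a set of prompt templates each paired with one or more models, and a list of evals. As an illustration only (the field names below are assumptions, not the product's actual API), an experiment configuration and its resulting prompt–model pairs can be sketched as:

```python
# Hypothetical sketch of an experiment configuration; the field names
# are assumptions for illustration, not the product's real schema.
experiment = {
    "name": "support-answer-quality",   # experiment name
    "base_column": "baseline_responses",  # column whose outputs are the baseline
    "prompt_templates": [
        {"prompt": "Answer concisely: {question}", "models": ["model-a", "model-b"]},
        {"prompt": "Answer step by step: {question}", "models": ["model-a"]},
    ],
    "evals": ["correctness", "tone"],   # scored on every generated response
}

# Each template is paired with each of its models; the experiment runs
# one generation per (prompt, model) pair per dataset row.
pairs = [
    (t["prompt"], m)
    for t in experiment["prompt_templates"]
    for m in t["models"]
]
print(len(pairs))  # 3 prompt–model pairs
```

Note that each template carries its own model list, so the number of runs is the sum over templates of their model counts, not a full cross product.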
Navigate to Experiments
Click the “Experiments” button in the top-right of the dataset dashboard to open experiments for this dataset.

Create a new experiment
Give the experiment a name and select the base column – the column whose generated responses you want to compare (e.g. an existing run-prompt column). All experiment runs will be evaluated and compared against this baseline.

Prompt template
In the prompt template section, define the prompts and models for the experiment. You can add multiple prompt templates; each can use one or more models so you compare many combinations.

Choose the model type and model(s) you want for the experiment. You can select multiple models to compare, or create a custom model via “Create Custom Model”.
- LLM – text generation (chat). Choose one or more chat models to compare prompt performance.
- Text-to-Speech – generate audio from text. Choose TTS models to compare voice output across prompts.
- Speech-to-Text – transcribe audio into text. Choose STT models to compare transcription quality.
- Image Generation – create images from text (or image + text). Choose image models to compare output quality.

Tip
Click here to learn how to create a custom model.

Use an existing prompt template or create a new one. You can add as many prompt templates as you need.

Tip
Click here to learn more about prompts.
Choosing evals
Experiments compare prompt–model performance using evals. Add the evals you want to run on the generated responses.
Click “Add Evaluation” and pick from existing eval templates or create a custom eval. You can add as many evals as you want.
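A custom eval is ultimately a scoring function over generated responses. As an illustration only (the product's actual eval interface is not shown here, so the signature below is a made-up example), a simple keyword-coverage eval might look like:

```python
def keyword_coverage(response: str, required: list[str]) -> float:
    """Toy custom eval: fraction of required keywords present in the response.

    The signature is illustrative; the product's real eval interface may differ.
    """
    if not required:
        return 0.0
    found = sum(1 for kw in required if kw.lower() in response.lower())
    return found / len(required)

score = keyword_coverage("Refunds are issued within 5 days.", ["refund", "days"])
print(score)  # 1.0
```

Evals like this produce a per-row score, which the experiment then aggregates per prompt–model pair for comparison.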

Run experiment
After configuring prompts, models, and evals, click “Run” to start the experiment. The system will generate responses for each prompt–model pair, run the evals, and show results and comparisons when complete.
Update and re-run
You can change the experiment at any time: edit the name, base column, prompt templates, models, or evals, then save. Use Re-run to run the experiment again with the same or updated config (e.g. after adding rows to the dataset or changing a prompt). Re-run processes all rows again and refreshes the experiment dataset results.

Compare results
When the experiment has finished, use the Compare view to see how each prompt–model combination performed. Set weights for eval scores and metrics (e.g. response time, token usage) to compute an overall ranking. The comparison shows which combination ranks best so you can choose a winner.
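The weighted ranking amounts to a weighted sum of per-eval and per-metric scores for each combination. A minimal sketch with made-up scores and weights (the names and numbers are illustrative, not real experiment output):

```python
# Hypothetical eval scores and metrics per prompt–model combination,
# each normalized to 0..1 before weighting.
results = {
    "prompt-A / model-1": {"correctness": 0.90, "tone": 0.70, "speed": 0.60},
    "prompt-A / model-2": {"correctness": 0.80, "tone": 0.90, "speed": 0.90},
}
weights = {"correctness": 0.5, "tone": 0.3, "speed": 0.2}  # set by the user, sums to 1

def overall(scores: dict) -> float:
    # Weighted sum of the combination's scores.
    return sum(weights[k] * v for k, v in scores.items())

ranked = sorted(results, key=lambda name: overall(results[name]), reverse=True)
print(ranked[0])  # highest-ranked combination
```

Shifting the weights (say, toward speed) can change which combination wins, which is why the comparison view lets you tune them before picking a winner.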
What you can do next
Add Rows to Dataset
Add individual records or bulk import data rows to your dataset
Add Columns to Dataset
Extend your dataset structure with additional data fields
Run Prompts
Test and execute prompts against your dataset entries
Annotate Dataset
Add metadata and annotations to enrich your dataset
Create New Dataset
Create another dataset using SDK, file upload, or synthetic generation