Experiments in a Dataset

To test, validate, and compare different prompt configurations

What it is

Experiments give you a structured way to answer questions like: Which prompt performs better? Which model gives the best results for my use case? You test different prompt and model combinations on the same dataset, score the outputs with evals, and compare results side by side — so you can make data-driven decisions instead of guessing.

Use cases

  • Compare prompts – Run different prompt templates on the same rows and see which produces better answers or scores.
  • Compare models – Run the same prompt with multiple models (or custom models) and compare quality, speed, or cost.
  • Validate before rollout – Test prompt and model changes on a dataset before using them in production.
  • Optimize with evals – Add built-in or custom evals and use scores to rank prompt–model combinations and pick a winner.

How it works

You pick a base column (the generated responses you want to compare against), add one or more prompt templates (each with one or more models), attach evals, and run. The system generates responses for each prompt–model pair, runs the evals, and surfaces scores and comparisons so you can choose the best setup.
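The prompt–model fan-out described above can be sketched in Python. Everything here — the template strings, model names, and the toy eval — is an illustrative assumption, not this product's API:

```python
from itertools import product

# Illustrative sketch of the experiment fan-out: every prompt template is
# paired with every selected model, each pair generates a response per
# dataset row, and each response is scored by an eval.
prompt_templates = ["Summarize: {text}", "TL;DR: {text}"]
models = ["model-a", "model-b"]
rows = [{"text": "hello world"}, {"text": "lorem ipsum"}]

def generate(prompt: str, model: str, row: dict) -> str:
    # Stand-in for the real generation call.
    return f"{model} :: {prompt.format(**row)}"

def toy_eval(response: str) -> float:
    # Toy eval for the sketch; real evals score quality, not length.
    return float(len(response))

results = {}
for prompt, model in product(prompt_templates, models):
    scores = [toy_eval(generate(prompt, model, row)) for row in rows]
    results[(prompt, model)] = sum(scores) / len(scores)
# `results` now holds one mean score per prompt–model combination,
# ready to be compared side by side.
```

With 2 templates and 2 models, the loop produces 4 scored combinations — the same cartesian product the experiment runs for you.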

Navigate to Experiments

Click the “Experiments” button (e.g. in the top-right of the dataset dashboard) to open experiments for this dataset.

Create a new experiment

Give the experiment a name and select the base column – the column whose generated responses you want to compare (e.g. an existing run-prompt column). All experiment runs will be evaluated and compared against this baseline.

Prompt template

In the prompt template section, define the prompts and models for the experiment. You can add multiple prompt templates; each can use one or more models so you compare many combinations.

Choose the model type and model(s) you want for the experiment. You can select multiple models to compare. You can also create a custom model via “Create Custom Model”.

Select LLM for text generation (chat). Choose one or more chat models to compare prompt performance.

Tip

Click here to learn how to create a custom model.

Select Text-to-Speech to generate audio from text. Choose TTS models to compare voice output across prompts.

Select Speech-to-Text to transcribe audio into text. Choose STT models to compare transcription quality.

Select Image Generation to create images from text (or image + text). Choose image models to compare output quality.

Use an existing prompt template or create a new one. You can add as many prompt templates as you need.

Tip

Click here to learn more about prompts.

Choosing evals

Experiments compare prompt–model performance using evals. Add the evals you want to run on the generated responses: click “Add Evaluation” and pick from existing eval templates or create a custom eval. You can add as many evals as you want.
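Conceptually, a custom eval is just a function from a generated response to a score. A minimal sketch — the function name and scoring rule below are made up for illustration, not a built-in eval:

```python
def contains_citation_eval(response: str) -> float:
    """Toy custom eval: 1.0 if the response cites a source, else 0.0.

    Purely illustrative — real custom evals can use regexes, rubrics,
    or LLM-as-judge scoring.
    """
    return 1.0 if "[source]" in response.lower() else 0.0

# Averaging several evals yields one score per response; the second
# eval here is a toy length-based score capped at 1.0.
evals = [contains_citation_eval, lambda r: min(len(r) / 100, 1.0)]
score = sum(e("See [source] for details.") for e in evals) / len(evals)
```

Because each eval returns a number, the experiment can aggregate scores per prompt–model combination and rank them later.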

Run experiment

After configuring prompts, models, and evals, click “Run” to start the experiment. The system will generate responses for each prompt–model pair, run the evals, and show results and comparisons when complete.

Update and re-run

You can change the experiment at any time: edit the name, base column, prompt templates, models, or evals, then save. Use Re-run to run the experiment again with the same or updated config (e.g. after adding rows to the dataset or changing a prompt). Re-run processes all rows again and refreshes the experiment dataset results.

Compare results

When the experiment has finished, use the Compare view to see how each prompt–model combination performed. Set weights for eval scores and metrics (e.g. response time, token usage) to compute an overall ranking. The comparison shows which combination ranks best so you can choose a winner.
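The weighted ranking amounts to a weighted sum over normalized scores. A sketch of that computation — the weights, metric names, and numbers below are invented for illustration:

```python
# Hypothetical weights and per-combination scores, all normalized to 0–1.
weights = {"eval_score": 0.6, "speed": 0.3, "cost": 0.1}

combinations = {
    ("prompt-1", "model-a"): {"eval_score": 0.9, "speed": 0.5, "cost": 0.8},
    ("prompt-1", "model-b"): {"eval_score": 0.7, "speed": 0.9, "cost": 0.9},
}

def overall(metrics: dict) -> float:
    # Weighted sum of the normalized metrics.
    return sum(weights[k] * metrics[k] for k in weights)

# Highest overall score first.
ranked = sorted(combinations, key=lambda c: overall(combinations[c]),
                reverse=True)
winner = ranked[0]
```

Note how the weights change the outcome: model-a has the better eval score, but once speed and cost are weighted in, model-b edges ahead (0.78 vs 0.77) — which is exactly why tuning the weights to your priorities matters before picking a winner.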
