Experimentation: Compare Prompts and Models on a Dataset

Use the Experimentation feature to run multiple prompt variants across different models on the same dataset, evaluate outputs, and pick the winning configuration.

📝 TL;DR

Use Experimentation to test multiple prompt variants and models side by side on the same dataset, evaluate the generated outputs, and pick the best configuration using weighted metric comparison.

Time: 15 min | Difficulty: Intermediate | Package: Platform UI

By the end of this guide you will have run two prompt variants on the same dataset, evaluated the generated outputs for groundedness, and compared results in the Summary tab to pick the best prompt.

Prerequisites
  • FutureAGI account: app.futureagi.com
  • A dataset with at least question, context, and expected_answer columns (follow Step 1 to create one)
  • An LLM API key configured in the platform

Create a test dataset

Go to app.futureagi.com. Select Dataset > Add Dataset > Upload a file (JSON, CSV).

Save as experiment-data.csv and upload:

question,context,expected_answer
"What year was the Eiffel Tower completed?","The Eiffel Tower was completed in 1889 after two years of construction.","1889"
"Who designed the Eiffel Tower?","The Eiffel Tower was designed by engineer Gustave Eiffel and his team.","Gustave Eiffel and his team"
"How tall is the Eiffel Tower?","The Eiffel Tower stands 330 metres tall including the antenna at the top.","330 metres"
"Where is the Eiffel Tower located?","The Eiffel Tower is located on the Champ de Mars in Paris, France.","Champ de Mars in Paris, France"
"How many visitors does the Eiffel Tower receive annually?","The Eiffel Tower receives approximately 7 million visitors per year.","Around 7 million"
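If you prefer to generate experiment-data.csv programmatically rather than pasting it by hand, a minimal Python sketch using only the standard library:

```python
import csv

# The same five rows shown above: (question, context, expected_answer).
rows = [
    ("What year was the Eiffel Tower completed?",
     "The Eiffel Tower was completed in 1889 after two years of construction.",
     "1889"),
    ("Who designed the Eiffel Tower?",
     "The Eiffel Tower was designed by engineer Gustave Eiffel and his team.",
     "Gustave Eiffel and his team"),
    ("How tall is the Eiffel Tower?",
     "The Eiffel Tower stands 330 metres tall including the antenna at the top.",
     "330 metres"),
    ("Where is the Eiffel Tower located?",
     "The Eiffel Tower is located on the Champ de Mars in Paris, France.",
     "Champ de Mars in Paris, France"),
    ("How many visitors does the Eiffel Tower receive annually?",
     "The Eiffel Tower receives approximately 7 million visitors per year.",
     "Around 7 million"),
]

with open("experiment-data.csv", "w", newline="", encoding="utf-8") as f:
    writer = csv.writer(f)
    writer.writerow(["question", "context", "expected_answer"])
    writer.writerows(rows)
```

The csv module handles quoting automatically, so commas inside the context strings stay intact.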

Open the experiment form

Open your dataset → click Experiment in the dataset toolbar.

The Run Experiment drawer opens with the subtitle “Test, validate, and compare different prompt configurations.”

Fill in the top-level fields:

  • Name: prompt-ab-test
  • Select Baseline Column: expected_answer

Configure Prompt Template 1

The first accordion, Prompt Template 1, is already open. Fill in:

  1. Prompt Name: baseline-prompt
  2. Choose a model type: Select LLM (other options: Text-to-Speech, Speech-to-Text, Image Generation)
  3. Models: Select one or more models — e.g. gpt-4o-mini and gpt-4o. You can select multiple models per prompt to compare model performance too.
  4. Write the prompt messages:

System message:

You are a helpful assistant. Answer questions using only the provided context.

User message:

Context: {{context}}
Question: {{question}}

Tip

Use {{column_name}} to reference dataset columns in your prompt. The platform auto-detects variables from your messages.
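The auto-detection can be pictured as a simple placeholder scan over the messages. This is an illustrative sketch, not the platform's actual implementation:

```python
import re

def detect_variables(messages):
    """Collect unique {{column_name}} placeholders, in order of appearance."""
    pattern = re.compile(r"\{\{\s*(\w+)\s*\}\}")
    found = []
    for msg in messages:
        for name in pattern.findall(msg):
            if name not in found:
                found.append(name)
    return found

messages = [
    "You are a helpful assistant. Answer questions using only the provided context.",
    "Context: {{context}}\nQuestion: {{question}}",
]
print(detect_variables(messages))  # ['context', 'question']
```

Any detected name must match a column in your dataset, or the variable cannot be filled at run time.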

Add Prompt Template 2

Click Add Another Prompt.

A new Prompt Template 2 accordion appears. Fill in:

  1. Prompt Name: cot-prompt
  2. Choose a model type: LLM
  3. Models: Select the same models (gpt-4o-mini, gpt-4o)
  4. Write the prompt messages:

System message:

You are a precise question-answering assistant. Use only the information provided in the context — do not add any external knowledge.

User message:

Step 1: Read the context carefully.
Step 2: Identify the specific fact that answers the question.
Step 3: Write a concise answer using only that fact.

Context: {{context}}
Question: {{question}}

Answer:
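At run time, each {{variable}} is filled with the matching column value from the dataset row. A minimal rendering sketch (again illustrative, not the platform's code):

```python
import re

def render(template, row):
    """Substitute {{column}} placeholders with values from a dataset row."""
    return re.sub(r"\{\{\s*(\w+)\s*\}\}", lambda m: str(row[m.group(1)]), template)

row = {
    "context": "The Eiffel Tower was completed in 1889 after two years of construction.",
    "question": "What year was the Eiffel Tower completed?",
}
print(render("Context: {{context}}\nQuestion: {{question}}", row))
```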

Run the experiment

Click Run.

The platform runs both prompt templates across all selected models on every dataset row and generates outputs.
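With two prompt templates, two models, and five rows, that is a full cross product of 20 generations:

```python
from itertools import product

prompts = ["baseline-prompt", "cot-prompt"]
models = ["gpt-4o-mini", "gpt-4o"]
num_rows = 5  # rows in experiment-data.csv

runs = list(product(prompts, models, range(num_rows)))
print(len(runs))  # 2 prompts x 2 models x 5 rows = 20
```

Keep this multiplier in mind when sizing datasets: every prompt or model you add scales the number of LLM calls (and cost) accordingly.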

Evaluate the generated outputs

Once the experiment finishes, go to the Data tab in the experiment detail view.

  1. Click Evaluate (top-right of the Data tab)
  2. The Evaluation drawer opens — add groundedness
  3. Map keys: output → the generated output column, context → context, input → question
  4. Run the evaluation

Eval scores appear as grouped columns under the evaluation metric name (e.g. groundedness). Within each group, each prompt variant’s score is shown side by side — e.g. groundedness-baseline-prompt-gpt-4o-mini, groundedness-cot-prompt-gpt-4o-mini — so you can compare scores across variants at a glance.
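Judging from the examples above, the grouped column names appear to follow a metric-prompt-model pattern; expanding that pattern for this experiment would give:

```python
metric = "groundedness"
prompts = ["baseline-prompt", "cot-prompt"]
models = ["gpt-4o-mini", "gpt-4o"]

# One score column per prompt/model combination under the metric group.
columns = [f"{metric}-{p}-{m}" for p in prompts for m in models]
print(columns)
```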

Note

Evaluations run on the experiment’s generated output columns — not on the original dataset columns. You run evals after the experiment completes, on the outputs it produced.

Compare results in the Summary tab

Switch to the Summary tab to see:

  • Summary table — aggregate scores per prompt variant and model, including average response time, total tokens, and completion tokens
  • Spider chart — visual comparison of all evaluation metrics across variants
  • Evaluation charts — per-metric score distribution across prompt/model combinations

Pick the winner

  1. Click Choose Winner (crown icon) in the Summary tab
  2. The Winner Settings drawer opens — adjust importance weights (0–10 scale) for:
    • Evaluation metrics (e.g. groundedness)
    • Average Response Time
    • Completion Tokens
    • Total Tokens
  3. Click Save & Run

The winning variant is ranked first in the summary table.
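The weighted comparison can be pictured as a normalise-and-sum over the metrics listed above. This is an illustrative sketch with made-up scores, not the platform's actual formula; latency and token counts are inverted after normalisation so that higher is always better:

```python
def pick_winner(variants, weights):
    """Rank variants by a weighted sum of min-max-normalised metrics.

    variants: {name: {metric: value}}; weights: {metric: 0-10 importance}.
    Metrics in LOWER_IS_BETTER are inverted so higher always means better.
    """
    LOWER_IS_BETTER = {"avg_response_time", "completion_tokens", "total_tokens"}
    scores = {name: 0.0 for name in variants}
    for metric, weight in weights.items():
        vals = [v[metric] for v in variants.values()]
        lo, hi = min(vals), max(vals)
        for name, v in variants.items():
            norm = 0.5 if hi == lo else (v[metric] - lo) / (hi - lo)
            if metric in LOWER_IS_BETTER:
                norm = 1.0 - norm
            scores[name] += weight * norm
    return sorted(scores, key=scores.get, reverse=True)

# Hypothetical aggregate scores for two of the variants.
variants = {
    "baseline-prompt/gpt-4o-mini": {"groundedness": 0.82, "avg_response_time": 1.1,
                                    "completion_tokens": 40, "total_tokens": 160},
    "cot-prompt/gpt-4o-mini": {"groundedness": 0.95, "avg_response_time": 1.6,
                               "completion_tokens": 70, "total_tokens": 210},
}
weights = {"groundedness": 8, "avg_response_time": 2,
           "completion_tokens": 1, "total_tokens": 1}
print(pick_winner(variants, weights)[0])  # 'cot-prompt/gpt-4o-mini'
```

At these weights the groundedness edge of the chain-of-thought prompt outweighs its extra latency and token usage; shift the weights toward cost and speed and the ranking can flip.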

What you built

You can now run prompt A/B tests across multiple models, evaluate outputs, and pick the winning variant using weighted metrics.

  • Created a 5-row test dataset with questions, context, and expected answers
  • Configured two prompt variants (baseline and chain-of-thought) across multiple models
  • Ran the experiment to generate outputs for every prompt/model/row combination
  • Evaluated generated outputs for groundedness from the Data tab
  • Compared aggregate scores and performance in the Summary tab with spider charts
  • Selected the winning prompt using weighted metric comparison