Experimentation: Compare Prompts and Models on a Dataset
Use Experimentation to run multiple prompt variants and models side by side on the same dataset, evaluate the generated outputs, and pick the best configuration using weighted metric comparison.
| Time | Difficulty | Package |
|---|---|---|
| 15 min | Intermediate | Platform UI |
By the end of this guide you will have run two prompt variants on the same dataset, evaluated the generated outputs for groundedness, and compared results in the Summary tab to pick the best prompt.
- FutureAGI account: app.futureagi.com
- A dataset with at least `question`, `context`, and `expected_answer` columns (follow Step 1 to create one)
- An LLM API key configured in the platform
Create a test dataset
Go to app.futureagi.com. Select Dataset > Add Dataset > Upload a file (JSON, CSV).
Save as `experiment-data.csv` and upload:

```csv
question,context,expected_answer
"What year was the Eiffel Tower completed?","The Eiffel Tower was completed in 1889 after two years of construction.","1889"
"Who designed the Eiffel Tower?","The Eiffel Tower was designed by engineer Gustave Eiffel and his team.","Gustave Eiffel and his team"
"How tall is the Eiffel Tower?","The Eiffel Tower stands 330 metres tall including the antenna at the top.","330 metres"
"Where is the Eiffel Tower located?","The Eiffel Tower is located on the Champ de Mars in Paris, France.","Champ de Mars in Paris, France"
"How many visitors does the Eiffel Tower receive annually?","The Eiffel Tower receives approximately 7 million visitors per year.","Around 7 million"
```

Open the experiment form
Open your dataset → click Experiment in the dataset toolbar.
The Run Experiment drawer opens with the subtitle “Test, validate, and compare different prompt configurations.”
Fill in the top-level fields:
| Field | Value |
|---|---|
| Name | prompt-ab-test |
| Select Baseline Column | expected_answer |
Configure Prompt Template 1
The Prompt Template 1 accordion is already open. Fill in:
- Prompt Name: `baseline-prompt`
- Choose a model type: Select LLM (other options: Text-to-Speech, Speech-to-Text, Image Generation)
- Models: Select one or more models, e.g. `gpt-4o-mini` and `gpt-4o`. You can select multiple models per prompt to compare model performance too.
- Write the prompt messages:

System message:

```
You are a helpful assistant. Answer questions using only the provided context.
```

User message:

```
Context: {{context}}

Question: {{question}}
```

Tip

Use `{{column_name}}` to reference dataset columns in your prompt. The platform auto-detects variables from your messages.
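The `{{column_name}}` substitution behaves like plain string templating: for each row, every placeholder is replaced by that row's column value. As an illustration only (this is not the platform's code), a minimal Python sketch:

```python
import re

def render_prompt(template: str, row: dict) -> str:
    """Replace each {{column_name}} placeholder with that column's value from the row."""
    return re.sub(r"\{\{(\w+)\}\}", lambda m: str(row[m.group(1)]), template)

# One row from the dataset created in Step 1.
row = {
    "question": "What year was the Eiffel Tower completed?",
    "context": "The Eiffel Tower was completed in 1889 after two years of construction.",
}
user_message = render_prompt("Context: {{context}}\n\nQuestion: {{question}}", row)
# user_message now contains the row's context and question in place of the placeholders
```

This also shows why the platform can auto-detect variables: the placeholder names in the messages are exactly the column names it needs from the dataset.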
Add Prompt Template 2
Click Add Another Prompt.
A new Prompt Template 2 accordion appears. Fill in:
- Prompt Name: `cot-prompt`
- Choose a model type: LLM
- Models: Select the same models (`gpt-4o-mini`, `gpt-4o`)
- Write the prompt messages:

System message:

```
You are a precise question-answering assistant. Use only the information provided in the context — do not add any external knowledge.
```

User message:

```
Step 1: Read the context carefully.
Step 2: Identify the specific fact that answers the question.
Step 3: Write a concise answer using only that fact.

Context: {{context}}

Question: {{question}}

Answer:
```

Run the experiment
Click Run.
The platform runs both prompt templates across all selected models on every dataset row and generates outputs.
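Concretely, the run is a cartesian product: each prompt template is paired with each of its selected models and applied to every dataset row. With the setup above, the enumeration looks like this (an illustrative sketch, not platform code):

```python
from itertools import product

prompts = ["baseline-prompt", "cot-prompt"]
models = ["gpt-4o-mini", "gpt-4o"]
rows = list(range(5))  # the 5 dataset rows from Step 1

# Every (prompt, model, row) combination yields one generated output.
runs = list(product(prompts, models, rows))
print(len(runs))  # 2 prompts x 2 models x 5 rows = 20 generations
```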
Evaluate the generated outputs
Once the experiment finishes, go to the Data tab in the experiment detail view.
- Click Evaluate (top-right of the Data tab)
- The Evaluation drawer opens — add `groundedness`
- Map keys: `output` → the generated output column, `context` → `context`, `input` → `question`
- Run the evaluation
Eval scores appear as grouped columns under the evaluation metric name (e.g. groundedness). Within each group, each prompt variant’s score is shown side by side — e.g. `groundedness-baseline-prompt-gpt-4o-mini`, `groundedness-cot-prompt-gpt-4o-mini` — so you can compare scores across variants at a glance.
Note
Evaluations run on the experiment’s generated output columns — not on the original dataset columns. You run evals after the experiment completes, on the outputs it produced.
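To see why the key mapping matters, here is a toy sketch of how mapped columns flow into an eval. The scoring heuristic and the output column name `baseline-prompt-gpt-4o-mini` are assumptions for illustration only; the platform's real groundedness eval is model-based, not a word-overlap check:

```python
def toy_groundedness(output: str, context: str) -> float:
    """Toy heuristic: fraction of output words that also appear in the context.
    Illustrative only — not the platform's actual groundedness metric."""
    words = output.lower().split()
    if not words:
        return 0.0
    return sum(word in context.lower() for word in words) / len(words)

# The key mapping routes experiment columns into the eval's expected inputs.
key_map = {"output": "baseline-prompt-gpt-4o-mini", "context": "context", "input": "question"}
row = {
    "question": "How tall is the Eiffel Tower?",
    "context": "The Eiffel Tower stands 330 metres tall including the antenna at the top.",
    "baseline-prompt-gpt-4o-mini": "330 metres tall",  # a generated output column
}
score = toy_groundedness(row[key_map["output"]], row[key_map["context"]])
# Every word of the generated answer appears in the context, so the toy score is 1.0
```

The point of the mapping is that `output` points at a column the experiment *generated*, while `context` and `input` point back at the original dataset columns.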
Compare results in the Summary tab
Switch to the Summary tab to see:
- Summary table — aggregate scores per prompt variant and model, including average response time, total tokens, and completion tokens
- Spider chart — visual comparison of all evaluation metrics across variants
- Evaluation charts — per-metric score distribution across prompt/model combinations
Pick the winner
- Click Choose Winner (crown icon) in the Summary tab
- The Winner Settings drawer opens — adjust importance weights (0–10 scale) for:
- Evaluation metrics (e.g. groundedness)
- Average Response Time
- Completion Tokens
- Total Tokens
- Click Save & Run
The winning variant is ranked in the summary table.
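The weighted comparison can be thought of as collapsing each variant's metrics into a single score: evaluation metrics add value in proportion to their weight, while time and token costs count against it. The platform's exact formula is not documented here; the following is a plausible sketch with made-up weights and aggregate numbers:

```python
# Hypothetical weights mirroring the 0-10 importance sliders.
weights = {"groundedness": 10, "avg_response_time": 1, "total_tokens": 1}

# Made-up aggregate results for two variants on gpt-4o-mini.
variants = {
    "baseline-prompt-gpt-4o-mini": {"groundedness": 0.72, "avg_response_time": 1.2, "total_tokens": 950},
    "cot-prompt-gpt-4o-mini": {"groundedness": 0.95, "avg_response_time": 1.8, "total_tokens": 1400},
}

def weighted_score(m: dict) -> float:
    score = weights["groundedness"] * m["groundedness"]             # higher is better
    score -= weights["avg_response_time"] * m["avg_response_time"]  # lower is better
    score -= weights["total_tokens"] * m["total_tokens"] / 1000     # scaled; lower is better
    return score

winner = max(variants, key=lambda name: weighted_score(variants[name]))
# With these numbers, the higher groundedness outweighs the extra latency
# and tokens, so the chain-of-thought variant comes out on top.
```

Raising the token or latency weights shifts the balance back toward the cheaper baseline prompt, which is exactly the trade-off the Winner Settings sliders let you express.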
What you built
You can now run prompt A/B tests across multiple models, evaluate outputs, and pick the winning variant using weighted metrics.
- Created a 5-row test dataset with questions, context, and expected answers
- Configured two prompt variants (baseline and chain-of-thought) across multiple models
- Ran the experiment to generate outputs for every prompt/model/row combination
- Evaluated generated outputs for groundedness from the Data tab
- Compared aggregate scores and performance in the Summary tab with spider charts
- Selected the winning prompt using weighted metric comparison