Learn how to prepare and integrate datasets from various sources (in-memory, CSV, JSON, JSONL) for effective prompt optimization.
Datasets are the backbone of effective prompt optimization. They provide the ground-truth examples that the Evaluator uses to score your prompts, guiding the optimizer towards better performance. A high-quality, representative dataset is the single most important factor for a successful optimization run. This cookbook demonstrates how to prepare and use datasets from different sources with the agent-opt library.
Regardless of the source, the agent-opt library expects your final dataset to be a Python list of dictionaries. Each dictionary in the list represents a single data point or “row.” The keys of the dictionary are the column names, and the values are the corresponding data.
# This is the target format for all data sources
[
    {"column_1": "data A1", "column_2": "data B1"},
    {"column_1": "data A2", "column_2": "data B2"},
    # ... and so on
]
# A list of dictionaries, ready to be used by the optimizer.
in_memory_dataset = [
    {
        "question": "What is the capital of France?",
        "context": "France is a country in Western Europe. Its capital and largest city is Paris.",
        "ground_truth_answer": "Paris"
    },
    {
        "question": "Who painted the Mona Lisa?",
        "context": "The Mona Lisa is a half-length portrait painting by Italian artist Leonardo da Vinci.",
        "ground_truth_answer": "Leonardo da Vinci"
    },
]
For larger datasets, you’ll typically load them from files. We recommend using the pandas library as it provides a simple and powerful way to read various formats and convert them into the required list of dictionaries.
This is the most common format. Assuming you have a data.csv file:
question,context,ground_truth_answer
"What is the capital of France?","France is a country...","Paris"
"Who painted the Mona Lisa?","The Mona Lisa is a painting...","Leonardo da Vinci"
You can load it easily with pandas:
import pandas as pd

df = pd.read_csv("data.csv")

# The `to_dict("records")` method is the key to getting the correct format.
dataset_from_csv = df.to_dict(orient="records")

print(dataset_from_csv)
# Output:
# [
#   {'question': 'What is the capital of France?', 'context': 'France is a country...', 'ground_truth_answer': 'Paris'},
#   ...
# ]
If your data.json file is a list of objects, you can use either pandas or the built-in json library.
[
    {
        "question": "What is the capital of France?",
        "context": "France is a country...",
        "ground_truth_answer": "Paris"
    },
    {
        "question": "Who painted the Mona Lisa?",
        "context": "The Mona Lisa is a painting...",
        "ground_truth_answer": "Leonardo da Vinci"
    }
]
import pandas as pd

df = pd.read_json("data.json", orient="records")
dataset_from_json = df.to_dict(orient="records")

# Alternatively, with the json library:
# import json
# with open("data.json", "r") as f:
#     dataset_from_json = json.load(f)
For very large datasets, the JSON Lines (.jsonl) format is common, where each line is a separate JSON object. pandas handles this seamlessly.
{"question": "What is the capital of France?", "context": "France is a country...", "ground_truth_answer": "Paris"}{"question": "Who painted the Mona Lisa?", "context": "The Mona Lisa is a painting...", "ground_truth_answer": "Leonardo da Vinci"}
import pandas as pd

df = pd.read_json("data.jsonl", lines=True)
dataset_from_jsonl = df.to_dict(orient="records")
3. The DataMapper: Connecting Your Dataset to the Optimizer
The DataMapper is a crucial component that acts as a “translator.” It tells the optimizer and evaluator how to use the columns from your dataset. You define this translation with a key_map dictionary:
The keys are the generic names that the Evaluator expects (e.g., response, context).
The values are the specific column names from your dataset (e.g., ground_truth_answer, article_text).
Imagine your dataset has columns article_text and ideal_summary, and you are using the summary_quality evaluator, which expects inputs named input and output.
from fi.opt.datamappers import BasicDataMapper

# This tells the system how to connect the pieces.
data_mapper = BasicDataMapper(
    key_map={
        # Evaluator's expected key : Your dataset's column name
        "input": "article_text",
        # 'generated_output' is a special reserved name for the text
        # that comes from the Generator being optimized.
        "output": "generated_output"
    }
)
Running optimization on a very large dataset can be slow and expensive. The agent-opt optimizers are designed to work effectively with a representative sample of your data. You can easily sample your dataset after loading it.
import pandas as pd
import random

# Load the full dataset (could have thousands of rows)
df = pd.read_csv("large_dataset.csv")
full_dataset = df.to_dict(orient="records")

# Select a random sample to use for optimization
sample_size = 100
if len(full_dataset) > sample_size:
    optimization_dataset = random.sample(full_dataset, sample_size)
else:
    optimization_dataset = full_dataset

print(f"Using a sample of {len(optimization_dataset)} examples for optimization.")

# Pass the smaller `optimization_dataset` to the optimizer
# result = optimizer.optimize(..., dataset=optimization_dataset)
A good sample size for most optimizers is between 30 and 200 examples. This provides a strong enough signal for improvement without excessive cost.
A small, high-quality, and diverse dataset of 20-50 examples is often more effective than a large, noisy dataset of thousands of examples. Ensure your ground-truth answers are accurate and consistent.
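A quick programmatic check can catch obvious quality problems before you spend an optimization run on them. The following is a minimal sketch, assuming your dataset uses the question and ground_truth_answer columns from the earlier examples; the check_dataset_quality helper is illustrative, not part of agent-opt, so adjust the column names to match your own data.

def check_dataset_quality(dataset, answer_key="ground_truth_answer", input_key="question"):
    """Flag rows with missing answers and inputs that repeat with a different answer."""
    problems = []
    first_answer_for = {}
    for i, row in enumerate(dataset):
        answer = str(row.get(answer_key) or "").strip()
        if not answer:
            problems.append(f"Row {i}: empty or missing '{answer_key}'")
        question = row.get(input_key)
        if question in first_answer_for and first_answer_for[question] != answer:
            problems.append(f"Row {i}: '{input_key}' repeated with a different '{answer_key}'")
        first_answer_for.setdefault(question, answer)
    return problems

for issue in check_dataset_quality(in_memory_dataset):
    print(issue)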
Represent Edge Cases
Your dataset should include examples of tricky or unusual inputs that your initial prompt struggles with. The optimizer will use these “hard cases” to learn how to make the prompt more robust.
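One simple way to do this is to keep your hand-picked hard cases in their own file and always include them alongside a random sample of everyday examples. The sketch below assumes two hypothetical files, hard_cases.csv and typical_cases.csv, with the same columns; it is one possible approach, not a required agent-opt workflow.

import random
import pandas as pd

# Hand-curated tricky inputs the current prompt struggles with (hypothetical file names)
hard_cases = pd.read_csv("hard_cases.csv").to_dict(orient="records")
typical_cases = pd.read_csv("typical_cases.csv").to_dict(orient="records")

# Keep every hard case, then top up with random typical examples
sample_size = 100
remaining = max(sample_size - len(hard_cases), 0)
optimization_dataset = hard_cases + random.sample(typical_cases, min(remaining, len(typical_cases)))
random.shuffle(optimization_dataset)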
Keep Column Names Simple
Use simple, descriptive column names in your source files (e.g., question, context, summary) to make mapping easier. Avoid spaces or special characters in column headers.
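If a source file already has awkward headers, you can normalize them in pandas before converting to records. A minimal sketch, assuming a header such as "Ground Truth Answer" that you want to turn into ground_truth_answer:

import pandas as pd

df = pd.read_csv("data.csv")

# Normalize headers: trim whitespace, lowercase, and replace spaces with underscores
df.columns = [col.strip().lower().replace(" ", "_") for col in df.columns]

dataset = df.to_dict(orient="records")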