Learn how to prepare and integrate datasets from various sources (in-memory, CSV, JSON, JSONL) for effective prompt optimization.
Datasets are the backbone of effective prompt optimization. They provide the ground-truth examples that the Evaluator uses to score your prompts, guiding the optimizer towards better performance. A high-quality, representative dataset is the single most important factor for a successful optimization run. This cookbook demonstrates how to prepare and use datasets from different sources with the agent-opt library.
Regardless of the source, the agent-opt library expects your final dataset to be a Python list of dictionaries. Each dictionary in the list represents a single data point or “row.” The keys of the dictionary are the column names, and the values are the corresponding data.
# This is the target format for all data sources
[
    {"column_1": "data A1", "column_2": "data B1"},
    {"column_1": "data A2", "column_2": "data B2"},
    # ... and so on
]
# A list of dictionaries, ready to be used by the optimizer.
in_memory_dataset = [
    {
        "question": "What is the capital of France?",
        "context": "France is a country in Western Europe. Its capital and largest city is Paris.",
        "ground_truth_answer": "Paris"
    },
    {
        "question": "Who painted the Mona Lisa?",
        "context": "The Mona Lisa is a half-length portrait painting by Italian artist Leonardo da Vinci.",
        "ground_truth_answer": "Leonardo da Vinci"
    },
]
For larger datasets, you’ll typically load them from files. We recommend using the pandas library as it provides a simple and powerful way to read various formats and convert them into the required list of dictionaries.
This is the most common format. Assuming you have a data.csv file:
question,context,ground_truth_answer
"What is the capital of France?","France is a country...","Paris"
"Who painted the Mona Lisa?","The Mona Lisa is a painting...","Leonardo da Vinci"
You can load it easily with pandas:
import pandas as pd

df = pd.read_csv("data.csv")

# The `to_dict("records")` method is the key to getting the correct format.
dataset_from_csv = df.to_dict(orient="records")

print(dataset_from_csv)
# Output:
# [
#   {'question': 'What is the capital of France?', 'context': 'France is a country...', 'ground_truth_answer': 'Paris'},
#   ...
# ]
If your data.json file is a list of objects, you can use either pandas or the built-in json library.
[
    {
        "question": "What is the capital of France?",
        "context": "France is a country...",
        "ground_truth_answer": "Paris"
    },
    {
        "question": "Who painted the Mona Lisa?",
        "context": "The Mona Lisa is a painting...",
        "ground_truth_answer": "Leonardo da Vinci"
    }
]
import pandas as pd

df = pd.read_json("data.json", orient="records")
dataset_from_json = df.to_dict(orient="records")

# Alternatively, with the json library:
# import json
# with open("data.json", "r") as f:
#     dataset_from_json = json.load(f)
For very large datasets, the JSON Lines (.jsonl) format is common, where each line is a separate JSON object. pandas handles this seamlessly.
{"question": "What is the capital of France?", "context": "France is a country...", "ground_truth_answer": "Paris"}{"question": "Who painted the Mona Lisa?", "context": "The Mona Lisa is a painting...", "ground_truth_answer": "Leonardo da Vinci"}
import pandas as pd

df = pd.read_json("data.jsonl", lines=True)
dataset_from_jsonl = df.to_dict(orient="records")
3. The DataMapper: Connecting Your Dataset to the Optimizer
The DataMapper is a crucial component that acts as a “translator.” It tells the optimizer and evaluator how to use the columns from your dataset. You define this translation with a key_map dictionary:
The keys are the generic names that the Evaluator expects (e.g., response, context).
The values are the specific column names from your dataset (e.g., ground_truth_answer, article_text).
Imagine your dataset has columns article_text and ideal_summary, and you are using the summary_quality evaluator, which expects inputs named input and output.
from fi.opt.datamappers import BasicDataMapper

# This tells the system how to connect the pieces.
data_mapper = BasicDataMapper(
    key_map={
        # Evaluator's expected key : Your dataset's column name
        "input": "article_text",
        # 'generated_output' is a special reserved name for the text
        # that comes from the Generator being optimized.
        "output": "generated_output"
    }
)
Running optimization on a very large dataset can be slow and expensive. The agent-opt optimizers are designed to work effectively with a representative sample of your data. You can easily sample your dataset after loading it.
import pandas as pd
import random

# Load the full dataset (could have thousands of rows)
df = pd.read_csv("large_dataset.csv")
full_dataset = df.to_dict(orient="records")

# Select a random sample to use for optimization
sample_size = 100
if len(full_dataset) > sample_size:
    optimization_dataset = random.sample(full_dataset, sample_size)
else:
    optimization_dataset = full_dataset

print(f"Using a sample of {len(optimization_dataset)} examples for optimization.")

# Pass the smaller `optimization_dataset` to the optimizer
# result = optimizer.optimize(..., dataset=optimization_dataset)
A good sample size for most optimizers is between 30 and 200 examples. This provides a strong enough signal for improvement without excessive cost.
A small, high-quality, and diverse dataset of 20-50 examples is often more effective than a large, noisy dataset of thousands of examples. Ensure your ground-truth answers are accurate and consistent.
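A quick programmatic check can catch obvious quality problems before you spend an optimization run on them. The following is a minimal sketch, assuming your dataset uses the question and ground_truth_answer columns from the earlier examples; the check_dataset_quality helper is illustrative, not part of agent-opt, so adjust the column names to match your own data.

def check_dataset_quality(dataset, answer_key="ground_truth_answer", input_key="question"):
    """Flag rows with missing answers and inputs that repeat with a different answer."""
    problems = []
    first_answer_for = {}
    for i, row in enumerate(dataset):
        answer = str(row.get(answer_key) or "").strip()
        if not answer:
            problems.append(f"Row {i}: empty or missing '{answer_key}'")
        question = row.get(input_key)
        if question in first_answer_for and first_answer_for[question] != answer:
            problems.append(f"Row {i}: '{input_key}' repeated with a different '{answer_key}'")
        first_answer_for.setdefault(question, answer)
    return problems

for issue in check_dataset_quality(in_memory_dataset):
    print(issue)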
Represent Edge Cases
Your dataset should include examples of tricky or unusual inputs that your initial prompt struggles with. The optimizer will use these “hard cases” to learn how to make the prompt more robust.
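One simple way to do this is to keep your hand-picked hard cases in their own file and always include them alongside a random sample of everyday examples. The sketch below assumes two hypothetical files, hard_cases.csv and typical_cases.csv, with the same columns; it is one possible approach, not a required agent-opt workflow.

import random
import pandas as pd

# Hand-curated tricky inputs the current prompt struggles with (hypothetical file names)
hard_cases = pd.read_csv("hard_cases.csv").to_dict(orient="records")
typical_cases = pd.read_csv("typical_cases.csv").to_dict(orient="records")

# Keep every hard case, then top up with random typical examples
sample_size = 100
remaining = max(sample_size - len(hard_cases), 0)
optimization_dataset = hard_cases + random.sample(typical_cases, min(remaining, len(typical_cases)))
random.shuffle(optimization_dataset)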
Keep Column Names Simple
Use simple, descriptive column names in your source files (e.g., question, context, summary) to make mapping easier. Avoid spaces or special characters in column headers.
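If a source file already has awkward headers, you can normalize them in pandas before converting to records. A minimal sketch, assuming a header such as "Ground Truth Answer" that you want to turn into ground_truth_answer:

import pandas as pd

df = pd.read_csv("data.csv")

# Normalize headers: trim whitespace, lowercase, and replace spaces with underscores
df.columns = [col.strip().lower().replace(" ", "_") for col in df.columns]

dataset = df.to_dict(orient="records")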