Prompt Optimization: Improve a Prompt Automatically
Use the agent-opt SDK to take a weak baseline prompt, run automated optimization, and extract the best-performing variant with before/after scores - no manual prompt engineering required.
| Time | Difficulty | Package |
|---|---|---|
| 15 min | Intermediate | agent-opt |
Prerequisites
- A FutureAGI account → app.futureagi.com
- API keys: `FI_API_KEY` and `FI_SECRET_KEY` (see Get your API keys)
- An OpenAI API key (used by the optimizer's teacher model)
- Python 3.9+
Install
```bash
pip install agent-opt

export FI_API_KEY="your-api-key"
export FI_SECRET_KEY="your-secret-key"
export OPENAI_API_KEY="your-openai-api-key"
```
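Optional: before running any tutorial code, you can fail fast if a credential is missing. This is a small sanity check of your own environment, not part of the SDK; `check_credentials` is an illustrative helper.

```python
import os

def check_credentials(required=("FI_API_KEY", "FI_SECRET_KEY", "OPENAI_API_KEY")):
    """Return the names of any environment variables from the setup above that are unset."""
    return [name for name in required if not os.environ.get(name)]

missing = check_credentials()
if missing:
    print(f"Missing environment variables: {', '.join(missing)}")
```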
Tutorial
Define your dataset and baseline prompt
The optimizer needs labeled examples - inputs where you know what a good output looks like. This is your ground truth for scoring.
Tip
To bootstrap labeled examples faster, start with Generate Synthetic Data and then refine the labels.
```python
# Your test dataset: complex multi-fact articles that need precise extraction
dataset = [
    {
        "article": "A Phase III clinical trial conducted across 47 hospitals in 12 countries found that combining immunotherapy drug pembrolizumab with a novel mRNA vaccine reduced melanoma recurrence by 44% compared to pembrolizumab alone over a 3-year follow-up period. However, 18% of patients in the combination group experienced grade 3+ immune-related adverse events, compared to 11% in the monotherapy group. The trial enrolled 1,089 patients with stage IIB-IV melanoma who had undergone complete surgical resection. Researchers noted that the benefit was most pronounced in patients with PD-L1-positive tumors, where recurrence dropped by 62%.",
        "target_summary": "A 12-country Phase III trial of 1,089 melanoma patients showed that combining pembrolizumab with an mRNA vaccine cut recurrence by 44% (62% in PD-L1-positive cases) over 3 years, though grade 3+ adverse events rose from 11% to 18%.",
    },
    {
        "article": "The European Central Bank raised interest rates by 25 basis points to 4.5%, marking the tenth consecutive hike since July 2022. Core inflation in the eurozone fell to 4.3% in September from 5.3% in August, but remains well above the 2% target. ECB President Christine Lagarde stated that rates have reached a level that, 'maintained for a sufficiently long duration, will make a substantial contribution to the timely return of inflation to the target.' Markets are now pricing in rate cuts starting Q2 2024, though several governing council members pushed back against this expectation.",
        "target_summary": "The ECB raised rates 25bp to 4.5% (tenth straight hike), with core eurozone inflation falling to 4.3% but still above the 2% target; Lagarde signaled a hold while markets price cuts from Q2 2024.",
    },
    {
        "article": "Meta's new open-source large language model Llama 3 was trained on 15 trillion tokens using a cluster of 16,384 NVIDIA H100 GPUs over approximately 54 days. The 70B parameter model achieves 82.0 on MMLU, surpassing GPT-3.5 Turbo and approaching GPT-4's performance on several benchmarks. However, independent evaluations revealed significant weaknesses in mathematical reasoning (scoring 48.2 on MATH versus GPT-4's 67.1) and multilingual tasks. The model was released under a permissive license allowing commercial use for companies with fewer than 700 million monthly active users.",
        "target_summary": "Meta's Llama 3 (70B params, trained on 15T tokens with 16K H100s) scores 82.0 on MMLU near GPT-4 level, but lags in math (48.2 vs 67.1 on MATH); commercially licensed for companies under 700M MAU.",
    },
    {
        "article": "Japan's population declined by 837,000 in 2023, the largest annual drop since records began in 1968. The fertility rate fell to 1.20, well below the 2.1 replacement level. Prime Minister Kishida announced a $25 billion child-rearing support package including increased childcare subsidies, flexible work mandates for companies with 100+ employees, and a new parental leave scheme covering 80% of salary for up to 28 weeks. Economists warn that without immigration reform, Japan's working-age population will shrink by 40% by 2065, threatening pension systems and GDP growth.",
        "target_summary": "Japan lost 837K people in 2023 (record drop) with fertility at 1.20; Kishida's $25B support package includes childcare subsidies and 80%-salary parental leave, but economists warn the working-age population could shrink 40% by 2065 without immigration reform.",
    },
    {
        "article": "A collaboration between DeepMind and Isomorphic Labs used an updated version of AlphaFold to predict the structures of all 214 million known proteins plus 100 million protein-ligand interactions. The new model achieves atomic-level accuracy for 78% of predictions, up from 58% in the original AlphaFold 2. Drug discovery company Recursion Pharmaceuticals reported that integrating AlphaFold predictions into their pipeline reduced the hit-to-lead phase from 18 months to 4 months for two oncology programs, though three other programs showed no significant speedup due to limitations in predicting protein dynamics and post-translational modifications.",
        "target_summary": "DeepMind's updated AlphaFold now predicts 214M protein structures and 100M protein-ligand interactions with 78% atomic accuracy (up from 58%); Recursion cut hit-to-lead from 18 to 4 months in 2 oncology programs, though 3 others saw no benefit.",
    },
    {
        "article": "Tesla reported Q3 2024 revenue of $25.18 billion, a 9% year-over-year increase driven by 462,890 vehicle deliveries. However, automotive gross margins fell to 17.9% from 26.2% a year earlier due to aggressive price cuts averaging 15-25% across models. The energy generation and storage segment grew 40% to $2.38 billion, with Megapack deployments reaching 5.8 GWh. CEO Elon Musk reaffirmed the 2025 launch timeline for a sub-$30,000 vehicle codenamed 'Redwood' and announced that FSD v12.5 had achieved 1 billion cumulative miles driven.",
        "target_summary": "Tesla Q3 revenue rose 9% to $25.18B on 462,890 deliveries, but auto margins fell to 17.9% (from 26.2%) amid 15-25% price cuts; energy storage grew 40% to $2.38B, and Musk confirmed a sub-$30K 'Redwood' vehicle for 2025.",
    },
    {
        "article": "The UN's 2024 Global Biodiversity Framework progress report found that only 8 of 23 Kunming-Montreal targets are on track for 2030. Protected areas now cover 17.6% of terrestrial land (target: 30%) and 8.4% of marine areas (target: 30%). Pesticide use increased by 4% globally despite a target to reduce it by 50%. Positive developments include a 23% increase in indigenous-led conservation areas and $15.4 billion in biodiversity finance mobilized in 2023, though the target is $200 billion annually by 2030. Deforestation rates in the Amazon fell 33% compared to 2022, while Southeast Asian deforestation accelerated by 12%.",
        "target_summary": "Only 8 of 23 UN biodiversity targets are on track: terrestrial protection at 17.6% (target 30%), marine at 8.4%, pesticide use up 4% despite a 50% reduction goal; Amazon deforestation fell 33% but rose 12% in Southeast Asia, and biodiversity finance reached $15.4B of a $200B target.",
    },
]
```
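Before wiring the dataset into the optimizer, a quick structural check can catch missing fields early. `validate_dataset` is a hypothetical helper for this tutorial, not part of agent-opt:

```python
def validate_dataset(dataset, required_keys=("article", "target_summary")):
    """Raise if any example is missing a field the optimizer and evaluator expect."""
    for i, item in enumerate(dataset):
        missing = [key for key in required_keys if not item.get(key)]
        if missing:
            raise ValueError(f"Example {i} is missing fields: {missing}")
    return len(dataset)

# With the 7-example dataset above, validate_dataset(dataset) returns 7
```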
```python
# Deliberately weak baseline prompt: vague, no structure, no constraints
baseline_prompt = "Tell me about this: {article}"
```

Configure the Evaluator
The Evaluator scores each candidate prompt’s outputs during optimization. It uses FutureAGI’s Turing models to judge output quality against the source article (e.g., how well the summary captures key information).
```python
from fi.opt.base import Evaluator

evaluator = Evaluator(
    eval_template="summary_quality",  # Turing model: scores how well the summary captures the article
    eval_model_name="turing_flash",   # fast and accurate for optimization rounds
)
```

Score the baseline prompt
Before optimizing, measure the baseline so you have a comparison point.
```python
import os

from openai import OpenAI
from fi.evals import Evaluator as FIEvaluator
from fi.opt.datamappers import BasicDataMapper

client = OpenAI()

data_mapper = BasicDataMapper(
    key_map={
        "input": "article",
        "output": "generated_output",
    }
)

# Score the baseline on the first 3 examples
baseline_eval = FIEvaluator(
    fi_api_key=os.environ["FI_API_KEY"],
    fi_secret_key=os.environ["FI_SECRET_KEY"],
)

baseline_scores = []
for item in dataset[:3]:
    prompt = baseline_prompt.format(article=item["article"])
    response = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "user", "content": prompt}],
    )
    output = response.choices[0].message.content
    result = baseline_eval.evaluate(
        eval_templates="summary_quality",
        inputs={"output": output, "input": item["article"]},
        model_name="turing_flash",
    )
    baseline_scores.append(float(result.eval_results[0].output))

baseline_avg = sum(baseline_scores) / len(baseline_scores)
print(f"Baseline average score: {baseline_avg:.3f}")
```

Run MetaPrompt optimization
MetaPromptOptimizer uses a powerful teacher model (GPT-4o) to iteratively rewrite and improve the prompt. Each round generates candidate prompts, scores them with the evaluator, and keeps the best.
```python
from fi.opt.generators import LiteLLMGenerator
from fi.opt.optimizers import MetaPromptOptimizer

# Teacher model: the LLM that rewrites prompts
teacher = LiteLLMGenerator(model="gpt-4o", prompt_template="{prompt}")

optimizer = MetaPromptOptimizer(
    teacher_generator=teacher,
)

result = optimizer.optimize(
    evaluator=evaluator,
    data_mapper=data_mapper,
    dataset=dataset,
    initial_prompts=[baseline_prompt],
    task_description="Generate a concise, one-sentence news summary that captures the key fact and impact.",
    eval_subset_size=7,  # evaluate all 7 examples per round
)
```

Optimization typically takes 2-5 minutes depending on dataset size and number of rounds.
Compare results and extract the winning prompt
```python
print("\n--- Optimization Results ---")
print(f"Baseline score: {baseline_avg:.3f}")
print(f"Optimized score: {result.final_score:.3f}")
print(f"Improvement: +{result.final_score - baseline_avg:.3f}\n")

print("Best prompt found:")
print("-" * 60)
best_prompt = result.best_generator.get_prompt_template()
print(best_prompt)
print("-" * 60)

# Show round-by-round progress
print("\nOptimization history:")
for i, iteration in enumerate(result.history):
    print(f"  Round {i+1}: score={iteration.average_score:.3f}")
```

Example output:
```text
--- Optimization Results ---
Baseline score: 0.421
Optimized score: 0.847
Improvement: +0.426

Best prompt found:
------------------------------------------------------------
Write a single, precise sentence that summarizes the most
important finding or event in the article, including any
key statistic, named entity, or deadline. Focus on what
is new, not background information.

Article: {article}
------------------------------------------------------------

Optimization history:
  Round 1: score=0.531
  Round 2: score=0.673
  Round 3: score=0.741
  Round 4: score=0.804
  Round 5: score=0.847
```

Use the optimized prompt in your application
```python
from openai import OpenAI

client = OpenAI()

def summarize(article: str) -> str:
    # Slot the article into the winning prompt template
    prompt = best_prompt.replace("{article}", article)
    response = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "user", "content": prompt}],
    )
    return response.choices[0].message.content

# Test it on a new article
test_article = """
NASA's Artemis III mission has been delayed until 2027 due to spacesuit development
challenges. The mission was originally planned for 2025 and would be the first
crewed lunar landing since Apollo 17 in 1972.
"""

print(summarize(test_article))
# → "NASA's Artemis III lunar landing has been postponed to 2027 due to spacesuit delays."
```

Tip
Save the winning prompt to FutureAGI’s Prompt Management so it’s versioned, shareable, and can be fetched by name in production. See Prompt Versioning.
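If you are not ready to adopt Prompt Management, a minimal local fallback is to persist the winner and its scores to disk. `save_prompt` below is an illustrative helper; the variables in the commented call are the ones produced earlier in this tutorial:

```python
import json
from pathlib import Path

def save_prompt(path, prompt, baseline_score, optimized_score):
    """Write the winning prompt and its before/after scores to a JSON file."""
    record = {
        "prompt": prompt,
        "baseline_score": baseline_score,
        "optimized_score": optimized_score,
    }
    Path(path).write_text(json.dumps(record, indent=2))
    return record

# In this tutorial:
# save_prompt("best_prompt.json", best_prompt, baseline_avg, result.final_score)
```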
Alternative: Bayesian Search for few-shot optimization
If your task benefits from few-shot examples, use BayesianSearchOptimizer instead; it finds the optimal number and selection of examples to include in the prompt automatically.
```python
from fi.opt.optimizers import BayesianSearchOptimizer

# Reuse the same summarization dataset from Step 1
bayesian_optimizer = BayesianSearchOptimizer(
    inference_model_name="gpt-4o-mini",
    n_trials=10,  # configurations to test
    min_examples=1,
    max_examples=3,
    example_template="Article: {article}\nSummary: {target_summary}",
)

result = bayesian_optimizer.optimize(
    evaluator=evaluator,
    data_mapper=data_mapper,
    dataset=dataset,
    initial_prompts=["Write a concise one-sentence summary of this article:"],
)

print(f"Best few-shot prompt:\n{result.best_generator.get_prompt_template()}")
```
Other optimization strategies
This guide uses MetaPrompt and Bayesian Search, but FutureAGI offers six optimization algorithms — each suited to different scenarios:
| Optimizer | Best for | How it works |
|---|---|---|
| Meta-Prompt | General prompt improvement | A teacher LLM iteratively rewrites the prompt based on eval feedback |
| Bayesian Search | Few-shot example selection | Uses Bayesian optimization to find the best number and combination of examples |
| ProTeGi | Targeted prompt editing | Generates localized edits to specific parts of the prompt, then tests each |
| GEPA | Exploring diverse prompt styles | Evolutionary approach — breeds, mutates, and selects prompts over generations |
| PromptWizard | Multi-stage refinement | Combines critique, refinement, and example synthesis in a structured pipeline |
| Random Search | Quick baseline comparison | Generates random prompt variants and picks the best — useful as a sanity check |
Tip
Not sure which to pick? Start with Meta-Prompt for instruction tuning or Bayesian Search for few-shot tasks. See the Optimizers Overview for a detailed comparison and decision tree.
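The decision guidance above can be captured in a tiny helper. This is purely illustrative: it returns the strategy names from the table, not SDK classes.

```python
def pick_strategy(few_shot=False, quick_baseline=False, explore_styles=False):
    """Map common scenarios to the optimizer names in the table above."""
    if quick_baseline:
        return "Random Search"   # fast sanity check
    if few_shot:
        return "Bayesian Search" # optimizes example count and selection
    if explore_styles:
        return "GEPA"            # evolutionary exploration of prompt styles
    return "Meta-Prompt"         # default for general instruction tuning

print(pick_strategy(few_shot=True))  # → Bayesian Search
```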
What you built
You can now automatically optimize any prompt using MetaPromptOptimizer or BayesianSearchOptimizer, measure the improvement, and deploy the winning variant.
- Defined a labeled dataset and a weak baseline prompt
- Scored the baseline to establish a comparison point
- Ran `MetaPromptOptimizer` for 5 rounds of automated refinement
- Extracted the winning prompt and measured the improvement (+0.426 in this example)
- Swapped the optimized prompt into the application with no other code changes
- Learned when to use `BayesianSearchOptimizer` for few-shot tasks instead