Compare Optimization Strategies: ProTeGi, GEPA, and PromptWizard

Run three optimization algorithms on the same task with different evaluation metrics and compare results to pick the best strategy for your use case.

📝 TL;DR

Run ProTeGi, GEPA, and PromptWizard on the same task with different eval metrics, then compare winning prompts side by side to pick the best optimization strategy.

By the end of this guide you will have run ProTeGi, GEPA, and PromptWizard on the same customer support task, used different evaluation metrics to score candidates, and compared the winning prompts side by side.

Time: 15 min · Difficulty: Intermediate · Package: agent-opt
Prerequisites
  • FutureAGI account → app.futureagi.com
  • API keys: FI_API_KEY and FI_SECRET_KEY (see Get your API keys)
  • An OpenAI API key (used by the optimizer’s teacher model)
  • Python 3.9+

Install

pip install agent-opt
export FI_API_KEY="your-api-key"
export FI_SECRET_KEY="your-secret-key"
export OPENAI_API_KEY="your-openai-api-key"

Tip

This cookbook builds on Prompt Optimization, which covers MetaPrompt and Bayesian Search. Here we focus on the remaining three strategies with a different task.


Define the dataset and baseline

The task is customer support response generation: the optimizer must improve how well the agent answers user questions using the provided context.

from fi.opt.base import Evaluator
from fi.opt.datamappers import BasicDataMapper
from fi.opt.generators import LiteLLMGenerator

# Complex multi-constraint support scenarios — a vague prompt will miss key details
dataset = [
    {
        "question": "I signed up for the annual plan 3 months ago but now I want to switch to monthly. Do I get a refund for the unused months, and will I lose my team seats?",
        "context": "Annual plans can be switched to monthly at any time via Settings → Billing → Change Plan. A prorated refund is issued for unused months minus a 10% early termination fee. Team seats are preserved during the switch but each seat price increases from $8/month (annual) to $12/month (monthly). The switch takes effect immediately and the next monthly charge occurs 30 days later. If the account has more than 5 seats, a support ticket is required instead of self-service.",
        "ideal_response": "You can switch via Settings → Billing → Change Plan (or file a support ticket if you have more than 5 seats). You'll receive a prorated refund for unused months minus a 10% early termination fee. Team seats are preserved, but seat pricing increases from $8/month to $12/month. The switch is immediate, with your next monthly charge in 30 days.",
    },
    {
        "question": "Our SSO integration broke after the latest update. Users are getting 403 errors when trying to log in through Okta, but direct login still works.",
        "context": "SSO 403 errors after platform updates are typically caused by an expired SAML certificate or a changed Assertion Consumer Service (ACS) URL. Step 1: Check Settings → Security → SSO → verify the ACS URL matches your IdP configuration (it may have changed to include /v2/ after the update). Step 2: Re-download the SP metadata XML and upload it to Okta. Step 3: If the certificate expired, generate a new one from Settings → Security → Certificates. Note: SSO changes take up to 15 minutes to propagate. If issues persist after 15 minutes, contact support with the X-Request-ID from the 403 response header.",
        "ideal_response": "This is likely caused by a changed ACS URL or expired SAML certificate after the update. Check Settings → Security → SSO to verify the ACS URL (it may now include /v2/). Re-download the SP metadata XML and re-upload it to Okta. If your certificate expired, generate a new one under Settings → Security → Certificates. Changes take up to 15 minutes to propagate. If it still fails, contact support with the X-Request-ID from the 403 response header.",
    },
    {
        "question": "We need to comply with GDPR. Can you delete all data for users in the EU, and how do I prove the deletion happened?",
        "context": "GDPR data deletion requests can be submitted via Settings → Compliance → Data Deletion Request. You must specify users by email domain or a CSV upload of user IDs. Deletion is irreversible and covers: user profiles, activity logs, generated content, and analytics data. Deletion is queued and completed within 72 hours. A signed deletion certificate (PDF) is automatically emailed to the account owner and the requesting user. Backup data in cold storage is purged within 30 days. Note: aggregated anonymized analytics are exempt from deletion per GDPR Article 89. Active subscriptions must be cancelled before deletion can proceed.",
        "ideal_response": "Submit a request via Settings → Compliance → Data Deletion Request using email domains or a CSV of user IDs. Active subscriptions must be cancelled first. Deletion covers profiles, activity logs, content, and analytics, and completes within 72 hours (cold storage backups within 30 days). You'll receive a signed deletion certificate (PDF) as proof. Note: aggregated anonymized analytics are exempt under GDPR Article 89.",
    },
    {
        "question": "I'm seeing high latency on the API — responses that used to take 200ms are now taking 2-3 seconds. Nothing changed on our side.",
        "context": "API latency increases can be caused by: (1) Rate limiting — if you're near your plan's request limit, responses are throttled with increasing delay. Check the X-RateLimit-Remaining header. (2) Region routing — requests may be routed to a farther region during maintenance windows (check status.futureagi.com). (3) Payload size — responses over 50KB trigger compression which adds ~100ms. (4) Deprecated endpoint versions — v1 endpoints have a 500ms artificial delay to encourage migration to v2. Check your base URL. For persistent issues, enable request tracing by adding X-Debug-Trace: true header, then share the trace ID with support.",
        "ideal_response": "Check these common causes: (1) Rate limiting — inspect the X-RateLimit-Remaining header to see if you're being throttled. (2) Region routing — check status.futureagi.com for maintenance windows that may reroute requests. (3) Deprecated endpoints — v1 endpoints have a 500ms artificial delay; verify your base URL uses v2. (4) Large payloads — responses over 50KB trigger compression overhead. For debugging, add the X-Debug-Trace: true header and share the trace ID with support.",
    },
    {
        "question": "We want to set up a staging environment that mirrors production but with test data. How do we handle API keys and billing?",
        "context": "Staging environments can be created under Settings → Environments → Add Environment. Each environment gets its own API keys, separate usage tracking, and isolated data. Staging environments are free up to 10,000 API calls/month; beyond that, usage is billed at 50% of production rates. To mirror production config: use the 'Clone from Production' button which copies all settings, webhooks, and integrations (but not data). Test API keys have a 'test_' prefix and will reject production data patterns (credit cards, SSNs) as a safety measure. Staging and production share the same SSO configuration but can have different role assignments.",
        "ideal_response": "Create one via Settings → Environments → Add Environment, then use 'Clone from Production' to copy settings, webhooks, and integrations. Staging gets separate API keys (prefixed with 'test_') and isolated data. It's free up to 10K API calls/month, then billed at 50% of production rates. Test keys automatically reject production data patterns (credit cards, SSNs). SSO is shared between environments but role assignments can differ.",
    },
    {
        "question": "A webhook we set up for 'user.created' events stopped firing 2 days ago. How do we debug this?",
        "context": "Webhook delivery issues can be diagnosed from Settings → Integrations → Webhooks → click the webhook → Delivery Log. The log shows the last 100 delivery attempts with status codes and response bodies. Common failure causes: (1) Your endpoint returned non-2xx for 50+ consecutive attempts — the webhook is auto-disabled after 50 failures. Check the Status field (Active/Disabled). (2) TLS certificate on your endpoint expired — we require valid TLS for webhook delivery. (3) Response timeout — your endpoint must respond within 5 seconds or the delivery is marked failed. To re-enable a disabled webhook: fix the underlying issue, then click 'Re-enable' in the webhook settings. Test delivery using the 'Send Test Event' button. Webhook events are retained for 7 days and can be replayed from the Delivery Log.",
        "ideal_response": "Go to Settings → Integrations → Webhooks → select your webhook → Delivery Log. Check if the webhook status is 'Disabled' — it auto-disables after 50 consecutive failures. Common causes: your endpoint returned non-2xx responses, your TLS certificate expired, or your endpoint exceeded the 5-second response timeout. Fix the issue, click 'Re-enable', and use 'Send Test Event' to verify. You can replay missed events from the Delivery Log (retained for 7 days).",
    },
    {
        "question": "We're evaluating your Enterprise plan. What's included beyond Pro, and can we do a trial?",
        "context": "Enterprise plan additions beyond Pro: (1) Dedicated infrastructure — single-tenant deployment in your preferred cloud region (AWS, GCP, Azure). (2) 99.99% SLA (vs 99.9% for Pro). (3) SAML SSO with custom attribute mapping. (4) Audit log API with 1-year retention (vs 90 days on Pro). (5) Priority support with 1-hour response SLA and dedicated account manager. (6) Custom rate limits (negotiable). (7) SOC 2 Type II and HIPAA BAA available. Pricing is custom and starts at $2,000/month for up to 50 seats. Enterprise trials: 30-day proof-of-concept available with full features on shared infrastructure. To start a trial, contact sales@futureagi.com or click 'Contact Sales' on the pricing page. No credit card required for the trial.",
        "ideal_response": "Enterprise adds: dedicated single-tenant infrastructure (AWS/GCP/Azure), 99.99% SLA, SAML SSO with custom attribute mapping, 1-year audit log retention, priority support with 1-hour SLA and dedicated account manager, custom rate limits, plus SOC 2 Type II and HIPAA BAA. Pricing starts at $2,000/month for up to 50 seats. You can get a 30-day full-feature trial on shared infrastructure — contact sales@futureagi.com or click 'Contact Sales' on the pricing page. No credit card needed.",
    },
    {
        "question": "How do I set up role-based access control so developers can use the API but can't change billing or invite users?",
        "context": "RBAC is configured under Settings → Team → Roles. Built-in roles: Owner (full access), Admin (everything except billing and ownership transfer), Developer (API access, project management, no team or billing access), Viewer (read-only dashboard access). Custom roles can be created on Enterprise plans only. To restrict developers: assign them the 'Developer' role. Developers can: create/manage API keys, access all API endpoints, create/manage projects, view usage analytics. Developers cannot: access Settings → Billing, invite/remove team members, change SSO configuration, access audit logs, or manage webhooks. Role changes take effect on the user's next login. Bulk role assignment is available via CSV upload under Settings → Team → Bulk Update.",
        "ideal_response": "Go to Settings → Team → Roles and assign the 'Developer' role. Developers get full API access and can create/manage API keys and projects, but cannot access billing, invite/remove team members, change SSO, access audit logs, or manage webhooks. Role changes apply on next login. For bulk updates, use CSV upload under Settings → Team → Bulk Update. Custom roles are available on Enterprise plans.",
    },
]

# Deliberately vague baseline — no structure, no constraints, will miss key details
baseline_prompt = "Help with this: {question}\n\nInfo: {context}"

# context_adherence and chunk_utilization need context + output
context_mapper = BasicDataMapper(
    key_map={
        "output":  "generated_output",
        "context": "context",
    }
)

# completeness needs input + output
completeness_mapper = BasicDataMapper(
    key_map={
        "input":  "question",
        "output": "generated_output",
    }
)
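To make the key_map concrete, here is a minimal sketch of the field renaming it describes, assuming BasicDataMapper simply maps each eval template field to a key in the merged record (dataset row plus generated output) — the real implementation may do more:

```python
# Hypothetical sketch of what a key_map does (not agent-opt's actual code):
# rename fields from the merged record to the names the eval template expects.

def apply_key_map(key_map: dict, record: dict) -> dict:
    """Pull each evaluator field from the record under its mapped key."""
    return {eval_field: record[record_field]
            for eval_field, record_field in key_map.items()}

record = {
    "question": "How do I reset my password?",
    "context": "Passwords can be reset via Settings → Security.",
    "generated_output": "Go to Settings → Security to reset your password.",
}

# Same key_map as context_mapper above: the evaluator sees only output + context.
context_payload = apply_key_map(
    {"output": "generated_output", "context": "context"}, record
)
print(context_payload)
# {'output': 'Go to Settings → Security to reset your password.',
#  'context': 'Passwords can be reset via Settings → Security.'}
```

The completeness mapper works the same way, just pulling "question" into "input" instead of "context".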

Set up evaluators with different metrics

A good support response needs to be faithful to the docs (context adherence), use the relevant info (chunk utilization), and fully answer the question (completeness). Each optimizer gets a different metric so you can compare how metric choice affects the winning prompt.

# Evaluator 1: context_adherence — does the response stick to the provided context?
context_adherence_evaluator = Evaluator(
    eval_template="context_adherence",
    eval_model_name="turing_flash",
)

# Evaluator 2: chunk_utilization — how effectively does the response use the context?
chunk_utilization_evaluator = Evaluator(
    eval_template="chunk_utilization",
    eval_model_name="turing_flash",
)

# Evaluator 3: completeness — does the response fully answer the question?
completeness_evaluator = Evaluator(
    eval_template="completeness",
    eval_model_name="turing_flash",
)

Any built-in eval template works here — the optimizer is metric-agnostic.

Run ProTeGi with context adherence metric

ProTeGi generates localized edits to specific parts of the prompt, then tests each edit. It uses “textual gradients”: error-based feedback that guides targeted rewrites.
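The loop can be sketched in plain Python — a toy illustration only, not agent-opt's implementation. The stub `score` and `propose_edits` functions stand in for the metric and for the teacher LLM applying a textual gradient:

```python
# Toy sketch of a ProTeGi-style beam search (illustration, not agent-opt code).

def score(prompt: str) -> float:
    # Stand-in metric: reward longer, more specific prompts.
    return min(len(prompt) / 100, 1.0)

def propose_edits(prompt: str, feedback: str) -> list[str]:
    # Stand-in for the teacher model turning the textual gradient into edits.
    return [prompt + " Cite exact steps from the context.",
            prompt + " Answer every part of the question."]

beam = ["Help with this: {question}\n\nInfo: {context}"]
beam_size = 1

for round_num in range(2):
    candidates = list(beam)
    for prompt in beam:
        # The "textual gradient": error-based feedback from failed examples.
        feedback = "Response missed details from the context."
        candidates.extend(propose_edits(prompt, feedback))
    # Keep only the best beam_size prompts for the next round.
    beam = sorted(candidates, key=score, reverse=True)[:beam_size]

print(beam[0])
```

In the real optimizer both the feedback and the edits come from the teacher generator, and scoring comes from the evaluator you pass to `optimize`.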

from fi.opt.optimizers import ProTeGi

teacher = LiteLLMGenerator(model="gpt-4o", prompt_template="{prompt}")

# Values kept low for a quick demo run (~5 min).
# For production optimization, increase: num_gradients=4, errors_per_gradient=4,
# beam_size=4, num_rounds=5, eval_subset_size=len(dataset)
protegi_optimizer = ProTeGi(
    teacher_generator=teacher,
    num_gradients=1,
    errors_per_gradient=1,
    prompts_per_gradient=1,
    beam_size=1,
)

print("Running ProTeGi with context adherence metric...")
protegi_result = protegi_optimizer.optimize(
    evaluator=context_adherence_evaluator,
    data_mapper=context_mapper,
    dataset=dataset,
    initial_prompts=[baseline_prompt],
    num_rounds=1,
    eval_subset_size=2,
)

print(f"ProTeGi score: {protegi_result.final_score:.3f}")
print(f"Rounds completed: {len(protegi_result.history)}")

Expected output:

Running ProTeGi with context adherence metric...
ProTeGi score: 0.892
Rounds completed: 1

Run GEPA with chunk utilization metric

GEPA uses an evolutionary approach — it breeds, mutates, and selects prompts over generations. It explores more diverse prompt styles than gradient-based methods.
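The evolutionary idea can be sketched as follows — a toy illustration with stand-in mutation and fitness functions, not GEPA's actual algorithm:

```python
import random

# Toy sketch of evolutionary prompt search (illustration, not GEPA's code).
random.seed(0)

FRAGMENTS = [
    " Use only facts from the context.",
    " List concrete steps and exact settings paths.",
    " Address every part of the question.",
]

def fitness(prompt: str) -> float:
    # Stand-in for an LLM-judged metric like chunk utilization.
    return sum(frag in prompt for frag in FRAGMENTS) / len(FRAGMENTS)

def mutate(prompt: str) -> str:
    # Stand-in for the reflection model rewriting the prompt.
    return prompt + random.choice(FRAGMENTS)

population = ["Help with this: {question}\n\nInfo: {context}"]
for generation in range(4):
    offspring = [mutate(p) for p in population for _ in range(3)]
    # Selection: keep the fittest individuals for the next generation.
    population = sorted(population + offspring, key=fitness, reverse=True)[:2]

print(f"best fitness: {fitness(population[0]):.2f}")
```

In GEPA the reflection model plays the mutation role and the evaluator supplies fitness, which is why `max_metric_calls` is the main cost knob.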

from fi.opt.optimizers import GEPAOptimizer

gepa_optimizer = GEPAOptimizer(
    reflection_model="gpt-4o",      # powerful model for reflection and mutation
    generator_model="gpt-4o-mini",  # model used by the prompts being optimized
)

# max_metric_calls kept low for a quick demo. For real optimization, use 80-200.
print("Running GEPA with chunk utilization metric...")
gepa_result = gepa_optimizer.optimize(
    evaluator=chunk_utilization_evaluator,
    data_mapper=context_mapper,
    dataset=dataset,
    initial_prompts=[baseline_prompt],
    max_metric_calls=8,
)

print(f"GEPA score: {gepa_result.final_score:.3f}")
print(f"Rounds completed: {len(gepa_result.history)}")

Expected output:

Running GEPA with chunk utilization metric...
GEPA score: 0.871
Rounds completed: 2

Run PromptWizard with completeness metric

PromptWizard uses a 3-stage pipeline: mutate (generate prompt variants), score (evaluate candidates), and critique-refine (improve the best candidate). It applies different thinking styles (analytical, creative, step-by-step, etc.) during mutation for diverse candidates.
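The three stages can be sketched with stubs — a toy illustration of the pipeline shape, not PromptWizard's implementation:

```python
# Toy sketch of the mutate → score → critique-refine pipeline
# (illustration only, not PromptWizard's code).

THINKING_STYLES = ["analytical", "creative", "step-by-step"]

def mutate(prompt: str) -> list[str]:
    # Stage 1: one variant per thinking style.
    return [f"[{style}] {prompt}" for style in THINKING_STYLES]

def score(prompt: str) -> float:
    # Stage 2: stand-in for an LLM-judged completeness metric.
    return 0.5 + 0.1 * ("step-by-step" in prompt)

def critique_refine(prompt: str) -> str:
    # Stage 3: stand-in for the teacher critiquing and improving the winner.
    return prompt + " Cover every sub-question before answering."

candidates = mutate("Help with this: {question}\n\nInfo: {context}")
best = max(candidates, key=score)
refined = critique_refine(best)
print(refined)
```

In the real optimizer, `mutate_rounds` and `refine_iterations` control how many times stages 1 and 3 repeat, and the teacher generator performs both.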

from fi.opt.optimizers import PromptWizardOptimizer

# Values kept low for a quick demo run (~1 min).
# For production optimization, increase: mutate_rounds=3, refine_iterations=2,
# eval_subset_size=len(dataset)
promptwizard_optimizer = PromptWizardOptimizer(
    teacher_generator=teacher,
    mutate_rounds=1,
    refine_iterations=1,
    beam_size=1,
)

print("Running PromptWizard with completeness metric...")
pw_result = promptwizard_optimizer.optimize(
    evaluator=completeness_evaluator,
    data_mapper=completeness_mapper,
    dataset=dataset,
    initial_prompts=[baseline_prompt],
    task_description="Generate a helpful, context-grounded customer support response that addresses all parts of the question.",
    eval_subset_size=2,
)

print(f"PromptWizard score: {pw_result.final_score:.3f}")
print(f"Rounds completed: {len(pw_result.history)}")

Expected output:

Running PromptWizard with completeness metric...
PromptWizard score: 0.914
Rounds completed: 4

Tip

The parameters above are intentionally minimal so this cookbook runs in a few minutes. For real optimization, increase the values as noted in the code comments: more rounds, larger beam sizes, and evaluating the full dataset will produce significantly better prompts.

Compare results

results = {
    "ProTeGi (context adherence)": protegi_result,
    "GEPA (chunk utilization)": gepa_result,
    "PromptWizard (completeness)": pw_result,
}

print("\n" + "=" * 56)
print(f"{'Strategy':<30} {'Score':>8} {'Rounds':>8}")
print("=" * 56)

for name, result in results.items():
    print(f"{name:<30} {result.final_score:>8.3f} {len(result.history):>8}")

print("=" * 56)

# Show the winning prompt from each strategy
for name, result in results.items():
    prompt = result.best_generator.get_prompt_template()
    print(f"\n--- {name} ---")
    print(prompt[:200] + ("..." if len(prompt) > 200 else ""))

# Show round-by-round history for the best performer
best_name = max(results, key=lambda k: results[k].final_score)
best_result = results[best_name]
print(f"\n--- {best_name} — round history ---")
for i, iteration in enumerate(best_result.history):
    print(f"  Round {i+1}: score={iteration.average_score:.3f}")

Example output:

========================================================
Strategy                          Score   Rounds
========================================================
ProTeGi (context adherence)       0.892        1
GEPA (chunk utilization)          0.871        2
PromptWizard (completeness)       0.914        4
========================================================

--- ProTeGi (context adherence) ---
You are a customer support agent. Answer the question using ONLY the information in the provided context. Be specific and include exact steps, numbers, or links where avail...

--- GEPA (chunk utilization) ---
As a friendly support agent, provide a clear, actionable answer to the customer's question. Use the context below as your knowledge base. Structure your response with the m...

--- PromptWizard (completeness) ---
You are an expert customer support agent. Your task is to answer the customer's question completely and accurately using the provided context. Include all relevant details: s...

--- PromptWizard (completeness) — round history ---
  Round 1: score=0.731
  Round 2: score=0.812
  Round 3: score=0.867
  Round 4: score=0.914

When to use which strategy

  • ProTeGi. Best for: targeted edits to a decent starting prompt. Trade-off: fast convergence, but may miss globally different prompt structures.
  • GEPA. Best for: exploring diverse prompt styles from scratch. Trade-off: broadest search space, but uses more evaluations.
  • PromptWizard. Best for: multi-stage refinement with critique feedback. Trade-off: highest quality per evaluation, but slowest per round.
  • MetaPrompt. Best for: general-purpose prompt rewriting. Trade-off: good default — see Prompt Optimization.
  • Bayesian Search. Best for: few-shot example selection and ordering. Trade-off: best when examples matter more than instructions — see Prompt Optimization.
  • Random Search. Best for: quick sanity check or baseline comparison. Trade-off: lowest cost, useful to verify optimization adds value.
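These heuristics can be encoded as a small decision helper — a hypothetical function reflecting this guide's recommendations, not part of agent-opt:

```python
# Hypothetical strategy picker encoding the guidance above
# (names and heuristics are this guide's, not an agent-opt API).

def recommend_strategy(
    has_decent_baseline: bool,
    examples_matter_most: bool,
    exploring_from_scratch: bool,
) -> str:
    if examples_matter_most:
        return "Bayesian Search"      # few-shot selection and ordering
    if exploring_from_scratch:
        return "GEPA"                 # broadest search space
    if has_decent_baseline:
        return "ProTeGi"              # targeted edits, fast convergence
    return "MetaPrompt"               # good general-purpose default

print(recommend_strategy(has_decent_baseline=True,
                         examples_matter_most=False,
                         exploring_from_scratch=False))
# ProTeGi
```

When in doubt, run Random Search first as a cheap baseline to confirm that optimization adds value at all.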

What you built

You can now run and compare multiple optimization strategies on the same task to pick the best prompt for your use case.

  • Defined a customer support dataset with multi-constraint scenarios
  • Created three evaluators with different metrics (context adherence, chunk utilization, completeness)
  • Ran ProTeGi, GEPA, and PromptWizard on the same task
  • Compared winning prompts, scores, and round counts across strategies
  • Learned when to use each optimization strategy based on task characteristics