
1. Introduction

Prompt optimization appears simple: adjust instructions until outputs improve. In production, this approach consistently fails, for four reasons.

The first failure is the lack of evaluation baselines. Most teams do not have a stable, quantitative way to determine whether a prompt change is an improvement or a regression. Outputs are inspected manually, sampled inconsistently, and judged subjectively. Once behavior degrades, there is no reference point to diagnose why.

The second failure is reproducibility. Prompt changes are rarely versioned, benchmarked, or evaluated in a controlled manner. Improvements cannot be reliably reproduced across environments, team members, or time. As a result, prompt behavior becomes fragile and difficult to defend.

The third failure is iteration cost. Prompt refinement is performed through manual loops: edit, test a few examples, repeat. This process does not scale with dataset size, task complexity, or organizational velocity. As systems grow, iteration slows and confidence erodes.

The final failure is brittleness. Over time, prompts accumulate ad-hoc fixes for edge cases. Each fix introduces new interactions, making the prompt increasingly unstable. Small changes cause unexpected regressions, and prompt engineering devolves into reactive patching.

Prompt engineering relies on human intuition and local testing. This is sufficient for prototypes, but it breaks down when prompts must satisfy diverse inputs, strict correctness requirements, and cost constraints simultaneously. At that point, prompt behavior must be managed as a system, not as text.

2. Prompt Optimization as a First-Class Workflow

Even when prompt optimizers are available, using them requires stitching together evaluation logic, tracking prompt versions, comparing runs, and managing iteration manually. These steps are rarely standardized and are often handled through scripts, notebooks, or human judgment. As a result, optimization is slow, inconsistent, and difficult to repeat. Teams either stop optimizing or limit it to one-off experiments. Future AGI removes this operational burden by making prompt optimization a built-in workflow rather than a custom system. Outputs are scored consistently, prompt versions and results are stored and comparable, optimization loops are handled by the platform, and improved prompts are ranked and returned automatically. This allows teams to focus on defining behavior and success criteria instead of building and maintaining optimization infrastructure. Using Future AGI, prompt optimization is reduced to a small set of decisions:
  • what behavior to evaluate (by creating a dataset)
  • how success is measured (by defining an evaluator)
  • how improvement is explored (by choosing an optimizer)
Once these are defined, optimization runs as a single execution step. Prompt optimization stops being a research problem and becomes an execution problem.

3. Prompt Optimization using Future AGI

This section defines all required components to run prompt optimization using Future AGI. Each step introduces one concrete object, explains its role briefly, and shows the exact code required.

Step 1: Install Required Packages

Install the agent-opt package to get started with prompt optimization.
pip install agent-opt

Step 2: Set Environment Variables

These credentials are required to run evaluations and track optimization results in Future AGI. Click here to get your API keys.
import os

os.environ["FI_API_KEY"] ="YOUR_API_KEY"
os.environ["FI_SECRET_KEY"] ="YOUR_SECRET_KEY"

Step 3: Prepare the Dataset

The dataset defines the inputs against which prompt performance will be evaluated.
dataset = [
    {
        "article": "The James Webb Space Telescope captured detailed images of the Pillars of Creation.",
        "target_summary": "JWST captured new detailed images of the Pillars of Creation."
    },
    {
        "article": "Researchers discovered an enzyme that rapidly breaks down plastic.",
        "target_summary": "A newly discovered enzyme rapidly breaks down plastic."
    }
]
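For larger evaluation sets, the same structure can be loaded from a file rather than hard-coded. The sketch below is illustrative only: it assumes a JSONL file (one JSON object per line) whose records already contain article and target_summary fields matching the format above, and the filename is a placeholder.
import json

# Illustrative only: build the dataset from a JSONL file where each line
# is a JSON object with "article" and "target_summary" keys.
dataset = []
with open("summaries.jsonl", "r", encoding="utf-8") as f:
    for line in f:
        record = json.loads(line)
        dataset.append({
            "article": record["article"],
            "target_summary": record["target_summary"],
        })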

Step 4: Define the Prompt and Generator

Provide the initial prompt that will be optimized. The generator binds the prompt to a model configuration.
from fi.opt.generators import LiteLLMGenerator

prompt_template = "Summarize this: {article}"

generator = LiteLLMGenerator(
    model="gpt-4o-mini",
    prompt_template=prompt_template
)
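The {article} placeholder in the template is filled from each dataset row when the generator calls the model. As a quick, SDK-independent sanity check, you can render the template against a row with standard Python string formatting; this only illustrates the substitution, not the generator's internal behavior.
# Sanity check: render the template for the first dataset row.
sample_row = dataset[0]
print(prompt_template.format(article=sample_row["article"]))
# -> Summarize this: The James Webb Space Telescope captured detailed images of the Pillars of Creation.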

Step 5: Configure the Evaluator

The evaluator defines how output quality is measured. It acts as the objective function for optimization.
from fi.opt.base.evaluator import Evaluator

evaluator = Evaluator(
    eval_template="summary_quality",
    eval_model_name="turing_flash"
)
We are using one of Future AGI's built-in evals, summary_quality. Click here to learn about the other built-in evals Future AGI offers.
For maximum flexibility, you can define your own evaluation logic using a local LLM-as-a-judge. This is ideal for custom tasks or when you need a very specific evaluation rubric. Click here to learn more.
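To make the LLM-as-a-judge pattern concrete, here is a minimal, framework-agnostic sketch of a judge function built directly on litellm (assumed available, given the LiteLLMGenerator used above). It is not Future AGI's custom-evaluator API; the rubric, judge model, and 1-5 scale are assumptions to adapt to your task.
import litellm

def judge_summary(article: str, summary: str, model: str = "gpt-4o-mini") -> float:
    """Score a summary from 1 (poor) to 5 (excellent) with an LLM judge (illustrative rubric)."""
    rubric = (
        "Rate the summary of the article on a 1-5 scale for faithfulness "
        "and conciseness. Reply with a single integer only.\n\n"
        f"Article: {article}\n\nSummary: {summary}"
    )
    response = litellm.completion(
        model=model,
        messages=[{"role": "user", "content": rubric}],
    )
    return float(response.choices[0].message.content.strip())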

Step 6: Configure the DataMapper

The DataMapper connects dataset fields to evaluator inputs.
from fi.opt.datamappers import BasicDataMapper

data_mapper = BasicDataMapper(
    key_map={
        "input": "article",
        "output": "generated_output"
    }
)
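In this key_map, each key is an evaluator input and each value names where that input comes from: "input" is read from the dataset's article field, and "output" from the text produced for that row (exposed here as generated_output, which we assume names the generator's output). The snippet below only illustrates that mapping; it is not the BasicDataMapper implementation.
# Illustration of the mapping semantics (not the SDK's implementation).
record = {
    "article": dataset[0]["article"],
    "generated_output": "JWST captured detailed images of the Pillars of Creation.",  # example model output
}
key_map = {"input": "article", "output": "generated_output"}
evaluator_inputs = {eval_key: record[source] for eval_key, source in key_map.items()}
# evaluator_inputs == {"input": <article text>, "output": <generated summary>}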

Step 7: Select the Optimizer

The optimizer defines how prompt variants are generated and evaluated. Future AGI supports multiple prompt optimization strategies, all accessible through the same workflow. A full, up-to-date overview of supported optimizers is available in the documentation. Click here to learn more. At a high level, commonly used optimizers include:
  • Random Search – fast baseline exploration
  • Bayesian Search – structured optimization for few-shot prompts
  • ProTeGi – targeted refinement for recurring failure patterns
  • Meta-Prompt – higher-level prompt rewrites
  • GEPA – evolutionary optimization for production-grade quality
Switching optimizers does not change the workflow.
from fi.opt.optimizers import RandomSearchOptimizer

optimizer = RandomSearchOptimizer(
    generator=generator,
    teacher_model="gpt-4o",
    num_variations=5
)
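Because the surrounding pipeline stays the same, trying another strategy only means constructing a different optimizer object. The commented sketch below shows the shape of such a swap; the class name and its arguments are placeholders, so consult the optimizer documentation for the exact constructors.
# Same workflow, different search strategy. The import and class below are
# placeholders (hypothetical names); check the agent-opt docs for the exact API.
# from fi.opt.optimizers import BayesianSearchOptimizer
# optimizer = BayesianSearchOptimizer(
#     generator=generator,
#     # ...strategy-specific settings
# )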

Step 8: Run Prompt Optimization

Execute the optimization process with all configured components.
result = optimizer.optimize(
    evaluator=evaluator,
    data_mapper=data_mapper,
    dataset=dataset
)
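The returned result object carries the outcome of the search. As a sketch, you would typically inspect the best-performing prompt and its score; the attribute names below (best_prompt, best_score) are assumptions, so verify them against the agent-opt documentation.
# Inspect the outcome. Attribute names are assumptions; check the SDK docs
# for the exact result schema.
print("Best prompt:", result.best_prompt)
print("Best score:", result.best_score)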
Once these steps are complete, Future AGI automatically handles:
  • Evaluation execution
  • Optimization loops
  • Experiment tracking
  • Prompt versioning
  • Result comparison and ranking

Conclusion

Prompt optimization becomes difficult when it is treated as an informal, intuition-driven activity. It becomes manageable when prompts are evaluated systematically and improved through explicit feedback loops. Future AGI removes the operational complexity by handling evaluation, iteration, comparison, and bookkeeping within the platform. What remains is a small set of explicit inputs and a single execution step. As a result, prompt optimization shifts from a research exercise to a routine engineering workflow that is repeatable, auditable, and easy to operate at scale.

FAQ

1. Do I need to write custom evaluation logic? No. Future AGI provides 60+ built-in evaluators and supports LLM-as-a-Judge patterns out of the box. Evaluation execution, scoring, and aggregation are handled by the platform.
2. Does switching optimizers require changing my workflow? No. The workflow remains the same. Switching optimizers changes a single configuration line; the dataset, evaluator, data mapping, and execution flow do not change.
3. Can we save the optimized prompt as a prompt template in the Future AGI platform? Yes. Using the prompt SDK, the output can be stored as a new template version and managed like any other prompt artifact. Click here to learn more.

Ready to Systematically Optimize Prompts?

Start incorporating prompt optimization into your production AI systems with Future AGI, which provides the evaluation and optimization infrastructure required to build reliable, explainable, and production-ready LLM applications. Click here to schedule a demo with us now!