1. Introduction
Prompt optimization appears simple: adjust instructions until outputs improve. In production, this approach consistently fails.

The first failure is the lack of evaluation baselines. Most teams do not have a stable, quantitative way to determine whether a prompt change is an improvement or a regression. Outputs are inspected manually, sampled inconsistently, and judged subjectively. Once behavior degrades, there is no reference point to diagnose why.

The second failure is reproducibility. Prompt changes are rarely versioned, benchmarked, or evaluated in a controlled manner. Improvements cannot be reliably reproduced across environments, team members, or time. As a result, prompt behavior becomes fragile and difficult to defend.

The third failure is iteration cost. Prompt refinement is performed through manual loops: edit, test a few examples, repeat. This process does not scale with dataset size, task complexity, or organizational velocity. As systems grow, iteration slows and confidence erodes.

The final failure is brittleness. Over time, prompts accumulate ad-hoc fixes for edge cases. Each fix introduces new interactions, making the prompt increasingly unstable. Small changes cause unexpected regressions, and prompt engineering devolves into reactive patching.

Prompt engineering relies on human intuition and local testing. This is sufficient for prototypes, but it breaks down when prompts must satisfy diverse inputs, strict correctness requirements, and cost constraints simultaneously. At that point, prompt behavior must be managed as a system, not as text.

2. Prompt Optimization as a First-Class Workflow
Even when prompt optimizers are available, using them requires stitching together evaluation logic, tracking prompt versions, comparing runs, and managing iteration manually. These steps are rarely standardized and are often handled through scripts, notebooks, or human judgment. As a result, optimization is slow, inconsistent, and difficult to repeat. Teams either stop optimizing or limit it to one-off experiments.

Future AGI removes this operational burden by making prompt optimization a built-in workflow rather than a custom system. Outputs are scored consistently, prompt versions and results are stored and comparable, optimization loops are handled by the platform, and improved prompts are ranked and returned automatically. This allows teams to focus on defining behavior and success criteria instead of building and maintaining optimization infrastructure.

Using Future AGI, prompt optimization is reduced to a small set of decisions:
- what behavior to evaluate (by creating a dataset)
- how success is measured (by defining an evaluator)
- how improvement is explored (by choosing an optimizer)
3. Prompt Optimization using Future AGI
This section defines all required components to run prompt optimization using Future AGI. Each step introduces one concrete object, explains its role briefly, and shows the code required.

Step 1: Install Required Packages
Install the agent-opt package to get started with prompt optimization.
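For example, with pip (assuming the package is published on PyPI under the name agent-opt):

```bash
pip install agent-opt
```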
Step 2: Set Environment Variables
These credentials are required to run evaluations and track optimization results in Future AGI. Click here to get your API keys.
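A minimal sketch, assuming the SDK reads credentials from environment variables named FI_API_KEY and FI_SECRET_KEY; confirm the exact variable names in the Future AGI documentation.

```python
import os

# Credentials from your Future AGI account.
# The variable names below are assumptions; use the names shown in the docs.
os.environ["FI_API_KEY"] = "your-api-key"
os.environ["FI_SECRET_KEY"] = "your-secret-key"
```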
Step 3: Prepare the Dataset
The dataset defines the inputs against which prompt performance will be evaluated.
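As an illustration, a small summarization dataset can be represented as a list of records. The field name article is a placeholder chosen for this example, not a required name.

```python
# Illustrative dataset for a summarization task.
# Field names are placeholders; align them with your evaluator inputs (Step 6).
dataset = [
    {"article": "The city council approved a new transit budget on Tuesday, "
                "adding three bus routes and extending service hours."},
    {"article": "Researchers reported a battery design that retains 90% of its "
                "capacity after 1,000 charge cycles in laboratory testing."},
]
```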
Step 4: Define the Prompt and Generator
Provide the initial prompt that will be optimized. The generator binds the prompt to a model configuration.
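A sketch of this step. The prompt is a plain template string; GeneratorConfig below is an illustrative stand-in for the generator object, not the actual agent-opt class, whose name and parameters may differ.

```python
from dataclasses import dataclass

@dataclass
class GeneratorConfig:
    # Stand-in for the generator: a prompt template bound to a model configuration.
    # The real agent-opt generator may use different names and parameters.
    prompt_template: str
    model: str
    temperature: float = 0.2

# Initial prompt to optimize. {article} is filled from each dataset record.
initial_prompt = (
    "Summarize the following article in two sentences, "
    "focusing on the key facts:\n\n{article}"
)

generator = GeneratorConfig(
    prompt_template=initial_prompt,
    model="gpt-4o-mini",  # assumed model identifier for this example
)
```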
Step 5: Configure the Evaluator
The evaluator defines how output quality is measured. It acts as the objective function for optimization.
We are using one of Future AGI’s built-in evals, summary_quality. Click here to learn what other built-in evals Future AGI offers.
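In Future AGI this step selects the built-in summary_quality eval. The stub below only illustrates the shape of the objective function, scoring each output between 0 and 1; it is not the built-in eval itself.

```python
def evaluate_summary(article: str, summary: str) -> float:
    """Illustrative objective function: score a candidate summary in [0, 1].

    In the real workflow this scoring is done by the built-in
    summary_quality eval; this placeholder only shows the interface.
    """
    if not summary:
        return 0.0
    # Toy heuristic for illustration: reward concise, non-empty summaries.
    return 1.0 if len(summary) < len(article) else 0.5
```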
Step 6: Configure the DataMapper
The DataMapper connects dataset fields to evaluator inputs.
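Conceptually, the mapping is a small dictionary that tells the evaluator where each of its inputs comes from. The key and field names below are assumptions for this example.

```python
# Illustrative DataMapper: connects dataset fields and model outputs to the
# names the evaluator expects. All names here are assumptions.
data_mapping = {
    "input": "article",        # evaluator input  <- dataset field
    "output": "model_output",  # evaluator output <- generated summary
}
```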
Step 7: Select the Optimizer
The optimizer defines how prompt variants are generated and evaluated. Future AGI supports multiple prompt optimization strategies, all accessible through the same workflow. A full, up-to-date overview of supported optimizers is available in the documentation. Click here to learn more. At a high level, commonly used optimizers include:
- Random Search – fast baseline exploration
- Bayesian Search – structured optimization for few-shot prompts
- ProTeGi – targeted refinement for recurring failure patterns
- Meta-Prompt – higher-level prompt rewrites
- GEPA – evolutionary optimization for production-grade quality
Switching optimizers does not change the workflow.
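To make the idea concrete, the sketch below shows a stand-in for the simplest strategy, random exploration of prompt variants; real optimizers such as ProTeGi or GEPA propose variants far more deliberately.

```python
import random

def propose_variants(prompt: str, n: int = 4) -> list[str]:
    """Illustrative optimizer stand-in: given the current prompt, propose variants to try."""
    styles = [
        "Be concise and factual.",
        "Avoid speculation; report only what the article states.",
        "Write for a general audience.",
        "Lead with the most important fact.",
    ]
    return [prompt + "\n\n" + random.choice(styles) for _ in range(n)]
```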
Step 8: Run Prompt Optimization
Execute the optimization process with all configured components.
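Building on the sketches from the previous steps (dataset, GeneratorConfig, evaluate_summary, data_mapping, propose_variants), the toy loop below shows how the components interact. In the actual workflow this loop, along with tracking and ranking, is handled by the platform; generate is a placeholder for a real model call.

```python
def generate(config: GeneratorConfig, article: str) -> str:
    # Placeholder for a real model call: a real generator would send
    # config.prompt_template.format(article=article) to config.model.
    return article[:80].rstrip() + "..."  # stub "summary" for illustration only


def run_optimization(dataset, base_prompt: str, rounds: int = 3):
    """Toy greedy search over prompt variants (stand-in for the platform loop)."""
    best_score, best_prompt = -1.0, base_prompt
    for _ in range(rounds):
        for candidate in propose_variants(best_prompt):
            config = GeneratorConfig(prompt_template=candidate, model=generator.model)
            scores = []
            for rec in dataset:
                article = rec[data_mapping["input"]]  # field lookup via the DataMapper sketch
                output = generate(config, article)
                scores.append(evaluate_summary(article, output))
            avg = sum(scores) / len(scores)
            if avg > best_score:
                best_score, best_prompt = avg, candidate
    return best_prompt, best_score


best_prompt, best_score = run_optimization(dataset, initial_prompt)
print(f"Best average score: {best_score:.2f}")
print(best_prompt)
```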
Once these steps are complete, Future AGI automatically handles:
- Evaluation execution
- Optimization loops
- Experiment tracking
- Prompt versioning
- Result comparison and ranking