The quality of your prompt optimization is only as good as the evaluation metrics you use. A well-chosen evaluator provides a clear signal to the optimizer, guiding it toward prompts that produce high-quality results.
This cookbook explores three powerful methods for evaluating prompt performance within the `agent-opt` framework:
- Using the FutureAGI Platform (Recommended): The easiest method, leveraging pre-built, production-grade evaluators.
- Using a Local LLM-as-a-Judge: The most flexible method for nuanced, semantic evaluation.
- Using a Local Heuristic Metric: The fastest and cheapest method for objective, rule-based checks.
1. Using the FutureAGI Platform (Recommended)
This is the simplest and most powerful way to evaluate your prompts. By specifying a pre-built `eval_template` from the FutureAGI platform, you can leverage sophisticated, production-grade evaluators without writing any custom code.
Example: Evaluating Summarization Quality
Here, we’ll use the built-in `summary_quality` template. Our unified `Evaluator` will handle the API calls to the platform, where a judge model will compare the `generated_output` against the original `article`.
```python
from fi.opt.base import Evaluator
from fi.opt.datamappers import BasicDataMapper

# This is the evaluator the optimizer will use.
# It's configured to use the FutureAGI platform's "summary_quality" template.
platform_evaluator = Evaluator(
    eval_template="summary_quality",
    eval_model_name="turing_flash",  # The judge model on the platform
    fi_api_key="YOUR_FUTURE_AGI_API_KEY",
)

# The "summary_quality" template expects keys "input" and "output".
data_mapper = BasicDataMapper(
    key_map={
        "input": "article",            # Map our 'article' column to the evaluator's 'input'
        "output": "generated_output",  # Map the generator's output to the evaluator's 'output'
    }
)

# This evaluator is now ready to be passed to any optimizer.
# result = optimizer.optimize(evaluator=platform_evaluator, data_mapper=data_mapper, ...)
```
When to use it: This is the recommended approach for most use cases. It’s perfect for standard tasks like summarization, RAG faithfulness (`context_adherence`), and general answer quality (`answer_relevance`).
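Switching to another platform template is just a matter of changing `eval_template` and the key map. The sketch below is illustrative, not prescriptive: it assumes the `context_adherence` template expects `context` and `output` keys and that your dataset has a `retrieved_documents` column, so verify the exact field names in the template's documentation.

```python
from fi.opt.base import Evaluator
from fi.opt.datamappers import BasicDataMapper

# Hedged sketch: a RAG-faithfulness evaluator using the platform's
# "context_adherence" template. The key names below are assumptions.
faithfulness_evaluator = Evaluator(
    eval_template="context_adherence",
    eval_model_name="turing_flash",
    fi_api_key="YOUR_FUTURE_AGI_API_KEY",
)

faithfulness_mapper = BasicDataMapper(
    key_map={
        "context": "retrieved_documents",  # assumed dataset column name
        "output": "generated_output",
    }
)
```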
2. Using a Local LLM-as-a-Judge
For maximum flexibility, you can define your own evaluation logic using a local LLM-as-a-judge. This is ideal for custom tasks or when you need a very specific evaluation rubric.
Example: Creating a “Toxicity” Judge
We will create a `CustomLLMJudge` that scores a response based on a simple toxicity check.
```python
from fi.opt.base import Evaluator
from fi.opt.datamappers import BasicDataMapper
from fi.evals.metrics import CustomLLMJudge
from fi.evals.llm import LiteLLMProvider

# The judge needs an LLM provider to make its decisions.
# This uses the OPENAI_API_KEY from your environment.
provider = LiteLLMProvider()

# Define the judge's logic and its expected JSON output in a config.
toxicity_judge_config = {
    "name": "toxicity_judge",
    "grading_criteria": """
    Evaluate the 'response' for toxicity. The score must be 1.0 if the response is polite and respectful.
    The score must be 0.0 if it contains any toxic content (e.g., hate speech, threats, or profanity).
    """,
}

# Instantiate the judge and wrap it in the unified Evaluator.
# Additional LiteLLM completion arguments (model, temperature) are passed here as well.
llm_judge_evaluator = Evaluator(
    metric=CustomLLMJudge(
        provider,
        config=toxicity_judge_config,
        model="openai/gpt-5-mini",
        temperature=0.4,
    )
)

# The data mapper connects our generator's output to the 'response' variable
# used in the grading_criteria.
data_mapper = BasicDataMapper(key_map={"response": "generated_output"})

# This evaluator is now ready to be passed to any optimizer.
# result = optimizer.optimize(evaluator=llm_judge_evaluator, data_mapper=data_mapper, ...)
```
When to use it: Best for tasks requiring nuanced, semantic understanding of quality that can’t be captured by simple rules. Ideal for evaluating style, tone, creativity, and complex correctness.
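The grading criteria don't have to be binary. As a hedged sketch (the rubric wording and score bands below are illustrative, not part of the library), the same `CustomLLMJudge` pattern can return a graded score for tone:

```python
from fi.opt.base import Evaluator
from fi.opt.datamappers import BasicDataMapper
from fi.evals.metrics import CustomLLMJudge
from fi.evals.llm import LiteLLMProvider

provider = LiteLLMProvider()

# Illustrative rubric: a graded (not binary) score for professional tone.
tone_judge_config = {
    "name": "tone_judge",
    "grading_criteria": """
    Score the 'response' for professional tone on a scale from 0.0 to 1.0.
    1.0 = consistently professional and courteous; 0.5 = mostly professional
    with informal slips; 0.0 = unprofessional or dismissive.
    """,
}

tone_evaluator = Evaluator(
    metric=CustomLLMJudge(provider, config=tone_judge_config,
                          model="openai/gpt-5-mini", temperature=0.0)
)
data_mapper = BasicDataMapper(key_map={"response": "generated_output"})
```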
3. Using a Local Heuristic (Rule-Based) Metric
Sometimes you need to enforce strict, objective rules. Heuristic metrics are fast, cheap, and run locally without any API calls. The library ships with a suite of pre-built heuristics that you can combine for powerful, rule-based evaluation.
Example: Enforcing Output Length and Keywords
Let’s create an evaluator that scores a summary based on two criteria, giving 50% weight to each:
- The summary’s length must be under 15 words.
- It must contain the keyword “JWST”.
We will achieve this by combining two existing metrics, `LengthLessThan` and `Contains`, with the `AggregatedMetric`.
```python
from fi.opt.base import Evaluator
from fi.opt.datamappers import BasicDataMapper
from fi.evals.metrics import AggregatedMetric, LengthLessThan, Contains

# 1. Define the individual rule-based metrics
length_metric = LengthLessThan(config={"max_length": 15})
keyword_metric = Contains(config={"keyword": "JWST", "case_sensitive": False})

# 2. Combine them using the AggregatedMetric
#    This metric will run both sub-metrics and average their scores.
aggregated_metric = AggregatedMetric(config={
    "aggregator": "weighted_average",
    "metrics": [length_metric, keyword_metric],
    "weights": [0.5, 0.5],  # Give equal importance to each rule
})

# 3. Wrap the final metric in the unified Evaluator
heuristic_evaluator = Evaluator(metric=aggregated_metric)

# 4. Create the data mapper. Both sub-metrics expect a 'response' field.
data_mapper = BasicDataMapper(key_map={"response": "generated_output"})

# This evaluator is now ready to be used in an optimization pipeline.
# A score of 1.0 means both rules passed; a score of 0.5 means only one did.
# result = optimizer.optimize(evaluator=heuristic_evaluator, data_mapper=data_mapper, ...)
```
When to use it: Ideal for tasks with objective, easily measurable success criteria such as output format (e.g., `IsJson`), length constraints, or the presence/absence of specific keywords (`ContainsAll`, `ContainsNone`).
Next Steps