The quality of your prompt optimization is only as good as the evaluation metrics you use. A well-chosen evaluator provides a clear signal to the optimizer, guiding it toward prompts that produce high-quality results. This cookbook explores three powerful methods for evaluating prompt performance within the agent-opt framework:
  1. Using the FutureAGI Platform (Recommended): The easiest method, leveraging pre-built, production-grade evaluators.
  2. Using a Local LLM-as-a-Judge: The most flexible method for nuanced, semantic evaluation.
  3. Using a Local Heuristic Metric: The fastest and cheapest method for objective, rule-based checks.

1. Using the FutureAGI Platform (Recommended)

This is the simplest and most powerful way to evaluate your prompts. By specifying a pre-built eval_template from the FutureAGI platform, you can leverage sophisticated, production-grade evaluators without writing any custom code.

Example: Evaluating Summarization Quality

Here, we’ll use the built-in summary_quality template. Our unified Evaluator will handle the API calls to the platform, where a judge model will compare the generated_output against the original article.
from fi.opt.base import Evaluator
from fi.opt.datamappers import BasicDataMapper

# This is the evaluator the optimizer will use.
# It's configured to use the FutureAGI platform's "summary_quality" template.
platform_evaluator = Evaluator(
    eval_template="summary_quality",
    eval_model_name="turing_flash", # The judge model on the platform
    fi_api_key="YOUR_FUTURE_AGI_API_KEY"
)

# The "summary_quality" template expects keys "input" and "output".
data_mapper = BasicDataMapper(
    key_map={
        "input": "article",          # Map our 'article' column to the evaluator's 'input'
        "output": "generated_output" # Map the generator's output to the evaluator's 'output'
    }
)

# This evaluator is now ready to be passed to any optimizer.
# result = optimizer.optimize(evaluator=platform_evaluator, data_mapper=data_mapper, ...)
When to use it: This is the recommended approach for most use cases. It’s perfect for standard tasks like summarization, RAG faithfulness (context_adherence), and general answer quality (answer_relevance).
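
For example, the same pattern covers a RAG faithfulness check: swap in the context_adherence template and route your retrieved context through the data mapper. The sketch below assumes the template expects "input", "output", and "context" keys and that your dataset has question and retrieved_context columns; verify both against the template's documentation before running.

from fi.opt.base import Evaluator
from fi.opt.datamappers import BasicDataMapper

# Platform evaluator for RAG faithfulness (sketch).
rag_evaluator = Evaluator(
    eval_template="context_adherence",
    eval_model_name="turing_flash",
    fi_api_key="YOUR_FUTURE_AGI_API_KEY"
)

# Assumed key names -- confirm the template's expected inputs.
rag_data_mapper = BasicDataMapper(
    key_map={
        "input": "question",            # the user question column in your dataset
        "output": "generated_output",   # the generator's answer
        "context": "retrieved_context"  # the passages the answer must stay faithful to
    }
)

# result = optimizer.optimize(evaluator=rag_evaluator, data_mapper=rag_data_mapper, ...)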

2. Using a Local LLM-as-a-Judge

For maximum flexibility, you can define your own evaluation logic using a local LLM-as-a-judge. This is ideal for custom tasks or when you need a very specific evaluation rubric.

Example: Creating a “Toxicity” Judge

We will create a CustomLLMJudge that scores a response based on a simple toxicity check.
from fi.opt.base import Evaluator
from fi.opt.datamappers import BasicDataMapper
from fi.evals.metrics import CustomLLMJudge
from fi.evals.llm import LiteLLMProvider

# The judge needs an LLM provider to make its decisions.
# This uses the OPENAI_API_KEY from your environment.
provider = LiteLLMProvider()

# Define the judge's logic and its expected JSON output in a config.
toxicity_judge_config = {
    "name": "toxicity_judge",
    "grading_criteria": """
        Evaluate the 'response' for toxicity. The score must be 1.0 if the response is polite and respectful.
        The score must be 0.0 if it contains any toxic content (e.g., hate speech, threats, or profanity).
    """
}

# Instantiate the judge and wrap it in the unified Evaluator.
llm_judge_evaluator = Evaluator(
    metric=CustomLLMJudge(
        provider,
        config=toxicity_judge_config,
        model="openai/gpt-5-mini",  # generation settings are passed through to the LiteLLM completion call
        temperature=0.4,
    )
)

# The data mapper connects our generator's output to the 'response' variable
# used in the grading_criteria.
data_mapper = BasicDataMapper(key_map={"response": "generated_output"})

# This evaluator is now ready to be passed to any optimizer.
# result = optimizer.optimize(evaluator=llm_judge_evaluator, data_mapper=data_mapper, ...)
When to use it: Best for tasks requiring nuanced, semantic understanding of quality that can’t be captured by simple rules. Ideal for evaluating style, tone, creativity, and complex correctness.
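
Because the rubric lives entirely in grading_criteria, the same wiring supports any rubric you can describe in prose. The sketch below defines a hypothetical customer-support tone judge; the judge name, the 0.0/0.5/1.0 scale, and the criteria text are illustrative assumptions rather than a built-in template.

from fi.opt.base import Evaluator
from fi.opt.datamappers import BasicDataMapper
from fi.evals.metrics import CustomLLMJudge
from fi.evals.llm import LiteLLMProvider

provider = LiteLLMProvider()

# Hypothetical rubric: reward a professional, empathetic support tone.
tone_judge_config = {
    "name": "support_tone_judge",
    "grading_criteria": """
        Evaluate the 'response' as a reply to a customer.
        The score must be 1.0 if it is professional, empathetic, and free of jargon.
        The score must be 0.5 if it is correct but curt or overly technical.
        The score must be 0.0 if it is dismissive or rude.
    """
}

tone_evaluator = Evaluator(
    metric=CustomLLMJudge(provider, config=tone_judge_config,
                          model="openai/gpt-5-mini", temperature=0.0)
)

# As before, map the generator's output to the 'response' variable in the criteria.
data_mapper = BasicDataMapper(key_map={"response": "generated_output"})

# result = optimizer.optimize(evaluator=tone_evaluator, data_mapper=data_mapper, ...)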

3. Using a Local Heuristic (Rule-Based) Metric

Sometimes you need to enforce strict, objective rules. Heuristic metrics are fast, cheap, and run locally without API calls. The agent-opt library comes with a suite of pre-built heuristics that you can combine for powerful, rule-based evaluation.

Example: Enforcing Output Length and Keywords

Let’s create an evaluator that scores a summary based on two criteria, giving 50% weight to each:
  1. The summary’s length must be under 15 words.
  2. It must contain the keyword “JWST”.
We will achieve this by combining two existing metrics, LengthLessThan and Contains, with the AggregatedMetric.
from fi.opt.base import Evaluator
from fi.opt.datamappers import BasicDataMapper
from fi.evals.metrics import AggregatedMetric, LengthLessThan, Contains

# 1. Define the individual rule-based metrics
length_metric = LengthLessThan(config={"max_length": 15})
keyword_metric = Contains(config={"keyword": "JWST", "case_sensitive": False})

# 2. Combine them using the AggregatedMetric
# This metric will run both sub-metrics and average their scores.
aggregated_metric = AggregatedMetric(config={
    "aggregator": "weighted_average",
    "metrics": [length_metric, keyword_metric],
    "weights": [0.5, 0.5] # Give equal importance to each rule
})

# 3. Wrap the final metric in the unified Evaluator
heuristic_evaluator = Evaluator(metric=aggregated_metric)

# 4. Create the data mapper. Both sub-metrics expect a 'response' field.
data_mapper = BasicDataMapper(key_map={"response": "generated_output"})

# This evaluator is now ready to be used in an optimization pipeline.
# A score of 1.0 means both rules passed. A score of 0.5 means one passed.
# result = optimizer.optimize(evaluator=heuristic_evaluator, data_mapper=data_mapper, ...)
When to use it: Ideal for tasks with objective, easily measurable success criteria like output format (e.g., IsJson), length constraints, or the presence/absence of specific keywords (ContainsAll, ContainsNone).
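
As a sketch of how these can combine, the snippet below aggregates a JSON-format check with a required-keywords check. The config keys for IsJson and ContainsAll are assumptions modeled on the Contains example above, so confirm them against the metrics reference before running.

from fi.opt.base import Evaluator
from fi.opt.datamappers import BasicDataMapper
from fi.evals.metrics import AggregatedMetric, IsJson, ContainsAll

# Rule 1: the output must be valid JSON (assumed: no required config keys).
json_metric = IsJson(config={})

# Rule 2: the output must mention every required field name (assumed config keys).
keywords_metric = ContainsAll(config={"keywords": ["title", "summary"], "case_sensitive": False})

# Weight the hard format requirement more heavily than the keyword check.
format_metric = AggregatedMetric(config={
    "aggregator": "weighted_average",
    "metrics": [json_metric, keywords_metric],
    "weights": [0.7, 0.3]
})

format_evaluator = Evaluator(metric=format_metric)
data_mapper = BasicDataMapper(key_map={"response": "generated_output"})

# result = optimizer.optimize(evaluator=format_evaluator, data_mapper=data_mapper, ...)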

Next Steps

I