Purpose of Aggregated Metric Eval

  • Provides a holistic evaluation by combining the strengths of different metrics, e.g., BLEU for lexical overlap, ROUGE for recall-oriented matching, and Levenshtein for edit similarity. Useful when no single metric captures every aspect of quality.
  • Supports custom weighting, allowing users to prioritize metrics according to their specific use case (e.g., prioritizing factual accuracy over phrasing style).

Aggregated Metric using Future AGI’s Python SDK

Click here to learn how to set up evaluation using the Python SDK.

Input & Configuration:

  Required Inputs:
  • response (str): Model-generated output to be evaluated.
  • expected_text (str or List[str]): One or more reference texts.

  Required Config:
  • metrics (List[EvalTemplate]): A list of metric objects from the evaluator classes, such as BLEUScore(), ROUGEScore(), etc.
  • metric_names (List[str]): Display names for each metric used; must match the length of metrics.
  • aggregator (str): Aggregation strategy. Options: "average" or "weighted_average".
  • weights (List[float]): Required if aggregator="weighted_average". Defines the relative importance of each metric; weights should sum to 1.
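
For orientation, here is a minimal configuration sketch that ties the parameters above together. The sentences and the 0.6/0.4 weights are illustrative placeholders; the metric classes are the same ones used in the full example below.

from fi.evals.metrics import BLEUScore, ROUGEScore, AggregatedMetric
from fi.testcases import TestCase

# Required inputs live on the test case; expected_text may be a str or a List[str].
test_case = TestCase(
    response="Paris is the capital of France.",
    expected_text=["Paris is the capital of France.", "The capital of France is Paris."]
)

# Required config is passed to AggregatedMetric.
metric = AggregatedMetric(config={
    "metrics": [BLEUScore(), ROUGEScore(config={"rouge_type": "rouge1"})],
    "metric_names": ["bleu", "rouge1"],    # must match the length of metrics
    "aggregator": "weighted_average",      # "average" or "weighted_average"
    "weights": [0.6, 0.4]                  # only needed for weighted_average; should sum to 1
})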

Parameter Options:

  aggregator:
  • "average": Takes the mean of the normalized metric scores.
  • "weighted_average": Takes a weighted mean of the normalized metric scores using the supplied weights (e.g., 0.7 for BLEU, 0.3 for ROUGE).

Output:

  • score (float): Aggregated score between 0 and 1.

Example:

from fi.evals.metrics import BLEUScore, ROUGEScore, LevenshteinDistance, AggregatedMetric
from fi.testcases import TestCase

# Test input
test_case = TestCase(
    response="The quick brown fox jumps over the lazy dog.",
    expected_text="quick brown fox jumps over the lazy dog."
)

# Instantiate metrics
bleu = BLEUScore()
rouge = ROUGEScore(config={"rouge_type": "rouge1"})
levenshtein = LevenshteinDistance()

# 1. Simple average
avg_metric = AggregatedMetric(config={
    "metrics": [bleu, rouge],
    "metric_names": ["bleu", "rouge1"],
    "aggregator": "average"
})

# 2. Weighted average (70% BLEU, 30% ROUGE)
weighted_metric = AggregatedMetric(config={
    "metrics": [bleu, rouge],
    "metric_names": ["bleu", "rouge1"],
    "aggregator": "weighted_average",
    "weights": [0.7, 0.3]
})

# 3. Average with BLEU, ROUGE, Levenshtein
combined_metric = AggregatedMetric(config={
    "metrics": [bleu, rouge, levenshtein],
    "metric_names": ["bleu", "rouge1", "levenshtein"],
    "aggregator": "average"
})

# Run evaluation
for label, metric in {
    "BLEU + ROUGE (Average)": avg_metric,
    "BLEU + ROUGE (Weighted)": weighted_metric,
    "BLEU + ROUGE + Levenshtein (Average)": combined_metric
}.items():
    result = metric.evaluate([test_case])
    score = result.eval_results[0].metrics[0].value
    print(f"\n{label}")
    print(f"Aggregated Score: {score:.4f}")

Output:

BLEU + ROUGE (Average)
Aggregated Score: 0.8761

BLEU + ROUGE (Weighted)
Aggregated Score: 0.8710

BLEU + ROUGE + Levenshtein (Average)
Aggregated Score: 0.6144

What if Aggregated Score is Low?

  • Inspect each metric's individual score to identify which one is dragging the aggregate down (see the sketch below).
  • Adjust the weights to better reflect the priorities of your use case.
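
To diagnose, each underlying metric can be run on its own, reusing the metrics and test case from the example above (a sketch that assumes the individual metric classes expose the same evaluate() interface as AggregatedMetric):

# Evaluate each metric individually to see which one pulls the aggregate down
for name, metric in {"bleu": bleu, "rouge1": rouge, "levenshtein": levenshtein}.items():
    result = metric.evaluate([test_case])
    print(f"{name}: {result.eval_results[0].metrics[0].value:.4f}")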