Purpose of Aggregated Metric Eval
- Provides a holistic evaluation by combining the strengths of different metrics (e.g., BLEU for lexical overlap, ROUGE for recall-oriented matching, and Levenshtein for edit similarity). Useful when no single metric captures all aspects of quality.
- Supports custom weighting, allowing users to prioritize metrics based on their specific use case (e.g., weighting factual accuracy over phrasing style).
Aggregated Metric using Future AGI’s Python SDK
Click here to learn how to set up evaluation using the Python SDK.

Input & Configuration:
| | Parameter | Type | Description |
|---|---|---|---|
| Required Inputs | `response` | `str` | Model-generated output to be evaluated. |
| | `expected_text` | `str` or `List[str]` | One or more reference texts. |
| Required Config | `metrics` | `List[EvalTemplate]` | A list of evaluator objects such as `BLEUScore()`, `ROUGEScore()`, etc. |
| | `metric_names` | `List[str]` | Display names for each metric; must match the length of `metrics`. |
| | `aggregator` | `str` | Aggregation strategy. Options: `"average"` or `"weighted_average"`. |
| | `weights` | `List[float]` | Required if `aggregator="weighted_average"`. Defines the relative importance of each metric (should sum to 1). |
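The sketch below shows how these inputs and config values could fit together in code. It is a minimal sketch, not a verbatim SDK example: the import paths, the `Evaluator` client, and its `evaluate()` method are assumptions (see the SDK setup guide linked above for the exact entry points); only the parameter names mirror the table above.

```python
# A minimal sketch, assuming an `AggregatedMetric` template plus an `Evaluator`
# client exposing `evaluate()`. Import paths and class/method names below are
# assumptions; confirm them against the SDK setup guide linked above.
from fi.evals import Evaluator                      # assumed import path
from fi.evals.templates import (                    # assumed import path
    AggregatedMetric,
    BLEUScore,
    ROUGEScore,
)

evaluator = Evaluator()  # assumed client; may require API credentials

# Required Config values, mirroring the table above.
template = AggregatedMetric(config={
    "metrics": [BLEUScore(), ROUGEScore()],   # evaluator objects to combine
    "metric_names": ["BLEU", "ROUGE"],        # must match the length of `metrics`
    "aggregator": "weighted_average",         # or "average"
    "weights": [0.7, 0.3],                    # required for "weighted_average"; should sum to 1
})

# Required Inputs: the model output and one or more reference texts.
result = evaluator.evaluate(
    eval_templates=[template],
    inputs={
        "response": "The cat sat on the mat.",
        "expected_text": ["A cat was sitting on the mat."],
    },
)
print(result)  # contains the aggregated `score`, a float between 0 and 1
```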
| `aggregator` | Description |
|---|---|
| `"average"` | Takes the unweighted mean of the normalized metric scores. |
| `"weighted_average"` | Takes a weighted mean using `weights` (e.g., 0.7 for BLEU, 0.3 for ROUGE). |
| Output Field | Type | Description |
|---|---|---|
| `score` | `float` | Aggregated score between 0 and 1. |
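If you want to act on the result in code, a simple pass/fail gate over the returned `score` field could look like the sketch below; the 0.7 threshold is an arbitrary example, not an SDK default.

```python
# `score` is the aggregated output field documented above: a float in [0, 1].
def passes_quality_gate(score: float, threshold: float = 0.7) -> bool:
    """Return True when the aggregated score clears the (example) threshold."""
    return score >= threshold

print(passes_quality_gate(0.74))  # True
print(passes_quality_gate(0.42))  # False
```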
What if the Aggregated Score is Low?
- Diagnose the individual metric outputs to see which metric is pulling the aggregate down (see the sketch below).
- Adjust `weights` to better reflect the priorities of your use case.
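As a quick diagnostic, you can compare each metric's contribution to the weighted aggregate once you have the individual normalized scores (hypothetical values below); a low score attached to a high weight is usually the culprit.

```python
# Hypothetical per-metric normalized scores and the configured weights.
metric_scores = {"BLEU": 0.40, "ROUGE": 0.80}
weights = {"BLEU": 0.7, "ROUGE": 0.3}

# Show how much each metric contributes to the weighted aggregate.
for name, score in metric_scores.items():
    contribution = weights[name] * score
    print(f"{name}: score={score:.2f}, weight={weights[name]}, contribution={contribution:.2f}")

aggregate = sum(weights[name] * score for name, score in metric_scores.items())
print(f"aggregated score = {aggregate:.2f}")  # 0.7*0.40 + 0.3*0.80 = 0.52

# Here BLEU's low score combined with its high weight drags the aggregate down:
# either improve lexical overlap in the response, or lower BLEU's weight if exact
# wording matters less for this use case.
```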