Eval Definition
ROUGE
Recall-oriented measurement of lexical overlap between generated text and reference text
Overview
ROUGE (Recall-Oriented Understudy for Gisting Evaluation) measures the number of overlapping n-grams between generated and reference text. Unlike the BLEU score, which focuses on precision, it emphasises recall as much as precision, and it reports the result as an F1-score, the harmonic mean of precision and recall, between 0 and 1.
ROUGE-N
- Measures n-gram overlap between generated and reference text
- ROUGE-1: unigram (single-word) overlap
- ROUGE-2: bigram (two-word) overlap
ROUGE-L (Longest Common Subsequence)
- The longest sequence of words that appears in both the generated and reference texts in the same order (not necessarily contiguously). For example, for the reference "the cat sat on the mat" and the candidate "the cat lay on the mat", the LCS is "the cat on the mat".
- More flexible than fixed n-gram matching, because it rewards preserved word order without requiring contiguous matches.
Calculation of ROUGE Scores
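For a candidate and a reference text, ROUGE-N precision is the number of overlapping n-grams divided by the total n-grams in the candidate, recall is the overlap divided by the total n-grams in the reference, and the reported F1-score is 2 * precision * recall / (precision + recall). ROUGE-L is computed the same way, with the length of the longest common subsequence taking the place of the n-gram overlap. The sketch below is illustrative only (not part of the SDK) and computes ROUGE-1 from scratch:

```python
from collections import Counter

def rouge_n(candidate: str, reference: str, n: int = 1):
    """Illustrative ROUGE-N: precision, recall and F1 over whitespace-tokenised n-grams."""
    def ngrams(text):
        tokens = text.lower().split()
        return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

    cand, ref = ngrams(candidate), ngrams(reference)
    # Clipped overlap: each n-gram counts only as often as it appears in both texts
    overlap = sum(min(count, ref[gram]) for gram, count in cand.items())
    precision = overlap / sum(cand.values()) if cand else 0.0
    recall = overlap / sum(ref.values()) if ref else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

print(rouge_n("the cat is on the mat", "the cat sat on the mat"))
# (0.833..., 0.833..., 0.833...)
```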
When to Use ROUGE?
- When recall is important, such as in summarisation tasks (did the model cover the important parts of the reference?)
- Prefer ROUGE-L when structure and ordering matter but exact phrasing can vary.
ROUGE Score Eval using Future AGI’s Python SDK
Click here to learn how to set up evaluation using the Python SDK.
Input & Configuration:
| | Parameter | Type | Description |
|---|---|---|---|
| Required Inputs | `response` | `str` | Model-generated output to be evaluated. |
| | `expected_text` | `str` | Reference text used for evaluation. |
| Optional Config | `rouge_type` | `str` | Variant of ROUGE to compute. Options: `"rouge1"` (default), `"rouge2"`, `"rougeL"`. |
| | `use_stemmer` | `bool` | Whether to apply stemming before comparison. Default: `True`. |
Parameter Options:
| Parameter: `rouge_type` | Description |
|---|---|
| `"rouge1"` | Compares unigrams (single words). |
| `"rouge2"` | Compares bigrams (two-word sequences). |
| `"rougeL"` | Uses the longest common subsequence (LCS). Captures fluency and structural similarity. |

| Parameter: `use_stemmer` | Description |
|---|---|
| `True` | Applies stemming (e.g., converts “running” to “run”). Helps tolerate morphological variation. |
| `False` | Uses raw tokens without normalisation. More strict comparison. |
Output:
| Output Field | Type | Description |
|---|---|---|
| `precision` | `float` | Fraction of predicted tokens that matched the reference. |
| `recall` | `float` | Fraction of reference tokens that were found in the prediction. |
| `fmeasure` | `float` | Harmonic mean of precision and recall. Represents the final ROUGE score. |
Example:
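The exact SDK call is covered in the setup guide linked above. As a minimal sketch of the same inputs and outputs, the snippet below uses the open-source `rouge-score` package (which implements these metrics) with the `rouge_type` and `use_stemmer` options described above; the example texts are illustrative:

```python
from rouge_score import rouge_scorer

response = "the cat is on the mat"        # model-generated output
expected_text = "the cat sat on the mat"  # reference text

# ROUGE-1 with stemming, mirroring rouge_type="rouge1", use_stemmer=True
scorer = rouge_scorer.RougeScorer(["rouge1"], use_stemmer=True)
scores = scorer.score(expected_text, response)  # (reference, prediction)
print(scores["rouge1"])
```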
Output:
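With the example texts above, five of the six unigrams in each text overlap, so precision, recall and F1 are all 5/6 (values rounded):

```
Score(precision=0.8333, recall=0.8333, fmeasure=0.8333)
```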
What if ROUGE Score is Low?
- Use `"rougeL"` if the phrasing of the generated text differs from the reference but the meaning is preserved.
- Apply `use_stemmer=True` to improve robustness to variation in word forms.
- Aggregate the ROUGE score with semantic evals like Embedding Similarity using Aggregated Metric to get a holistic comparison of the generated text with the reference text (see the sketch below).
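As an illustration of that last point, the following sketch blends a ROUGE-L F1 with an embedding cosine similarity into a single weighted score. The `rouge-score` and `sentence-transformers` packages, the model name, and the 50/50 weighting are all illustrative assumptions, not the SDK's Aggregated Metric:

```python
import numpy as np
from rouge_score import rouge_scorer
from sentence_transformers import SentenceTransformer

def aggregated_score(response: str, expected_text: str, rouge_weight: float = 0.5) -> float:
    """Weighted blend of a lexical (ROUGE-L) and a semantic (embedding) similarity score."""
    # Lexical overlap: ROUGE-L F1
    rouge_f1 = (
        rouge_scorer.RougeScorer(["rougeL"], use_stemmer=True)
        .score(expected_text, response)["rougeL"]
        .fmeasure
    )
    # Semantic similarity: cosine similarity of sentence embeddings (illustrative model choice)
    model = SentenceTransformer("all-MiniLM-L6-v2")
    emb = model.encode([response, expected_text])
    cosine = float(np.dot(emb[0], emb[1]) / (np.linalg.norm(emb[0]) * np.linalg.norm(emb[1])))
    return rouge_weight * rouge_f1 + (1 - rouge_weight) * cosine
```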