Overview

ROUGE (Recall-Oriented Understudy for Gisting Evaluation) measures the number of overlapping n-grams between generated and reference text. Unlike the BLEU score, which focuses on precision, it emphasises recall as much as precision, and results are typically reported as an F1-score, the harmonic mean of precision and recall, ranging from 0 to 1.


ROUGE-N

  • Measures n-gram overlap
  • ROUGE-1: unigram
  • ROUGE-2: bigram

ROUGE-L (Longest Common Subsequence)

  • The longest sequence of words that appears in both the generated and reference texts in the same order (not necessarily contiguously).
  • More tolerant of paraphrasing than fixed n-gram matching, since it rewards preserved word order without requiring contiguous matches.
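
For a concrete comparison, here is a minimal sketch assuming the open-source rouge-score package (pip install rouge-score), which is separate from the SDK described below: it scores one candidate against one reference with all three variants, and the LCS-based ROUGE-L credits the preserved word order even when few bigrams match exactly.

from rouge_score import rouge_scorer

# Score one candidate against one reference with three ROUGE variants.
scorer = rouge_scorer.RougeScorer(["rouge1", "rouge2", "rougeL"], use_stemmer=True)

reference = "The cat sat quietly on the warm mat"
candidate = "The cat rested on the mat"

# score(target, prediction): the first argument is the reference text.
scores = scorer.score(reference, candidate)

for name, score in scores.items():
    print(f"{name}: P={score.precision:.3f} R={score.recall:.3f} F1={score.fmeasure:.3f}")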

Calculation of ROUGE Scores

$$\text{Precision (P)} = \frac{\text{Number of overlapping units}}{\text{Total units in candidate}}$$

$$\text{Recall (R)} = \frac{\text{Number of overlapping units}}{\text{Total units in reference}}$$

$$\text{F1-score (F)} = \frac{2 \cdot P \cdot R}{P + R}$$
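
To make the formulas concrete, the following from-scratch sketch (plain Python, no external library) computes ROUGE-N precision, recall, and F1 by counting overlapping n-grams exactly as defined above.

from collections import Counter

def rouge_n(candidate: str, reference: str, n: int = 1):
    """Compute ROUGE-N precision, recall and F1 via clipped n-gram counts."""
    def ngrams(text):
        tokens = text.lower().split()
        return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

    cand, ref = ngrams(candidate), ngrams(reference)
    overlap = sum((cand & ref).values())              # number of overlapping units
    precision = overlap / max(sum(cand.values()), 1)  # overlap / total units in candidate
    recall = overlap / max(sum(ref.values()), 1)      # overlap / total units in reference
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

print(rouge_n("the cat sat on the mat", "the cat lay on the mat", n=1))
# -> (0.8333..., 0.8333..., 0.8333...)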

When to Use ROUGE?

  • When recall matters, such as in summarization tasks (did the model cover the important parts of the reference?)
  • Prefer ROUGE-L when structure and ordering matter but exact phrasing can vary.

ROUGE Score Eval using Future AGI’s Python SDK

Click here to learn how to set up evaluation using the Python SDK.

Input & Configuration:

Parameter        Type    Required?   Description
response         str     Required    Model-generated output to be evaluated.
expected_text    str     Required    Reference text used for evaluation.
rouge_type       str     Optional    Variant of ROUGE to compute. Options: "rouge1" (default), "rouge2", "rougeL".
use_stemmer      bool    Optional    Whether to apply stemming before comparison. Default: True.

Parameter Options:

rouge_type       Description
"rouge1"         Compares unigrams (single words).
"rouge2"         Compares bigrams (two-word sequences).
"rougeL"         Uses the longest common subsequence (LCS); captures fluency and structural similarity.

use_stemmer      Description
True             Applies stemming (e.g., converts “running” to “run”); tolerates morphological variation.
False            Uses raw tokens without normalisation; stricter comparison.
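
To see the effect of use_stemmer directly, the short sketch below (again assuming the open-source rouge-score package rather than the SDK) scores the same sentence pair with stemming on and off; “runs” and “running” only count as a match when stemming reduces both to “run”.

from rouge_score import rouge_scorer

reference = "The dog runs across the field"
candidate = "The dog is running across the field"

# Compare ROUGE-1 with and without stemming.
for use_stemmer in (True, False):
    scorer = rouge_scorer.RougeScorer(["rouge1"], use_stemmer=use_stemmer)
    score = scorer.score(reference, candidate)["rouge1"]
    print(f"use_stemmer={use_stemmer}: F1={score.fmeasure:.3f}")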

Output:

Output Field    Type    Description
precision       float   Fraction of predicted tokens that matched the reference.
recall          float   Fraction of reference tokens that were found in the prediction.
fmeasure        float   Harmonic mean of precision and recall; represents the final ROUGE score.

Example:

from fi.evals.metrics import ROUGEScore
from fi.testcases import TestCase

# Candidate output and reference text to compare.
test_case = TestCase(
    response="The quick brown fox jumps over the lazy dog",
    expected_text="The fox leaps over the sleepy dog"
)

# Configure the evaluator for ROUGE-2 with stemming enabled.
rouge_evaluator = ROUGEScore(config={
    "rouge_type": "rouge2",
    "use_stemmer": True
})

# Run the evaluation and print each returned metric.
result = rouge_evaluator.evaluate([test_case])
metrics = result.eval_results[0].metrics

for m in metrics:
    print(f"{m.id}: {m.value:.4f}")

Output:

rouge_rouge2_precision: 0.1250
rouge_rouge2_recall: 0.1667
rouge_rouge2_fmeasure: 0.1429
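
These numbers can be checked by hand: the candidate contains 8 bigrams and the reference 6, and only “over the” appears in both, so P = 1/8 = 0.125, R = 1/6 ≈ 0.1667, and F1 ≈ 0.1429. A plain-Python check (no SDK required):

# Verify the ROUGE-2 numbers above by counting bigram overlap directly.
def bigrams(text):
    tokens = text.lower().split()
    return {tuple(tokens[i:i + 2]) for i in range(len(tokens) - 1)}

cand = bigrams("The quick brown fox jumps over the lazy dog")
ref = bigrams("The fox leaps over the sleepy dog")
overlap = len(cand & ref)  # only ("over", "the") is shared -> 1

precision, recall = overlap / len(cand), overlap / len(ref)
f1 = 2 * precision * recall / (precision + recall)
print(f"{precision:.4f} {recall:.4f} {f1:.4f}")  # 0.1250 0.1667 0.1429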

What if ROUGE Score is Low?

  • Use "rougeL" if the phrasing of generated text is different but the meaning is preserved.
  • Apply use_stemmer=True to improve the robustness in word form variation.
  • Aggregate the ROUGE score with semantic evals like Embedding Similarity using Aggregated Metric to have a holistic view of comparing generated text with reference text.
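
As a generic illustration of the last point (this is not the SDK's Aggregated Metric; it assumes the open-source rouge-score and sentence-transformers packages and the all-MiniLM-L6-v2 model), the sketch below averages a ROUGE-L F1 score with an embedding cosine similarity to blend lexical and semantic comparison.

from rouge_score import rouge_scorer
from sentence_transformers import SentenceTransformer, util

response = "The quick brown fox jumps over the lazy dog"
expected = "The fox leaps over the sleepy dog"

# Lexical overlap: ROUGE-L F1.
scorer = rouge_scorer.RougeScorer(["rougeL"], use_stemmer=True)
lexical = scorer.score(expected, response)["rougeL"].fmeasure

# Semantic similarity: cosine similarity of sentence embeddings.
model = SentenceTransformer("all-MiniLM-L6-v2")
emb = model.encode([response, expected])
semantic = float(util.cos_sim(emb[0], emb[1]))

# Unweighted average as a simple aggregate; the weighting is a design choice.
aggregate = (lexical + semantic) / 2
print(f"rougeL={lexical:.3f} cosine={semantic:.3f} aggregate={aggregate:.3f}")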