Overview

ROUGE (Recall-Oriented Understudy for Gisting Evaluation) measures the number of overlapping n-grams between the generated text and the reference text. Unlike the BLEU score, which focuses on precision, ROUGE gives recall as much weight as precision: it reports the F1-score, the harmonic mean of precision and recall, which lies between 0 and 1.

ROUGE-N

  • Measures n-gram overlap
  • ROUGE-1: unigram
  • ROUGE-2: bigram
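
The n-gram overlap behind ROUGE-N can be sketched in a few lines of plain Python. This is a minimal illustration, not the SDK's implementation: it assumes lowercased whitespace tokenization, and the helper names `ngrams` and `ngram_overlap` are hypothetical.

```python
from collections import Counter


def ngrams(tokens, n):
    """Return all contiguous n-grams of a token list as tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def ngram_overlap(candidate, reference, n):
    """Count n-grams shared by candidate and reference (clipped per n-gram)."""
    # Assumption: simple lowercase + whitespace tokenization.
    cand = Counter(ngrams(candidate.lower().split(), n))
    ref = Counter(ngrams(reference.lower().split(), n))
    # Counter intersection keeps the minimum count of each shared n-gram,
    # so a repeated n-gram is only credited as often as it occurs in both texts.
    return sum((cand & ref).values())


candidate = "the cat sat on the mat"
reference = "the cat is on the mat"

print(ngram_overlap(candidate, reference, 1))  # ROUGE-1 overlap: 5 unigrams
print(ngram_overlap(candidate, reference, 2))  # ROUGE-2 overlap: 3 bigrams
```

The two texts share five of six unigrams but only three of five bigrams, which is why ROUGE-2 is usually the stricter of the two scores.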

ROUGE-L (Longest Common Subsequence)

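  • Scores the longest sequence of words that appears in both the generated and reference texts in the same order (not necessarily contiguously).
  • More flexible than fixed n-gram matching: matched words must keep their order but need not be adjacent, so no n-gram length has to be chosen in advance.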
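
The longest common subsequence can be computed with a standard dynamic-programming pass. The sketch below is illustrative only: it assumes whitespace tokenization, and the function name `lcs_length` is hypothetical rather than part of the SDK.

```python
def lcs_length(a, b):
    """Length of the longest common subsequence of two token lists."""
    # prev[j] is the LCS length of the tokens of `a` processed so far and b[:j].
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            if x == y:
                # Extend the best subsequence that ends before both tokens.
                curr.append(prev[j - 1] + 1)
            else:
                # Otherwise carry over the better of skipping x or skipping y.
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


reference = "the cat sat on the mat".split()

# Same order, one word changed: the subsequence "the cat on the mat" survives.
print(lcs_length("the cat is on the mat".split(), reference))  # 5

# Same six words, different order: only three words stay in sequence.
print(lcs_length("on the mat the cat sat".split(), reference))  # 3
```

The second call shows why ROUGE-L is sensitive to ordering: every unigram matches, yet the longest in-order match is only three words long.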

Calculation of ROUGE Scores

$$\hbox{Precision (P)} = { \hbox{Number of overlapping units} \over \hbox{Total units in candidate} }$$

$$\hbox{Recall (R)} = { \hbox{Number of overlapping units} \over \hbox{Total units in reference} }$$

$$\hbox{F1-score (F)} = { 2 \cdot P \cdot R \over P + R }$$
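
As a worked example of these three formulas, the sketch below turns raw counts into precision, recall, and F1. The function name `precision_recall_f1` and the counts used are hypothetical; the zero-division guards are an assumption about how empty inputs should be handled.

```python
def precision_recall_f1(overlap, candidate_total, reference_total):
    """Compute P, R and F1 from an overlap count and the two unit totals."""
    # Assumption: an empty candidate or reference scores 0 rather than raising.
    precision = overlap / candidate_total if candidate_total else 0.0
    recall = overlap / reference_total if reference_total else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Hypothetical counts: 3 overlapping units, 5 units in the candidate,
# 6 units in the reference.
p, r, f = precision_recall_f1(3, 5, 6)
print(f"P={p:.3f} R={r:.3f} F1={f:.3f}")  # P=0.600 R=0.500 F1=0.545
```

Because F1 is a harmonic mean, it sits closer to the lower of the two values, so a candidate cannot score well by maximizing only precision or only recall.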

When to Use ROUGE?

  • When recall matters, as in summarization tasks (did the model cover the important parts?)
  • Prefer ROUGE-L when structure and ordering matter but exact phrasing can vary.

Evaluation Using SDK

Click here to learn how to set up evaluation using the SDK.
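
Input & Configuration:

|                 | Parameter  | Type | Description                              |
|-----------------|------------|------|------------------------------------------|
| Required Inputs | reference  | str  | Reference text used for evaluation.      |
|                 | hypothesis | str  | Model-generated output to be evaluated.  |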
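
Output:

| Output Field | Type  | Description                                                              |
|--------------|-------|--------------------------------------------------------------------------|
| precision    | float | Fraction of predicted tokens that matched the reference.                 |
| recall       | float | Fraction of reference tokens that were found in the prediction.          |
| fmeasure     | float | Harmonic mean of precision and recall. Represents the final ROUGE score. |
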
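# `evaluator` is assumed to be an SDK evaluator that has already been
# configured as described in the setup guide.
result = evaluator.evaluate(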
    eval_templates="rouge_score",
    inputs={
        "reference": "The Eiffel Tower is a famous landmark in Paris, built in 1889 for the World's Fair. It stands 324 meters tall.",
        "hypothesis": "The Eiffel Tower, located in Paris, was built in 1889 and is 324 meters high."
    }
)

print(result.eval_results[0].output)
print(result.eval_results[0].reason)

What if ROUGE Score is Low?

  • Use "rougeL" if the phrasing of the generated text differs but the meaning is preserved.
  • Set use_stemmer=True to make the score more robust to word-form variation (for example, "run" vs. "running").
  • Combine the ROUGE score with semantic evals such as Embedding Similarity using Aggregated Metric to get a more holistic comparison of the generated text with the reference text.
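
To see why stemming can lift a low score, the sketch below compares unigram recall with and without a stemming step. It is only an illustration: `crude_stem` is a toy suffix stripper standing in for a real stemmer, not what use_stemmer=True applies in the SDK, and `unigram_recall` is a hypothetical helper.

```python
from collections import Counter


def crude_stem(token):
    """Toy stand-in for a real stemmer: strip a few common English suffixes."""
    for suffix in ("ing", "ed", "s"):
        # Keep at least three characters so short words are left alone.
        if token.endswith(suffix) and len(token) > len(suffix) + 2:
            return token[: -len(suffix)]
    return token


def unigram_recall(candidate, reference, stem=False):
    """Fraction of reference unigrams that also occur in the candidate."""
    cand = candidate.lower().split()
    ref = reference.lower().split()
    if stem:
        cand = [crude_stem(t) for t in cand]
        ref = [crude_stem(t) for t in ref]
    overlap = sum((Counter(cand) & Counter(ref)).values())
    return overlap / len(ref)


reference = "the dog walks in parks"
candidate = "the dogs walked in the park"

print(unigram_recall(candidate, reference))             # 0.4
print(unigram_recall(candidate, reference, stem=True))  # 1.0
```

Without stemming, "dogs", "walked", and "park" all count as misses against "dog", "walks", and "parks"; once both texts are reduced to stems, every reference word is matched.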