Overview
ROUGE (Recall-Oriented Understudy for Gisting Evaluation) measures the overlap of n-grams between generated and reference text. Unlike the BLEU score, which focuses on precision, it weighs recall as heavily as precision and reports the result as an F1-score, the harmonic mean of precision and recall, between 0 and 1.
ROUGE-N
- Measures n-gram overlap
- ROUGE-1: unigram overlap
- ROUGE-2: bigram overlap
ROUGE-L (Longest Common Subsequence)
- The longest sequence of words that appears in both the generated and reference texts in the same order (not necessarily contiguously).
- More flexible than fixed n-gram matching, because it rewards in-order matches without requiring them to be contiguous.
Calculation of ROUGE Scores
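As a minimal illustration (a sketch, not the SDK's internal code), ROUGE-1 for a single hypothesis/reference pair can be computed from clipped unigram overlap counts:

```python
# Toy ROUGE-1 calculation: count clipped unigram overlaps, then derive
# precision, recall, and their harmonic mean (the reported F1-score).
from collections import Counter

reference = "the cat sat on the mat".split()
hypothesis = "the cat lay on the mat".split()

# Clipped overlap: each reference token can only be matched as often as it occurs.
overlap = sum((Counter(hypothesis) & Counter(reference)).values())  # 5 shared unigrams

precision = overlap / len(hypothesis)                      # 5 / 6
recall = overlap / len(reference)                          # 5 / 6
fmeasure = 2 * precision * recall / (precision + recall)   # harmonic mean

print(round(precision, 3), round(recall, 3), round(fmeasure, 3))  # 0.833 0.833 0.833
```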
When to Use ROUGE?
- When recall matters, such as in summarization tasks (did the model cover the important parts?).
- Prefer ROUGE-L when structure and ordering matter but exact phrasing can vary.
Evaluation Using SDK
Click here to learn how to set up evaluation using the SDK.
Input & Configuration:
Parameter | Name | Type | Description
---|---|---|---
Required Inputs | reference | str | Reference text used for evaluation.
Required Inputs | hypothesis | str | Model-generated output to be evaluated.
Output Field | Type | Description
---|---|---
precision | float | Fraction of predicted tokens that matched the reference.
recall | float | Fraction of reference tokens that were found in the prediction.
fmeasure | float | Harmonic mean of precision and recall. Represents the final ROUGE score.
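These three output fields correspond to the per-metric precision, recall, and fmeasure values reported by, for example, Google's open-source `rouge_score` package; a minimal sketch assuming that package (the SDK may use a different implementation internally):

```python
# Sketch using the open-source `rouge_score` package (an assumption; not
# necessarily the SDK's own evaluator). Each metric yields a Score tuple
# with precision, recall, and fmeasure fields, matching the table above.
from rouge_score import rouge_scorer

scorer = rouge_scorer.RougeScorer(["rouge1", "rouge2", "rougeL"], use_stemmer=True)
scores = scorer.score(
    target="the cat sat on the mat",       # reference text
    prediction="the cat lay on the mat",   # model-generated hypothesis
)

for metric, s in scores.items():
    print(metric, round(s.precision, 3), round(s.recall, 3), round(s.fmeasure, 3))
```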
What if ROUGE Score is Low?
- Use `rougeL` if the phrasing of the generated text is different but the meaning is preserved.
- Apply `use_stemmer=True` to improve robustness to word-form variation.
- Aggregate the ROUGE score with semantic evals such as Embedding Similarity using Aggregated Metric for a holistic comparison of generated and reference text, as sketched below.
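A rough sketch of these tips, where a simple weighted average stands in for the Aggregated Metric and the embedding-similarity value is a placeholder for a real semantic eval (both are illustrative assumptions):

```python
# Illustrative only: ROUGE-L with stemming, then a hand-rolled aggregation
# with a placeholder semantic score (the SDK's Aggregated Metric may differ).
from rouge_score import rouge_scorer

reference_text = "the cat sat on the mat"
generated_text = "a cat was sitting on the mat"

scorer = rouge_scorer.RougeScorer(["rougeL"], use_stemmer=True)  # order-aware, stemmed
rouge_l_f1 = scorer.score(target=reference_text, prediction=generated_text)["rougeL"].fmeasure

embedding_similarity = 0.91  # placeholder for e.g. cosine similarity of embeddings
combined = 0.5 * rouge_l_f1 + 0.5 * embedding_similarity

print(round(rouge_l_f1, 3), round(combined, 3))
```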