Input

| Required Input | Type | Description |
|---|---|---|
| reference | string | The reference containing the information to be captured. |
| hypothesis | string | The content to be evaluated for recall against the reference. |
Output

| Field | Description |
|---|---|
| Result | Returns a score representing the recall of the hypothesis against the reference, where higher values indicate better recall. |
| Reason | Provides a detailed explanation of the recall evaluation. |
About ROUGE Score
Unlike the BLEU score, which focuses on precision, ROUGE (Recall-Oriented Understudy for Gisting Evaluation) emphasises recall as much as precision. Like BLEU, it measures the number of overlapping n-grams between the generated and reference text, but reports them as an F1-score: the harmonic mean of precision and recall, between 0 and 1.

ROUGE-N
- Measures n-gram overlap
- ROUGE-1: unigram
- ROUGE-2: bigram
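For reference, the standard ROUGE-N quantities can be written as follows (the notation here is illustrative, not specific to this evaluator):

$$
\text{Recall}_N = \frac{\#\,\text{overlapping } n\text{-grams}}{\#\,n\text{-grams in reference}}, \qquad
\text{Precision}_N = \frac{\#\,\text{overlapping } n\text{-grams}}{\#\,n\text{-grams in hypothesis}}, \qquad
F1_N = \frac{2 \cdot \text{Precision}_N \cdot \text{Recall}_N}{\text{Precision}_N + \text{Recall}_N}
$$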
ROUGE-L (Longest Common Subsequence)
- The longest sequence of words that appears in both the generated and reference texts in the same order (not necessarily contiguously).
- More flexible than fixed n-gram matching, since it rewards correct word order without requiring contiguous phrase matches.
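ROUGE-L follows the same pattern, with the longest common subsequence (LCS) length in place of n-gram overlap; most implementations report the balanced F-measure:

$$
R_{\text{lcs}} = \frac{\text{LCS}(\text{reference}, \text{hypothesis})}{\text{length of reference}}, \qquad
P_{\text{lcs}} = \frac{\text{LCS}(\text{reference}, \text{hypothesis})}{\text{length of hypothesis}}, \qquad
F_{\text{lcs}} = \frac{2 \cdot P_{\text{lcs}} \cdot R_{\text{lcs}}}{P_{\text{lcs}} + R_{\text{lcs}}}
$$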
Calculation of ROUGE Scores
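As a rough illustration of how these scores are computed in practice, here is a sketch using Google's open-source `rouge_score` package with made-up reference and hypothesis strings; it is not necessarily this evaluator's internal implementation.

```python
# Sketch: computing ROUGE-1, ROUGE-2, and ROUGE-L with the `rouge_score` package
# (pip install rouge-score). The strings below are illustrative only.
from rouge_score import rouge_scorer

reference = "The cat sat on the mat near the window."
hypothesis = "A cat was sitting on the mat by the window."

scorer = rouge_scorer.RougeScorer(["rouge1", "rouge2", "rougeL"], use_stemmer=False)
scores = scorer.score(reference, hypothesis)  # score(target, prediction)

for name, result in scores.items():
    # Each result exposes precision, recall, and fmeasure (the F1 that is reported).
    print(f"{name}: precision={result.precision:.3f} "
          f"recall={result.recall:.3f} f1={result.fmeasure:.3f}")
```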
When to Use ROUGE?
- When recall is important, such as in summarization tasks (did the model cover the important parts of the reference?)
- Prefer ROUGE-L when structure and ordering matter but exact phrasing can vary.
What if ROUGE Score is Low?
- Use `"rougeL"` if the phrasing of the generated text differs from the reference but the meaning is preserved.
- Apply `use_stemmer=True` to improve robustness to word-form variation (both options are shown in the sketch below).
- Aggregate the ROUGE score with semantic evals like Embedding Similarity using Aggregated Metric for a more holistic comparison of the generated text against the reference.
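A minimal sketch of the first two suggestions, again with the `rouge_score` package and hypothetical example strings:

```python
from rouge_score import rouge_scorer

reference = "The committee approved the proposed budget changes."
hypothesis = "The proposed budget change was approved by the committee."

# use_stemmer=True reduces words to their stems, so inflectional variants
# ("change" vs "changes") still count as matches; "rougeL" scores the longest
# common subsequence, rewarding shared word order without requiring
# contiguous phrases.
scorer = rouge_scorer.RougeScorer(["rouge2", "rougeL"], use_stemmer=True)
scores = scorer.score(reference, hypothesis)

print("ROUGE-2 F1:", round(scores["rouge2"].fmeasure, 3))
print("ROUGE-L F1:", round(scores["rougeL"].fmeasure, 3))
```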