ROUGE Score
Recall-oriented measurement of lexical overlap between a generated hypothesis and a reference text
```python
result = evaluator.evaluate(
    eval_templates="rouge_score",
    inputs={
        "reference": "The Eiffel Tower is a famous landmark in Paris, built in 1889 for the World's Fair. It stands 324 meters tall.",
        "hypothesis": "The Eiffel Tower, located in Paris, was built in 1889 and is 324 meters high."
    }
)
print(result.eval_results[0].output)
print(result.eval_results[0].reason)
```

```typescript
import { Evaluator, Templates } from "@future-agi/ai-evaluation";

const evaluator = new Evaluator();
const result = await evaluator.evaluate(
  "rouge_score",
  {
    reference: "The Eiffel Tower is a famous landmark in Paris, built in 1889 for the World's Fair. It stands 324 meters tall.",
    hypothesis: "The Eiffel Tower, located in Paris, was built in 1889 and is 324 meters high."
  },
  {
    modelName: "turing_flash",
  }
);

console.log(result);
```

Input

| Required Input | Type | Description |
|---|---|---|
| reference | string | The reference text containing the information to be captured. |
| hypothesis | string | The content to be evaluated for recall against the reference. |
Output

| Field | Description |
|---|---|
| Result | Returns a score representing the recall of the hypothesis against the reference, where higher values indicate better recall. |
| Reason | Provides a detailed explanation of the recall evaluation. |
About ROUGE Score
ROUGE (Recall-Oriented Understudy for Gisting Evaluation) measures the number of overlapping n-grams between generated and reference text. Unlike BLEU, which focuses on precision, ROUGE emphasises recall, though it is commonly reported as an F1-score: the harmonic mean of precision and recall, ranging between 0 and 1.
ROUGE-N
- Measures n-gram overlap
- ROUGE-1: unigram
- ROUGE-2: bigram
ROUGE-L (Longest Common Subsequence)
- The longest sequence of words that appears in both the generated and reference texts in the same order (not necessarily contiguously).
- More flexible than fixed n-gram matching, since it rewards correct word order without requiring contiguous matches.
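As an illustrative sketch (not the library's actual implementation), the longest common subsequence behind ROUGE-L can be computed with standard dynamic programming over token lists:

```python
def lcs_length(a: list[str], b: list[str]) -> int:
    # dp[i][j] = length of the LCS of a[:i] and b[:j]
    dp = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i, x in enumerate(a):
        for j, y in enumerate(b):
            if x == y:
                dp[i + 1][j + 1] = dp[i][j] + 1
            else:
                dp[i + 1][j + 1] = max(dp[i][j + 1], dp[i + 1][j])
    return dp[len(a)][len(b)]

ref = "the cat sat on the mat".split()
hyp = "the cat lay on a mat".split()
print(lcs_length(ref, hyp))  # → 4, for "the cat on ... mat"
```

Dividing the LCS length by the hypothesis length gives ROUGE-L precision, and by the reference length gives ROUGE-L recall.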
Calculation of ROUGE Scores
Precision (P) = (Number of overlapping units) / (Total units in candidate)

Recall (R) = (Number of overlapping units) / (Total units in reference)

F1-score (F) = (2 × P × R) / (P + R)
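The formulas above can be worked through for ROUGE-1 in a few lines of plain Python (a minimal sketch using clipped unigram counts, not the evaluator's internals):

```python
from collections import Counter

def rouge_1(reference: str, hypothesis: str) -> dict:
    ref, hyp = reference.split(), hypothesis.split()
    # Clipped overlap: each unigram counts at most as often as it
    # appears in either text (Counter intersection takes the minimum).
    overlap = sum((Counter(ref) & Counter(hyp)).values())
    p = overlap / len(hyp)                      # precision
    r = overlap / len(ref)                      # recall
    f1 = 2 * p * r / (p + r) if p + r else 0.0  # harmonic mean
    return {"precision": p, "recall": r, "f1": f1}

print(rouge_1("the cat sat on the mat", "the cat on the mat"))
```

Here every hypothesis unigram appears in the reference (P = 1.0), but "sat" is missed (R = 5/6 ≈ 0.83), giving F1 ≈ 0.91.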
When to Use ROUGE?
- When recall matters, such as in summarization (did the model cover the important parts of the reference?)
- Prefer ROUGE-L when structure and ordering matter but exact phrasing can vary.
What if ROUGE Score is Low?
- Use `"rougeL"` if the phrasing of the generated text differs but the meaning is preserved.
- Apply `use_stemmer=True` to improve robustness to word-form variation.
- Aggregate the ROUGE score with semantic evals like `Embedding Similarity` using `Aggregated Metric` for a more holistic comparison of the generated text against the reference.
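To see why stemming helps, here is a toy sketch with a deliberately naive stemmer (illustration only; real stemmers such as Porter's are more sophisticated):

```python
from collections import Counter

def crude_stem(token: str) -> str:
    # Naive illustration: strip a trailing "s" so "towers"/"tower"
    # and "stands"/"stand" match.
    return token[:-1] if token.endswith("s") and len(token) > 3 else token

def rouge1_recall(reference: str, hypothesis: str, stem: bool = False) -> float:
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if stem:
        ref = [crude_stem(t) for t in ref]
        hyp = [crude_stem(t) for t in hyp]
    overlap = sum((Counter(ref) & Counter(hyp)).values())
    return overlap / len(ref)

ref = "the towers stand tall"
hyp = "the tower stands tall"
print(rouge1_recall(ref, hyp))             # → 0.5 (only "the", "tall" match)
print(rouge1_recall(ref, hyp, stem=True))  # → 1.0 (word forms normalised)
```

With word forms normalised, the same meaning no longer scores as a near-miss, which is exactly the effect `use_stemmer=True` targets.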