ROUGE Score
Recall-oriented measurement of lexical overlap between a generated hypothesis and a reference text
```python
result = evaluator.evaluate(
    eval_templates="rouge_score",
    inputs={
        "reference": "The Eiffel Tower is a famous landmark in Paris, built in 1889 for the World's Fair. It stands 324 meters tall.",
        "hypothesis": "The Eiffel Tower, located in Paris, was built in 1889 and is 324 meters high."
    }
)
print(result.eval_results[0].output)
print(result.eval_results[0].reason)
```

```typescript
import { Evaluator, Templates } from "@future-agi/ai-evaluation";

const evaluator = new Evaluator();
const result = await evaluator.evaluate(
  "rouge_score",
  {
    reference: "The Eiffel Tower is a famous landmark in Paris, built in 1889 for the World's Fair. It stands 324 meters tall.",
    hypothesis: "The Eiffel Tower, located in Paris, was built in 1889 and is 324 meters high."
  },
  {
    modelName: "turing_flash",
  }
);

console.log(result);
```

Input

| Required Input | Type | Description |
|---|---|---|
| reference | string | The reference text containing the information to be captured. |
| hypothesis | string | The content to be evaluated for recall against the reference. |
Output

| Field | Description |
|---|---|
| Result | Returns a score representing the recall of the hypothesis against the reference, where higher values indicate better recall. |
| Reason | Provides a detailed explanation of the recall evaluation. |
About ROUGE Score
ROUGE (Recall-Oriented Understudy for Gisting Evaluation) measures the number of overlapping n-grams between generated and reference text. Unlike BLEU, which focuses on precision, ROUGE emphasises recall, though it is commonly reported as an F1-score: the harmonic mean of precision and recall, ranging between 0 and 1.
ROUGE-N
- Measures n-gram overlap
- ROUGE-1: unigram
- ROUGE-2: bigram
ROUGE-L (Longest Common Subsequence)
- The longest sequence of words that appears in both the generated and reference texts in the same order (not necessarily contiguously).
- More flexible than fixed n-gram matching, since it rewards correct word order without requiring contiguous matches.
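As an illustrative sketch (not the library's actual implementation), the longest common subsequence behind ROUGE-L can be computed with standard dynamic programming over token lists:

```python
def lcs_length(a: list[str], b: list[str]) -> int:
    # dp[i][j] = length of the LCS of a[:i] and b[:j]
    dp = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i, x in enumerate(a):
        for j, y in enumerate(b):
            if x == y:
                dp[i + 1][j + 1] = dp[i][j] + 1
            else:
                dp[i + 1][j + 1] = max(dp[i][j + 1], dp[i + 1][j])
    return dp[len(a)][len(b)]

ref = "the cat sat on the mat".split()
hyp = "the cat lay on a mat".split()
print(lcs_length(ref, hyp))  # → 4, for "the cat on ... mat"
```

Dividing the LCS length by the hypothesis length gives ROUGE-L precision, and by the reference length gives ROUGE-L recall.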
Calculation of ROUGE Scores
Precision (P) = (Number of overlapping units) / (Total units in candidate)

Recall (R) = (Number of overlapping units) / (Total units in reference)

F1-score (F) = (2 × P × R) / (P + R)
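The formulas above can be worked through for ROUGE-1 in a few lines of plain Python (a minimal sketch using clipped unigram counts, not the evaluator's internals):

```python
from collections import Counter

def rouge_1(reference: str, hypothesis: str) -> dict:
    ref, hyp = reference.split(), hypothesis.split()
    # Clipped overlap: each unigram counts at most as often as it
    # appears in either text (Counter intersection takes the minimum).
    overlap = sum((Counter(ref) & Counter(hyp)).values())
    p = overlap / len(hyp)                      # precision
    r = overlap / len(ref)                      # recall
    f1 = 2 * p * r / (p + r) if p + r else 0.0  # harmonic mean
    return {"precision": p, "recall": r, "f1": f1}

print(rouge_1("the cat sat on the mat", "the cat on the mat"))
```

Here every hypothesis unigram appears in the reference (P = 1.0), but "sat" is missed (R = 5/6 ≈ 0.83), giving F1 ≈ 0.91.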
When to Use ROUGE?
- When recall matters, such as in summarization (did the model cover the important parts of the reference?)
- Prefer ROUGE-L when structure and ordering matter but exact phrasing can vary.
What if ROUGE Score is Low?
- Use `"rougeL"` if the phrasing of the generated text differs but the meaning is preserved.
- Apply `use_stemmer=True` to improve robustness to word-form variation.
- Aggregate the ROUGE score with semantic evals like `Embedding Similarity` using `Aggregated Metric` for a more holistic comparison of the generated text against the reference.
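To see why stemming helps, here is a toy sketch with a deliberately naive stemmer (illustration only; real stemmers such as Porter's are more sophisticated):

```python
from collections import Counter

def crude_stem(token: str) -> str:
    # Naive illustration: strip a trailing "s" so "towers"/"tower"
    # and "stands"/"stand" match.
    return token[:-1] if token.endswith("s") and len(token) > 3 else token

def rouge1_recall(reference: str, hypothesis: str, stem: bool = False) -> float:
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if stem:
        ref = [crude_stem(t) for t in ref]
        hyp = [crude_stem(t) for t in hyp]
    overlap = sum((Counter(ref) & Counter(hyp)).values())
    return overlap / len(ref)

ref = "the towers stand tall"
hyp = "the tower stands tall"
print(rouge1_recall(ref, hyp))             # → 0.5 (only "the", "tall" match)
print(rouge1_recall(ref, hyp, stem=True))  # → 1.0 (word forms normalised)
```

With word forms normalised, the same meaning no longer scores as a near-miss, which is exactly the effect `use_stemmer=True` targets.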