result = evaluator.evaluate(
    eval_templates="rouge_score",
    inputs={
        "reference": "The Eiffel Tower is a famous landmark in Paris, built in 1889 for the World's Fair. It stands 324 meters tall.",
        "hypothesis": "The Eiffel Tower, located in Paris, was built in 1889 and is 324 meters high."
    }
)

print(result.eval_results[0].output)
print(result.eval_results[0].reason)
Input

Required Input    Type      Description
reference         string    The reference containing the information to be captured.
hypothesis        string    The content to be evaluated for recall against the reference.

Output

Field     Description
Result    Returns a score representing the recall of the hypothesis against the reference, where higher values indicate better recall.
Reason    Provides a detailed explanation of the recall evaluation.

About ROUGE Score

Unlike BLEU, which focuses on precision, ROUGE (Recall-Oriented Understudy for Gisting Evaluation) emphasises recall as well as precision. It also measures the number of overlapping n-grams between the generated and reference text, but reports them as an F1-score, the harmonic mean of precision and recall, ranging between 0 and 1.

ROUGE-N

  • Measures n-gram overlap between generated and reference text
  • ROUGE-1: unigram (single-word) overlap
  • ROUGE-2: bigram (word-pair) overlap
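The n-gram overlap behind ROUGE-N can be sketched in plain Python. This is a simplified illustration (whitespace tokenisation, no stemming or punctuation handling), not the library's implementation:

```python
from collections import Counter

def ngrams(tokens, n):
    """Return a Counter of n-grams (as tuples) for a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def rouge_n(reference, hypothesis, n=1):
    """Compute ROUGE-N precision, recall, and F1 via clipped n-gram overlap."""
    ref_ngrams = ngrams(reference.lower().split(), n)
    hyp_ngrams = ngrams(hypothesis.lower().split(), n)
    # Counter intersection clips each n-gram to its smaller count.
    overlap = sum((ref_ngrams & hyp_ngrams).values())
    precision = overlap / max(sum(hyp_ngrams.values()), 1)
    recall = overlap / max(sum(ref_ngrams.values()), 1)
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

# ROUGE-1 on a toy pair: 5 of 6 unigrams overlap in each direction.
p, r, f = rouge_n("the cat sat on the mat", "the cat is on the mat", n=1)
```

Passing `n=2` to the same function yields ROUGE-2 over bigrams.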

ROUGE-L (Longest Common Subsequence)

  • The longest sequence of words that appears in both the generated and reference texts in the same order (not necessarily contiguously).
  • More flexible than fixed n-gram matching, since it rewards in-order word matches even when the matched words are not adjacent.
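The longest-common-subsequence idea can be sketched with a standard dynamic-programming table. Again a simplified illustration (whitespace tokenisation only), not the library's implementation:

```python
def lcs_length(a, b):
    """Dynamic-programming length of the longest common subsequence."""
    dp = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i, x in enumerate(a, 1):
        for j, y in enumerate(b, 1):
            dp[i][j] = dp[i - 1][j - 1] + 1 if x == y else max(dp[i - 1][j], dp[i][j - 1])
    return dp[-1][-1]

def rouge_l(reference, hypothesis):
    """ROUGE-L precision, recall, and F1 from the LCS of the token sequences."""
    ref, hyp = reference.lower().split(), hypothesis.lower().split()
    lcs = lcs_length(ref, hyp)
    precision = lcs / max(len(hyp), 1)
    recall = lcs / max(len(ref), 1)
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

# The hypothesis drops "sat" but keeps the remaining words in order,
# so the LCS covers the entire hypothesis.
p, r, f = rouge_l("the cat sat on the mat", "the cat on the mat")
```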

Calculation of ROUGE Scores

$$\text{Precision (P)} = \frac{\text{Number of overlapping units}}{\text{Total units in candidate}}$$

$$\text{Recall (R)} = \frac{\text{Number of overlapping units}}{\text{Total units in reference}}$$

$$\text{F1-score (F)} = \frac{2 \cdot P \cdot R}{P + R}$$
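As a worked numeric example of these formulas, with hypothetical counts of 4 overlapping units, 5 units in the candidate, and 8 in the reference:

```python
# Hypothetical overlap counts, chosen only to illustrate the arithmetic.
overlap, candidate_units, reference_units = 4, 5, 8

precision = overlap / candidate_units                # 4/5 = 0.8
recall = overlap / reference_units                   # 4/8 = 0.5
f1 = 2 * precision * recall / (precision + recall)   # 0.8/1.3 ≈ 0.615
```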

When to Use ROUGE?

  • When recall matters, as in summarization tasks (did the model cover the important parts of the reference?)
  • Prefer ROUGE-L when structure and ordering matter but exact phrasing can vary.

What if ROUGE Score is Low?

  • Use "rougeL" if the phrasing of the generated text differs but the meaning is preserved.
  • Set use_stemmer=True to improve robustness to word-form variation (e.g. "run" vs. "running").
  • Combine the ROUGE score with semantic evals such as Embedding Similarity using Aggregated Metric, for a more holistic comparison of generated and reference text.
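A minimal sketch of such aggregation as a weighted mean, where the weights and both score values are illustrative placeholders rather than the library's Aggregated Metric:

```python
def aggregate_scores(scores, weights):
    """Weighted mean of several metric scores, each in [0, 1]."""
    assert len(scores) == len(weights)
    return sum(s * w for s, w in zip(scores, weights)) / sum(weights)

rouge_f1 = 0.46              # lexical overlap (example value)
embedding_similarity = 0.91  # semantic closeness (example value)

# Weighting semantics slightly above surface overlap is one plausible choice.
combined = aggregate_scores([rouge_f1, embedding_similarity], [0.4, 0.6])
```

A low ROUGE score paired with a high semantic score usually signals a paraphrase rather than a genuinely incomplete answer.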
