Overview

ROUGE (Recall-Oriented Understudy for Gisting Evaluation) measures the number of overlapping n-grams between generated and reference text. Unlike the BLEU score, which focuses on precision, it emphasises recall as much as precision, and results are typically reported as an F1-score, the harmonic mean of precision and recall, ranging from 0 to 1.


ROUGE-N

  • Measures n-gram overlap
  • ROUGE-1: unigram
  • ROUGE-2: bigram

ROUGE-L (Longest Common Subsequence)

  • The longest sequence of words that appears in both the generated and reference texts in the same order (not necessarily contiguously).
  • More tolerant of paraphrasing than fixed n-gram matching, since it rewards preserved word order without requiring contiguous matches.
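
For a concrete comparison, here is a minimal sketch assuming the open-source rouge-score package (pip install rouge-score), which is separate from the SDK described below: it scores one candidate against one reference with all three variants, and the LCS-based ROUGE-L credits the preserved word order even when few bigrams match exactly.

from rouge_score import rouge_scorer

# Score one candidate against one reference with three ROUGE variants.
scorer = rouge_scorer.RougeScorer(["rouge1", "rouge2", "rougeL"], use_stemmer=True)

reference = "The cat sat quietly on the warm mat"
candidate = "The cat rested on the mat"

# score(target, prediction): the first argument is the reference text.
scores = scorer.score(reference, candidate)

for name, score in scores.items():
    print(f"{name}: P={score.precision:.3f} R={score.recall:.3f} F1={score.fmeasure:.3f}")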

Calculation of ROUGE Scores

$$\text{Precision (P)} = \frac{\text{Number of overlapping units}}{\text{Total units in candidate}}$$

$$\text{Recall (R)} = \frac{\text{Number of overlapping units}}{\text{Total units in reference}}$$

$$\text{F1-score (F)} = \frac{2 \cdot P \cdot R}{P + R}$$
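
To make the formulas concrete, the following from-scratch sketch (plain Python, no external library) computes ROUGE-N precision, recall, and F1 by counting overlapping n-grams exactly as defined above.

from collections import Counter

def rouge_n(candidate: str, reference: str, n: int = 1):
    """Compute ROUGE-N precision, recall and F1 via clipped n-gram counts."""
    def ngrams(text):
        tokens = text.lower().split()
        return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

    cand, ref = ngrams(candidate), ngrams(reference)
    overlap = sum((cand & ref).values())              # number of overlapping units
    precision = overlap / max(sum(cand.values()), 1)  # overlap / total units in candidate
    recall = overlap / max(sum(ref.values()), 1)      # overlap / total units in reference
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

print(rouge_n("the cat sat on the mat", "the cat lay on the mat", n=1))
# -> (0.8333..., 0.8333..., 0.8333...)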

When to Use ROUGE?

  • When recall matters, such as in summarization tasks (did the model cover the important parts of the reference?)
  • Prefer ROUGE-L when structure and ordering matter but exact phrasing can vary.

ROUGE Score Eval using Future AGI’s Python SDK

Click here to learn how to set up evaluation using the Python SDK.

Input & Configuration:

Parameter        Type    Required?   Description
response         str     Required    Model-generated output to be evaluated.
expected_text    str     Required    Reference text used for evaluation.
rouge_type       str     Optional    Variant of ROUGE to compute. Options: "rouge1" (default), "rouge2", "rougeL".
use_stemmer      bool    Optional    Whether to apply stemming before comparison. Default: True.

Parameter Options:

rouge_type       Description
"rouge1"         Compares unigrams (single words).
"rouge2"         Compares bigrams (two-word sequences).
"rougeL"         Uses the longest common subsequence (LCS); captures fluency and structural similarity.

use_stemmer      Description
True             Applies stemming (e.g., converts “running” to “run”); tolerates morphological variation.
False            Uses raw tokens without normalisation; stricter comparison.
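
To see the effect of use_stemmer directly, the short sketch below (again assuming the open-source rouge-score package rather than the SDK) scores the same sentence pair with stemming on and off; “runs” and “running” only count as a match when stemming reduces both to “run”.

from rouge_score import rouge_scorer

reference = "The dog runs across the field"
candidate = "The dog is running across the field"

# Compare ROUGE-1 with and without stemming.
for use_stemmer in (True, False):
    scorer = rouge_scorer.RougeScorer(["rouge1"], use_stemmer=use_stemmer)
    score = scorer.score(reference, candidate)["rouge1"]
    print(f"use_stemmer={use_stemmer}: F1={score.fmeasure:.3f}")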

Output:

Output Field    Type    Description
precision       float   Fraction of predicted tokens that matched the reference.
recall          float   Fraction of reference tokens that were found in the prediction.
fmeasure        float   Harmonic mean of precision and recall; represents the final ROUGE score.

Example:

from fi.evals.metrics import ROUGEScore
from fi.testcases import TestCase

# Candidate output and reference text to compare.
test_case = TestCase(
    response="The quick brown fox jumps over the lazy dog",
    expected_text="The fox leaps over the sleepy dog"
)

# Configure the evaluator for ROUGE-2 with stemming enabled.
rouge_evaluator = ROUGEScore(config={
    "rouge_type": "rouge2",
    "use_stemmer": True
})

# Run the evaluation and print each returned metric.
result = rouge_evaluator.evaluate([test_case])
metrics = result.eval_results[0].metrics

for m in metrics:
    print(f"{m.id}: {m.value:.4f}")

Output:

rouge_rouge2_precision: 0.1250
rouge_rouge2_recall: 0.1667
rouge_rouge2_fmeasure: 0.1429
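
These numbers can be checked by hand: the candidate contains 8 bigrams and the reference 6, and only “over the” appears in both, so P = 1/8 = 0.125, R = 1/6 ≈ 0.1667, and F1 ≈ 0.1429. A plain-Python check (no SDK required):

# Verify the ROUGE-2 numbers above by counting bigram overlap directly.
def bigrams(text):
    tokens = text.lower().split()
    return {tuple(tokens[i:i + 2]) for i in range(len(tokens) - 1)}

cand = bigrams("The quick brown fox jumps over the lazy dog")
ref = bigrams("The fox leaps over the sleepy dog")
overlap = len(cand & ref)  # only ("over", "the") is shared -> 1

precision, recall = overlap / len(cand), overlap / len(ref)
f1 = 2 * precision * recall / (precision + recall)
print(f"{precision:.4f} {recall:.4f} {f1:.4f}")  # 0.1250 0.1667 0.1429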

What if ROUGE Score is Low?

  • Use "rougeL" if the phrasing of generated text is different but the meaning is preserved.
  • Apply use_stemmer=True to improve the robustness in word form variation.
  • Aggregate the ROUGE score with semantic evals like Embedding Similarity using Aggregated Metric to have a holistic view of comparing generated text with reference text.
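
As a generic illustration of the last point (this is not the SDK's Aggregated Metric; it assumes the open-source rouge-score and sentence-transformers packages and the all-MiniLM-L6-v2 model), the sketch below averages a ROUGE-L F1 score with an embedding cosine similarity to blend lexical and semantic comparison.

from rouge_score import rouge_scorer
from sentence_transformers import SentenceTransformer, util

response = "The quick brown fox jumps over the lazy dog"
expected = "The fox leaps over the sleepy dog"

# Lexical overlap: ROUGE-L F1.
scorer = rouge_scorer.RougeScorer(["rougeL"], use_stemmer=True)
lexical = scorer.score(expected, response)["rougeL"].fmeasure

# Semantic similarity: cosine similarity of sentence embeddings.
model = SentenceTransformer("all-MiniLM-L6-v2")
emb = model.encode([response, expected])
semantic = float(util.cos_sim(emb[0], emb[1]))

# Unweighted average as a simple aggregate; the weighting is a design choice.
aggregate = (lexical + semantic) / 2
print(f"rougeL={lexical:.3f} cosine={semantic:.3f} aggregate={aggregate:.3f}")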