Overview

BLEU (Bilingual Evaluation Understudy) is a lexical-level eval that measures how many contiguous sequences of words (n-grams) in the generated text are also present in the reference text. It gives a numeric score between 0 and 1 quantifying how closely the generated text matches the reference text: the higher the score, the more similar the generated text is to the reference.

n-gram

  • An n-gram is a contiguous sequence of “n” words in a row (see the sketch after this list).
  • For example, "The quick brown fox jumps over the lazy dog"
    • 1-grams (unigrams) are: "The", "quick", "brown", "fox"
    • 2-grams (bigrams) are: "The quick", "quick brown", "brown fox"
    • 3-grams (trigrams): "The quick brown", "quick brown fox"
    • … and so on
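To make the idea concrete, here is a minimal Python sketch of n-gram extraction; it assumes plain whitespace tokenisation, which is a simplification of what real BLEU implementations do.

def ngrams(text, n):
    """Return the contiguous n-grams of a whitespace-tokenised sentence."""
    tokens = text.split()
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

sentence = "The quick brown fox jumps over the lazy dog"
print(ngrams(sentence, 1))  # unigrams: ('The',), ('quick',), ('brown',), ...
print(ngrams(sentence, 2))  # bigrams: ('The', 'quick'), ('quick', 'brown'), ...
print(ngrams(sentence, 3))  # trigrams: ('The', 'quick', 'brown'), ...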

Modified n-gram Precision

  • To calculate precision, we could simply count the number of words (unigrams) in the generated text that also appear in the reference text and divide by the total number of words in the generated text. This can be misleading, especially when the generated text repeats words that happen to exist in the reference.
  • That's where modified n-gram precision comes in: the count of clipped matches divided by the total number of n-grams in the generated text. Clipping means restricting the count of each n-gram in the generated text to the maximum number of times it appears in the reference (see the sketch after this list).
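A short sketch of clipped counting, again assuming simple whitespace tokenisation, using the classic repeated-word example:

from collections import Counter

def modified_precision(generated, reference, n=1):
    """Clipped n-gram precision of `generated` against a single `reference`."""
    gen_tokens = generated.split()
    ref_tokens = reference.split()
    gen_ngrams = Counter(tuple(gen_tokens[i:i + n]) for i in range(len(gen_tokens) - n + 1))
    ref_ngrams = Counter(tuple(ref_tokens[i:i + n]) for i in range(len(ref_tokens) - n + 1))
    # Clip each generated n-gram count to its maximum count in the reference.
    clipped = sum(min(count, ref_ngrams[gram]) for gram, count in gen_ngrams.items())
    return clipped / max(sum(gen_ngrams.values()), 1)

# "the" appears 7 times in the generated text but is clipped to its 2 occurrences
# in the reference, so the modified unigram precision is 2/7 instead of 7/7.
print(modified_precision("the the the the the the the", "the cat is on the mat"))  # ≈ 0.29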

Calculation of BLEU Score

  • For each n-gram level, the modified n-gram precision is calculated: $P_1, P_2, \ldots, P_N$
  • To combine these individual scores, their geometric mean is taken. (The geometric mean is used because it is more sensitive to imbalances than the arithmetic mean: we want to penalise the score if precision is low at any n-gram level.)
  • The geometric mean of these scores in log form is written as: $\text{BLEU} = \exp\left( \sum_{n=1}^{N} w_n \cdot \log P_n \right)$
    where:
    • $P_n$: modified precision for n-gram level $n$ (e.g., unigrams, bigrams, ...)
    • $w_n$: weight assigned to n-gram level $n$ (usually equal, i.e. $w_n = 1/N$)
    • $\log P_n$: natural log, used to stabilize the product of small precision values
  • A brevity penalty is applied to penalise generated text that is shorter than the reference: $BP = \begin{cases} 1 & \text{if } c > r \\ e^{1 - \frac{r}{c}} & \text{if } c \leq r \end{cases}$ where $c$ is the length of the generated sentence and $r$ is the length of the reference sentence.
  • If the generated text is as long as or longer than the reference, BP = 1 (no penalty)
  • If the generated text is too short, BP < 1 (penalises the score)
  • So the final BLEU score comes out as: $\text{BLEU} = BP \cdot \exp\left( \sum_{n=1}^{N} w_n \cdot \log P_n \right)$ (a worked sketch follows below)
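For a concrete end-to-end computation, here is a hedged sketch using NLTK's BLEU implementation (geometric mean of $P_1 \ldots P_4$ with equal weights, plus the brevity penalty). This is an illustration only, not the internals of the evaluator described below, and it again uses plain whitespace tokenisation.

from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction

reference = "The Eiffel Tower is a famous landmark in Paris built in 1889".split()
hypothesis = "The Eiffel Tower located in Paris was built in 1889".split()

# Equal weights w_n = 1/4 over unigrams..4-grams; method1 smoothing avoids log(0)
# when a higher-order n-gram level has no matches.
score = sentence_bleu(
    [reference],                      # one or more tokenised reference texts
    hypothesis,                       # tokenised generated text
    weights=(0.25, 0.25, 0.25, 0.25),
    smoothing_function=SmoothingFunction().method1,
)
print(round(score, 3))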

Evaluation Using SDK

Click here to learn how to set up evaluation using the SDK.
Input & Configuration:

Parameter     Type                Description
reference     str or List[str]    One or more reference texts (required).
hypothesis    str                 Model-generated output to be evaluated (required).
Output:

Output Field    Type     Description
score           float    Score between 0 and 1. Higher values indicate greater lexical overlap.
# `evaluator` is assumed to be initialised as described in the SDK setup guide linked above.
result = evaluator.evaluate(
    eval_templates="bleu_score",
    inputs={
        "reference": "The Eiffel Tower is a famous landmark in Paris, built in 1889 for the World's Fair. It stands 324 meters tall.",
        "hypothesis": "The Eiffel Tower, located in Paris, was built in 1889 and is 324 meters high."
    }
)

print(result.eval_results[0].output)  # BLEU score between 0 and 1
print(result.eval_results[0].reason)  # explanation returned by the evaluator

What if BLEU Score is Low?

  • It could be because the model generates the correct meaning but with different word ordering, reducing higher-order n-gram precision. Try reducing max_n_gram in such cases.
  • Or it could be due to the generated text being shorter than the reference text, causing BP to reduce the score.
  • Aggregate the BLEU score with semantic evals like Embedding Similarity using Aggregated Metric for a more holistic comparison of generated and reference text.
  • Use a different smoothing method, such as smooth="method2" or smooth="method4", to mitigate the impact of zero matches at higher n-gram levels and obtain more stable scores for short or diverse outputs. Below is a guideline on which smoothing method to use in different scenarios (a small comparison sketch follows the table):
    Scenario                                                                      Suggested Smoothing Method
    Short outputs                                                                 method1 or method2
    High variance in phrasing                                                     method4 or method5
    Very strict evaluation                                                        method0 (no smoothing)
    General use                                                                   method1 (default) or method2 (balanced smoothing)
    Sparse references or low match rate (e.g., summaries)                         method3
    Mixed-length outputs with partial n-gram match                                method6
    Strictness early on, flexibility after the first break in match continuity    method7
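The method names above follow NLTK's SmoothingFunction naming. As an illustration (run directly with NLTK rather than through the SDK, and assuming the SDK's smooth parameter maps onto these methods), here is a minimal sketch comparing a few of them on a short, partially matching pair:

from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction

reference = ["the cat sits on the mat".split()]
hypothesis = "a cat sat on the mat".split()

smoother = SmoothingFunction()
for name in ["method0", "method1", "method2", "method4"]:
    # method0 applies no smoothing, so the zero 4-gram match drives the score to (effectively) 0;
    # the other methods keep the score non-zero and more stable for short outputs.
    score = sentence_bleu(reference, hypothesis, smoothing_function=getattr(smoother, name))
    print(f"{name}: {score:.3f}")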