Overview

BLEU (Bilingual Evaluation Understudy) is a lexical-level eval that measures how many contiguous sequences of words (n-grams) in the generated text are also present in the reference text. It gives a numeric score between 0 and 1 quantifying how closely the generated text matches the reference text. The higher the score, the more similar the generated text is to the reference text.


n-gram

  • An n-gram is a contiguous sequence of “n” words in a row.
  • For example, "The quick brown fox jumps over the lazy dog"
    • 1-grams (unigrams) are: "The", "quick", "brown", "fox"
    • 2-grams (bigrams) are: "The quick", "quick brown", "brown fox"
    • 3-grams (trigrams): "The quick brown", "quick brown fox"
    • … and so on
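
As an illustration, here is a minimal sketch of extracting n-grams from a sentence (plain Python, independent of the SDK; the helper name ngrams is just for this example):

def ngrams(text, n):
    # Split on whitespace and slide a window of n tokens across the sentence
    tokens = text.split()
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

print(ngrams("The quick brown fox jumps over the lazy dog", 2))
# ['The quick', 'quick brown', 'brown fox', 'fox jumps', 'jumps over', 'over the', 'the lazy', 'lazy dog']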

Modified n-gram Precision

  • To calculate precision, we could simply count the number of words (unigrams) in the generated text that also appear in the reference text and divide that count by the total number of words in the generated text. This can be misleading, especially when the generated text repeats words that happen to exist in the reference.
  • That is where modified n-gram precision comes in: the count of clipped matches divided by the total number of words in the generated text. Clipping means restricting the count of each word in the generated text to the maximum number of times it appears in the reference; the same idea extends to higher-order n-grams. The clipping step is shown in the sketch below.
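
A minimal sketch of modified unigram precision with clipping (plain Python, independent of the SDK; the function name is illustrative):

from collections import Counter

def modified_unigram_precision(generated, reference):
    gen_counts = Counter(generated.split())
    ref_counts = Counter(reference.split())
    # Clip each generated word's count to how often it appears in the reference
    clipped = sum(min(count, ref_counts[word]) for word, count in gen_counts.items())
    return clipped / sum(gen_counts.values())

# "the" is repeated four times, but its count is clipped to 2 (its count in the reference)
print(modified_unigram_precision("the the the the", "the cat is on the mat"))  # 0.5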

Calculation of BLEU Score

  • For each n-gram level n = 1 … N, the modified n-gram precision is calculated:

    P_1, P_2, \ldots, P_N
  • To combine these individual scores, their geometric mean is taken. (The geometric mean is used instead of the arithmetic mean because it is more sensitive to imbalances: we want to penalise the overall score if precision is low at any n-gram level.)

  • The geometric mean of these scores, written in log form, is:

    \text{BLEU} = \exp\left( \sum_{n=1}^{N} w_n \cdot \log P_n \right)

    where,
    P_n : Modified precision for n-gram level n (e.g., unigrams, bigrams, ...)
    w_n : Weight assigned to n-gram level n (usually equal, i.e. w_n = 1/N)
    \log P_n : Natural log, used to stabilize the product of small precision values

  • A brevity penalty (BP) is then applied so that generated text shorter than the reference does not score artificially high:

    BP = \begin{cases} 1 & \text{if } c > r \\ e^{1 - \frac{r}{c}} & \text{if } c \leq r \end{cases}

    where c is the length of the generated sentence and r is the length of the reference sentence.
  • If the generated text is as long as or longer than the reference, BP = 1 (no penalty)

  • If the generated text is too short, BP < 1 (penalises the score)

  • So the final BLEU score comes out as:

    \text{BLEU} = BP \cdot \exp\left( \sum_{n=1}^{N} w_n \cdot \log P_n \right)
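
Putting the pieces together, here is a minimal from-scratch sketch of the sentence-level BLEU formula above (plain Python, equal weights, no smoothing, single reference; the SDK example further down uses smoothing, so its result differs):

import math
from collections import Counter

def ngram_counts(tokens, n):
    # Count every contiguous n-gram in the token list
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def bleu(generated, reference, max_n=4):
    gen, ref = generated.split(), reference.split()
    weights = [1.0 / max_n] * max_n  # equal weights w_n = 1/N
    log_precisions = []
    for n in range(1, max_n + 1):
        gen_counts = ngram_counts(gen, n)
        ref_counts = ngram_counts(ref, n)
        # Modified precision: clipped matches / total n-grams in the generated text
        clipped = sum(min(c, ref_counts[g]) for g, c in gen_counts.items())
        total = sum(gen_counts.values())
        if clipped == 0 or total == 0:
            return 0.0  # no smoothing: zero precision at any level gives BLEU = 0
        log_precisions.append(math.log(clipped / total))
    c, r = len(gen), len(ref)
    bp = 1.0 if c > r else math.exp(1 - r / c)  # brevity penalty
    return bp * math.exp(sum(w * lp for w, lp in zip(weights, log_precisions)))

print(round(bleu("The quick brown fox jumps over the lazy dog",
                 "The fast brown fox leaps over the sleepy dog", max_n=2), 4))  # 0.4082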

BLEU Score Eval using Future AGI’s Python SDK

Click here to learn how to set up evaluation using the Python SDK.

Input & Configuration:

| | Parameter | Type | Description |
|---|---|---|---|
| Required Inputs | response | str | Model-generated output to be evaluated. |
| | expected_text | str or List[str] | One or more reference texts. |
| Optional Config | mode | str | Evaluation mode. Options: "sentence" (default), "corpus". |
| | max_n_gram | int | Maximum n-gram size. Default is 4 (evaluates unigrams up to 4-grams). Set to 1, 2, 3, or 4 depending on how deep the overlap comparison should go. |
| | weights | List[float] | Optional list of weights per n-gram. Must match max_n_gram in length and sum to 1.0. For example: [0.4, 0.3, 0.2, 0.1]. |
| | smooth | str | Smoothing strategy to avoid zero scores for high-order n-grams. Options: "method0", "method1", "method2", "method3", "method4", "method5", "method6", "method7". Default is "method1". |

Parameter Options:

| Parameter - mode | Description |
|---|---|
| "sentence" | When the reference text is provided as a single string. |
| "corpus" | When the reference text is provided as a list of strings. |

| Parameter - smooth | Description |
|---|---|
| method0 | No smoothing. If any n-gram has zero matches, the BLEU score becomes zero. Use only when evaluating long texts with consistent overlap. |
| method1 | Adds 1 to the numerator and denominator of the modified precision for each n-gram level. This prevents zero division and stabilizes the score, especially for short sentences. |
| method2 | Exponential decay smoothing. Adds a small value 1/(2^k) to the precision of n-grams with zero matches, where k is the n-gram order (e.g., 1/2 for bigrams, 1/4 for trigrams). This gently rewards the presence of lower-order matches. |
| method3 | Exponential decay with prior knowledge. Similar to method2, but adjusts the denominator as if one example of each n-gram had already been observed. This avoids zero precision and biases scores slightly higher. |
| method4 | Interpolated smoothing. Takes the average of the observed precision and an expected value (like a uniform distribution). Effective when reference corpora are small or diverse. |
| method5 | When higher-order matches are missing, the score backs off to lower-order n-gram matches. Designed for translation tasks with varied phrasing. |
| method6 | Geometric average smoothing. Uses the geometric mean of previous precisions to estimate missing ones. Designed to make the score degrade more gracefully. |
| method7 | Longest match-based smoothing. Applies smoothing only after the first occurrence of zero precision. Intended to avoid over-smoothing and retain high scores for strong candidates. |
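
The optional parameters above can be combined in one config. A sketch adapting the example shown later on this page (the reference sentences and parameter values are illustrative):

from fi.evals.metrics import BLEUScore
from fi.testcases import TestCase

# Corpus mode: expected_text is a list of reference strings
test_case = TestCase(
    response="The quick brown fox jumps over the lazy dog",
    expected_text=[
        "The fast brown fox leaps over the sleepy dog",
        "A quick brown fox jumps over a lazy dog"
    ]
)

bleu_evaluator = BLEUScore(config={
    "mode": "corpus",                  # reference text provided as a list of strings
    "max_n_gram": 4,
    "weights": [0.4, 0.3, 0.2, 0.1],   # must match max_n_gram in length and sum to 1.0
    "smooth": "method4"
})

result = bleu_evaluator.evaluate([test_case])
print(result.eval_results[0].metrics[0].value)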

Output:

| Output Field | Type | Description |
|---|---|---|
| score | float | Score between 0 and 1. Higher values indicate greater lexical overlap. |

Example:

from fi.evals.metrics import BLEUScore
from fi.testcases import TestCase

test_case = TestCase(
    response="The quick brown fox jumps over the lazy dog",
    expected_text="The fast brown fox leaps over the sleepy dog"
)

bleu_evaluator = BLEUScore(config={
    "mode": "sentence",
    "max_n_gram": 2,
    "smooth": "method2"
})

result = bleu_evaluator.evaluate([test_case])
print(f"{result.eval_results[0].metrics[0].value:.4f}")

Output:

0.4714

What if BLEU Score is Low?

  • It could be that the model generates the correct meaning but with different word ordering, which reduces higher-order n-gram precision. Try reducing max_n_gram in that case.

  • Or it could be that the generated text is shorter than the reference text, causing BP to reduce the score.

  • Aggregate the BLEU score with semantic evals like Embedding Similarity using Aggregated Metric to get a more holistic comparison of the generated text against the reference text.

  • Use a different smooth method, such as smooth="method2" or smooth="method4", to mitigate the impact of zero matches at higher n-gram levels and obtain more stable scores for short or diverse outputs. Below is a guideline on which smoothing method to use in different scenarios:

    | Scenario | Suggested Smoothing Method |
    |---|---|
    | Short outputs | method1 or method2 |
    | High variance in phrasing | method4 or method5 |
    | Very strict evaluation | method0 (no smoothing) |
    | General use | method1 (default) or method2 (balanced smoothing) |
    | Sparse references or low match rate (e.g., summaries) | method3 |
    | Mixed-length outputs with partial n-gram match | method6 |
    | When you want strictness early on, but flexibility only after the first break in match continuity | method7 |
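
Following the guidance above, these adjustments are just configuration changes. A sketch reusing the BLEUScore config keys documented earlier (the chosen values are illustrative):

from fi.evals.metrics import BLEUScore

# Lower max_n_gram to soften the impact of word-order differences,
# and pick a smoothing method suited to short or highly varied outputs
bleu_evaluator = BLEUScore(config={
    "mode": "sentence",
    "max_n_gram": 2,       # compare only unigram and bigram overlap
    "smooth": "method4"    # interpolated smoothing for high variance in phrasing
})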