Overview

BLEU (Bilingual Evaluation Understudy) is a lexical-level eval that measures how many contiguous sequences of words (n-grams) in the generated text are also present in the reference text. It gives a numeric score between 0 and 1 quantifying how closely the generated text matches the reference text: the higher the score, the more similar the generated text is to the reference.

n-gram

  • An n-gram is a contiguous sequence of “n” words in a row (see the sketch after this list).
  • For example, "The quick brown fox jumps over the lazy dog"
    • 1-grams (unigrams) are: "The", "quick", "brown", "fox"
    • 2-grams (bigrams) are: "The quick", "quick brown", "brown fox"
    • 3-grams (trigrams): "The quick brown", "quick brown fox"
    • … and so on
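To make the idea concrete, here is a minimal Python sketch of n-gram extraction; it assumes plain whitespace tokenisation, which is a simplification of what real BLEU implementations do.

def ngrams(text, n):
    """Return the contiguous n-grams of a whitespace-tokenised sentence."""
    tokens = text.split()
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

sentence = "The quick brown fox jumps over the lazy dog"
print(ngrams(sentence, 1))  # unigrams: ('The',), ('quick',), ('brown',), ...
print(ngrams(sentence, 2))  # bigrams: ('The', 'quick'), ('quick', 'brown'), ...
print(ngrams(sentence, 3))  # trigrams: ('The', 'quick', 'brown'), ...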

Modified n-gram Precision

  • To calculate precision, we could simply count the number of words (unigrams) in the generated text that also appear in the reference text and divide by the total number of words in the generated text. This can be misleading, especially when the generated text repeats words that happen to exist in the reference.
  • That's where modified n-gram precision comes in: the count of clipped matches divided by the total number of n-grams in the generated text. Clipping means restricting the count of each n-gram in the generated text to the maximum number of times it appears in the reference (see the sketch after this list).
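A short sketch of clipped counting, again assuming simple whitespace tokenisation, using the classic repeated-word example:

from collections import Counter

def modified_precision(generated, reference, n=1):
    """Clipped n-gram precision of `generated` against a single `reference`."""
    gen_tokens = generated.split()
    ref_tokens = reference.split()
    gen_ngrams = Counter(tuple(gen_tokens[i:i + n]) for i in range(len(gen_tokens) - n + 1))
    ref_ngrams = Counter(tuple(ref_tokens[i:i + n]) for i in range(len(ref_tokens) - n + 1))
    # Clip each generated n-gram count to its maximum count in the reference.
    clipped = sum(min(count, ref_ngrams[gram]) for gram, count in gen_ngrams.items())
    return clipped / max(sum(gen_ngrams.values()), 1)

# "the" appears 7 times in the generated text but is clipped to its 2 occurrences
# in the reference, so the modified unigram precision is 2/7 instead of 7/7.
print(modified_precision("the the the the the the the", "the cat is on the mat"))  # ≈ 0.29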

Calculation of BLEU Score

  • For each n-gram level, the modified n-gram precision is calculated: $P_1, P_2, \ldots, P_N$
  • To combine these individual scores, their geometric mean is taken. (The geometric mean is used because it is more sensitive to imbalances than the arithmetic mean: we want to penalise the score if precision is low at any n-gram level.)
  • The geometric mean of these scores in log form is written as: $\text{BLEU} = \exp\left( \sum_{n=1}^{N} w_n \cdot \log P_n \right)$
    where:
    • $P_n$: modified precision for n-gram level $n$ (e.g., unigrams, bigrams, ...)
    • $w_n$: weight assigned to n-gram level $n$ (usually equal, i.e. $w_n = 1/N$)
    • $\log P_n$: natural log, used to stabilize the product of small precision values
  • A brevity penalty is applied to penalise generated text that is shorter than the reference: $BP = \begin{cases} 1 & \text{if } c > r \\ e^{1 - \frac{r}{c}} & \text{if } c \leq r \end{cases}$ where $c$ is the length of the generated sentence and $r$ is the length of the reference sentence.
  • If the generated text is as long as or longer than the reference, BP = 1 (no penalty)
  • If the generated text is too short, BP < 1 (penalises the score)
  • So the final BLEU score comes out as: $\text{BLEU} = BP \cdot \exp\left( \sum_{n=1}^{N} w_n \cdot \log P_n \right)$ (a worked sketch follows below)
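For a concrete end-to-end computation, here is a hedged sketch using NLTK's BLEU implementation (geometric mean of $P_1 \ldots P_4$ with equal weights, plus the brevity penalty). This is an illustration only, not the internals of the evaluator described below, and it again uses plain whitespace tokenisation.

from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction

reference = "The Eiffel Tower is a famous landmark in Paris built in 1889".split()
hypothesis = "The Eiffel Tower located in Paris was built in 1889".split()

# Equal weights w_n = 1/4 over unigrams..4-grams; method1 smoothing avoids log(0)
# when a higher-order n-gram level has no matches.
score = sentence_bleu(
    [reference],                      # one or more tokenised reference texts
    hypothesis,                       # tokenised generated text
    weights=(0.25, 0.25, 0.25, 0.25),
    smoothing_function=SmoothingFunction().method1,
)
print(round(score, 3))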

Evaluation Using SDK

Click here to learn how to set up evaluation using the SDK.
Input & Configuration:

Parameter     Type                Description
reference     str or List[str]    One or more reference texts (required).
hypothesis    str                 Model-generated output to be evaluated (required).
Output:

Output Field    Type     Description
score           float    Score between 0 and 1. Higher values indicate greater lexical overlap.
# `evaluator` is assumed to be initialised as described in the SDK setup guide linked above.
result = evaluator.evaluate(
    eval_templates="bleu_score",
    inputs={
        "reference": "The Eiffel Tower is a famous landmark in Paris, built in 1889 for the World's Fair. It stands 324 meters tall.",
        "hypothesis": "The Eiffel Tower, located in Paris, was built in 1889 and is 324 meters high."
    }
)

print(result.eval_results[0].output)  # BLEU score between 0 and 1
print(result.eval_results[0].reason)  # explanation returned by the evaluator

What if BLEU Score is Low?

  • It could be because the model generates the correct meaning but with different word ordering, reducing higher-order n-gram precision. Try reducing max_n_gram in such cases.
  • Or it could be due to the generated text being shorter than the reference text, causing BP to reduce the score.
  • Aggregate the BLEU score with semantic evals like Embedding Similarity using Aggregated Metric for a more holistic comparison of generated and reference text.
  • Use a different smoothing method, such as smooth="method2" or smooth="method4", to mitigate the impact of zero matches at higher n-gram levels and obtain more stable scores for short or diverse outputs. Below is a guideline on which smoothing method to use in different scenarios (a small comparison sketch follows the table):
    Scenario                                                                      Suggested Smoothing Method
    Short outputs                                                                 method1 or method2
    High variance in phrasing                                                     method4 or method5
    Very strict evaluation                                                        method0 (no smoothing)
    General use                                                                   method1 (default) or method2 (balanced smoothing)
    Sparse references or low match rate (e.g., summaries)                         method3
    Mixed-length outputs with partial n-gram match                                method6
    Strictness early on, flexibility after the first break in match continuity    method7
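The method names above follow NLTK's SmoothingFunction naming. As an illustration (run directly with NLTK rather than through the SDK, and assuming the SDK's smooth parameter maps onto these methods), here is a minimal sketch comparing a few of them on a short, partially matching pair:

from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction

reference = ["the cat sits on the mat".split()]
hypothesis = "a cat sat on the mat".split()

smoother = SmoothingFunction()
for name in ["method0", "method1", "method2", "method4"]:
    # method0 applies no smoothing, so the zero 4-gram match drives the score to (effectively) 0;
    # the other methods keep the score non-zero and more stable for short outputs.
    score = sentence_bleu(reference, hypothesis, smoothing_function=getattr(smoother, name))
    print(f"{name}: {score:.3f}")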