BLEU
Measures n-gram overlap precision between the generated and reference text.
Overview
BLEU (Bilingual Evaluation Understudy) is a lexical-level eval that measures how many contiguous sequences of words (n-grams) in the generated text also appear in the reference text. It produces a numeric score between 0 and 1 that quantifies how closely the generated text matches the reference text: the higher the score, the more similar the generated text is to the reference.
n-gram
- An n-gram is a contiguous sequence of "n" words in a row.
- For example, in the sentence "The quick brown fox jumps over the lazy dog":
  - 1-grams (unigrams): "The", "quick", "brown", "fox", …
  - 2-grams (bigrams): "The quick", "quick brown", "brown fox", …
  - 3-grams (trigrams): "The quick brown", "quick brown fox", …
  - … and so on
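To make the idea concrete, here is a minimal Python sketch (independent of the SDK) that extracts the n-grams of a sentence:

```python
def ngrams(text: str, n: int) -> list[tuple[str, ...]]:
    """Return every contiguous n-gram in the text as a tuple of words."""
    words = text.split()
    return [tuple(words[i:i + n]) for i in range(len(words) - n + 1)]

sentence = "The quick brown fox jumps over the lazy dog"
print(ngrams(sentence, 2)[:3])  # [('The', 'quick'), ('quick', 'brown'), ('brown', 'fox')]
```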
Modified n-gram Precision
- To calculate precision, we could simply count the number of n-grams (e.g., unigrams) in the generated text that also appear in the reference text and divide by the total number of n-grams in the generated text. This can be misleading, especially when the generated text repeats words that happen to exist in the reference.
- That is where modified n-gram precision comes in: the count of clipped matches divided by the total number of n-grams in the generated text. Clipping means restricting the count of each n-gram in the generated text to the maximum number of times it appears in the reference (see the sketch below).
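To see why clipping matters, here is a minimal, SDK-independent sketch of modified unigram precision; the repeated-word candidate is the classic case where plain precision would misleadingly score 1.0:

```python
from collections import Counter

def modified_precision(generated: str, reference: str, n: int = 1) -> float:
    """Clipped n-gram precision: each generated n-gram is counted at most
    as many times as it appears in the reference."""
    def ngrams(text: str, n: int) -> list[tuple[str, ...]]:
        words = text.split()
        return [tuple(words[i:i + n]) for i in range(len(words) - n + 1)]

    gen_counts = Counter(ngrams(generated, n))
    ref_counts = Counter(ngrams(reference, n))
    clipped = sum(min(count, ref_counts[gram]) for gram, count in gen_counts.items())
    total = sum(gen_counts.values())
    return clipped / total if total else 0.0

# "the" appears 7 times in the candidate but only 2 times in the reference,
# so its count is clipped to 2: precision = 2 / 7 ≈ 0.29 instead of 1.0.
print(modified_precision("the the the the the the the", "the cat is on the mat"))
```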
Calculation of BLEU Score
- For each n-gram order $n$ (from 1 up to $N$), the modified n-gram precision $p_n$ is calculated.
- To combine these individual scores, their geometric mean is taken. (The geometric mean is used rather than the arithmetic mean because it is more sensitive to imbalance: we want to penalise the score if precision is low at any n-gram level.)
- The weighted geometric mean of these scores, written in log form, is:

  $$\exp\left(\sum_{n=1}^{N} w_n \log p_n\right)$$

  where $w_n$ is the weight given to n-gram order $n$ (uniform weights, $w_n = 1/N$, by default).

- A brevity penalty (BP) is then applied, based on the length $c$ of the generated text and the length $r$ of the reference:

  $$BP = \begin{cases} 1 & \text{if } c > r \\ e^{\,1 - r/c} & \text{if } c \le r \end{cases}$$

  - If the generated text is as long as or longer than the reference, BP = 1 (no penalty).
  - If the generated text is too short, BP < 1 (penalises the score).

- So the final BLEU score comes out as:

  $$\text{BLEU} = BP \cdot \exp\left(\sum_{n=1}^{N} w_n \log p_n\right)$$
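As an SDK-independent illustration of this combination step, here is a minimal sketch, assuming the per-order modified precisions and the text lengths have already been computed:

```python
import math

def combine_bleu(p_ns: list[float], gen_len: int, ref_len: int,
                 weights: list[float] | None = None) -> float:
    """Combine modified n-gram precisions p_1..p_N into a BLEU score
    using the weighted geometric mean and the brevity penalty."""
    n = len(p_ns)
    weights = weights or [1.0 / n] * n    # uniform weights by default
    if any(p == 0 for p in p_ns):         # without smoothing, any zero precision zeroes the score
        return 0.0
    log_mean = sum(w * math.log(p) for w, p in zip(weights, p_ns))
    bp = 1.0 if gen_len > ref_len else math.exp(1 - ref_len / gen_len)
    return bp * math.exp(log_mean)

# Precisions for 1- to 4-grams, with the generated text slightly shorter than the reference
print(combine_bleu([0.8, 0.6, 0.4, 0.3], gen_len=18, ref_len=20))  # ≈ 0.44
```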
BLEU Score Eval using Future AGI’s Python SDK
Click here to learn how to set up evaluation using the Python SDK.
Input & Configuration:
| | Parameter | Type | Description |
|---|---|---|---|
| Required Inputs | `response` | `str` | Model-generated output to be evaluated. |
| | `expected_text` | `str` or `List[str]` | One or more reference texts. |
| Optional Config | `mode` | `str` | Evaluation mode. Options: `"sentence"` (default), `"corpus"`. |
| | `max_n_gram` | `int` | Maximum n-gram size. Default is `4` (evaluates unigrams up to 4-grams). Set to `1`, `2`, `3`, or `4` depending on how deep the overlap comparison should go. |
| | `weights` | `List[float]` | Optional list of weights per n-gram. Must match `max_n_gram` in length and sum to 1.0. For example: `[0.4, 0.3, 0.2, 0.1]`. |
| | `smooth` | `str` | Smoothing strategy to avoid zero scores for high-order n-grams. Options: `"method0"`, `"method1"`, `"method2"`, `"method3"`, `"method4"`, `"method5"`, `"method6"`, `"method7"`. Default is `"method1"`. |
Parameter Options:
| `mode` | Description |
|---|---|
| `"sentence"` | When reference text is provided as a single string. |
| `"corpus"` | When reference text is provided as a list of strings. |
| `smooth` | Description |
|---|---|
| `method0` | No smoothing. If any n-gram has zero matches, the BLEU score becomes zero. Use only when evaluating long texts with consistent overlap. |
| `method1` | Adds 1 to the numerator and denominator of the modified precision for each n-gram level. This prevents zero division and stabilizes the score, especially for short sentences. |
| `method2` | Exponential decay smoothing. Adds a small value 1/(2^k) to the precision of n-grams where matches are zero, where k is the n-gram order (e.g., 1/2 for bigrams, 1/4 for trigrams). This gently rewards the presence of lower-order matches. |
| `method3` | Exponential decay with prior knowledge. Similar to method2, but adjusts the denominator as if we had already observed one example of each n-gram. This avoids zero precision and biases scores slightly higher. |
| `method4` | Interpolated smoothing. Takes the average of observed precision and an expected value (like a uniform distribution). Effective when reference corpora are small or diverse. |
| `method5` | When higher-order matches are missing, the score backs off to lower n-gram matches. Designed for translation tasks with varied phrasing. |
| `method6` | Geometric average smoothing. Uses the geometric mean of previous precisions to estimate missing ones. Designed to make the score degrade more gracefully. |
| `method7` | Longest match-based smoothing. Applies smoothing only after the first occurrence of zero precision. Intended to avoid over-smoothing and retain high scores for strong candidates. |
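These method names match the sentence-level smoothing techniques catalogued by Chen & Cherry (2014), which NLTK implements under the same names in its `SmoothingFunction` class. Whether the SDK relies on NLTK internally is an assumption, but the following sketch illustrates how the choice of method changes the score when higher-order matches are absent:

```python
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction

reference = [["the", "cat", "is", "on", "the", "mat"]]   # list of tokenised references
candidate = ["the", "cat", "the", "mat"]                 # no matching 3-grams or 4-grams

sf = SmoothingFunction()
for name in ("method0", "method1", "method2", "method4"):
    score = sentence_bleu(reference, candidate, smoothing_function=getattr(sf, name))
    # method0 collapses to 0.0 because two precisions are zero; the others stay small but non-zero
    print(f"{name}: {score:.4f}")
```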
Output:
| Output Field | Type | Description |
|---|---|---|
| `score` | `float` | Score between 0 and 1. Higher values indicate greater lexical overlap. |
Example:
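The snippet below is an illustrative sketch only: the import path and the `Evaluator` / `BleuScore` names are placeholders rather than confirmed SDK identifiers, so refer to the setup guide linked above for the actual imports and client configuration. The fields mirror the Input & Configuration table:

```python
# Illustrative sketch: "future_agi_evals", "Evaluator", and "BleuScore" are
# placeholder names, not confirmed SDK identifiers. See the setup guide for the
# real imports and authentication steps.
from future_agi_evals import Evaluator, BleuScore  # hypothetical import path

evaluator = Evaluator()  # hypothetical client, configured per the setup guide

result = evaluator.evaluate(
    BleuScore(
        mode="sentence",               # reference given as a single string
        max_n_gram=4,                  # compare unigrams up to 4-grams
        weights=[0.4, 0.3, 0.2, 0.1],  # must match max_n_gram and sum to 1.0
        smooth="method1",              # default smoothing
    ),
    response="The quick brown fox leaps over the lazy dog",
    expected_text="The quick brown fox jumps over the lazy dog",
)

print(result.score)
```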
Output:
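The result exposes the `score` field described in the Output table above: a float between 0 and 1, with near-identical response and reference texts scoring close to 1.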
What if BLEU Score is Low?
- It could be that the model generates the correct meaning with different word ordering, which reduces higher-order n-gram precision. Try reducing `max_n_gram` in such cases.
- Or it could be that the generated text is shorter than the reference text, causing the brevity penalty (BP) to reduce the score.
- Aggregate the BLEU score with semantic evals such as Embedding Similarity using Aggregated Metric to get a more holistic comparison of the generated text against the reference text.
- Use a different smoothing method, such as `smooth="method2"` or `smooth="method4"`, to mitigate the impact of zero matches at higher n-gram levels and obtain more stable scores for short or diverse outputs. Below is a guideline on which smoothing method to use in different scenarios:

| Scenario | Suggested Smoothing Method |
|---|---|
| Short outputs | `method1` or `method2` |
| High variance in phrasing | `method4` or `method5` |
| Very strict evaluation | `method0` (no smoothing) |
| General use | `method1` (default) or `method2` (balanced smoothing) |
| Sparse references or low match rate (e.g., summaries) | `method3` |
| Mixed-length outputs with partial n-gram match | `method6` |
| Strictness early on, with flexibility only after the first break in match continuity | `method7` |