Overview
BLEU (Bilingual Evaluation Understudy) is a lexical-level eval that measures how many contiguous sequences of words (n-grams) in the generated text are also present in the reference text. It produces a numeric score between 0 and 1 quantifying how closely the generated text matches the reference text. The higher the score, the more similar the generated text is to the reference text.
n-gram
- An n-gram is a contiguous sequence of "n" words in a row.
- For example, given the sentence "The quick brown fox jumps over the lazy dog":
  - 1-grams (unigrams): "The", "quick", "brown", "fox", …
  - 2-grams (bigrams): "The quick", "quick brown", "brown fox", …
  - 3-grams (trigrams): "The quick brown", "quick brown fox", …
  - … and so on (a small snippet for extracting these follows below)
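For a quick sanity check, the n-grams above can be listed with a few lines of plain Python (illustrative only, not part of the SDK):

```python
def ngrams(sentence: str, n: int) -> list[tuple[str, ...]]:
    """Return all contiguous n-word sequences from a whitespace-tokenized sentence."""
    tokens = sentence.split()
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

sentence = "The quick brown fox jumps over the lazy dog"
print(ngrams(sentence, 2))  # [('The', 'quick'), ('quick', 'brown'), ('brown', 'fox'), ...]
```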
Modified n-gram Precision
- To calculate precision, we can simply count the number of words (unigrams) in the generated text that also appear in the reference text and divide by the total number of words in the generated text. This can be misleading, especially when the generated text repeats words that happen to exist in the reference.
- That is where modified n-gram precision comes in: the count of clipped matches divided by the total number of n-grams in the generated text. Clipping means restricting the count of each n-gram in the generated text to the maximum number of times it appears in the reference (see the sketch below).
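To make the clipping step concrete, below is a minimal Python sketch of modified unigram precision for a single reference. The function name and whitespace tokenization are illustrative assumptions, not the SDK's implementation:

```python
from collections import Counter

def modified_precision(generated: str, reference: str, n: int = 1) -> float:
    """Clipped n-gram precision against a single reference (illustrative sketch)."""
    def ngrams(text, n):
        tokens = text.lower().split()
        return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

    gen_counts = Counter(ngrams(generated, n))
    ref_counts = Counter(ngrams(reference, n))

    # Clip each generated n-gram count to its count in the reference.
    clipped = sum(min(count, ref_counts[ng]) for ng, count in gen_counts.items())
    total = sum(gen_counts.values())
    return clipped / total if total else 0.0

# "the the the the" would score 1.0 with plain precision, but clipping caps "the" at 2.
print(modified_precision("the the the the", "the cat is on the mat"))  # 0.5
```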
Calculation of BLEU Score
- For each n-gram order (typically n = 1 to 4), the modified n-gram precision $p_n$ is calculated.
- To combine these individual scores, their geometric mean is taken. (The geometric mean is used rather than the arithmetic mean because it is more sensitive to imbalances: a low score at any single n-gram level pulls the combined score down.)
- The geometric mean of these scores in log form is written as:

$$\log \text{GM} = \sum_{n=1}^{N} w_n \log p_n, \quad \text{typically with } w_n = \frac{1}{N}$$
- A brevity penalty (BP) is then applied so that short outputs cannot score well on precision alone:

$$BP = \begin{cases} 1 & \text{if } c \ge r \\ e^{\,1 - r/c} & \text{if } c < r \end{cases}$$

where $c$ is the length of the generated text and $r$ is the length of the reference.

- If the generated text is as long as or longer than the reference, BP = 1 (no penalty)
- If the generated text is too short, BP < 1 (penalises the score)
- So the final BLEU score comes out as:

$$\text{BLEU} = BP \cdot \exp\left(\sum_{n=1}^{N} w_n \log p_n\right)$$
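Putting the pieces together, here is a compact from-scratch sketch of the whole calculation (clipped precisions, geometric mean in log form, and brevity penalty). It is a simplified illustration with whitespace tokenization and a single reference, not the SDK's implementation:

```python
import math
from collections import Counter

def ngrams(tokens, n):
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def bleu(generated: str, reference: str, max_n: int = 4) -> float:
    gen, ref = generated.lower().split(), reference.lower().split()

    log_precision_sum = 0.0
    for n in range(1, max_n + 1):
        gen_counts = Counter(ngrams(gen, n))
        ref_counts = Counter(ngrams(ref, n))
        clipped = sum(min(c, ref_counts[g]) for g, c in gen_counts.items())
        total = sum(gen_counts.values())
        if clipped == 0 or total == 0:
            return 0.0  # no smoothing here: a zero precision zeroes the whole score
        log_precision_sum += (1.0 / max_n) * math.log(clipped / total)

    # Brevity penalty: generated text shorter than the reference is penalised.
    bp = 1.0 if len(gen) >= len(ref) else math.exp(1 - len(ref) / len(gen))
    return bp * math.exp(log_precision_sum)

print(round(bleu("the quick brown fox jumps over the lazy dog",
                 "the quick brown fox jumps over a lazy dog"), 3))  # ≈ 0.661
```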
Evaluation Using SDK
Click here to learn how to set up evaluation using the SDK.

Input & Configuration:
|  | Parameter | Type | Description |
|---|---|---|---|
| Required Inputs | hypothesis | str | Model-generated output to be evaluated. |
|  | reference | str or List[str] | One or more reference texts. |

| Output Field | Type | Description |
|---|---|---|
| score | float | Score between 0 and 1. Higher values indicate greater lexical overlap. |
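For reference, the same hypothesis/reference roles can be exercised with NLTK's BLEU implementation; this is NLTK's API shown purely for illustration, not the SDK's function signature:

```python
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction

hypothesis = "the quick brown fox jumps over the lazy dog"
references = ["the quick brown fox jumps over a lazy dog"]

# NLTK expects tokenized input: a list of reference token lists plus hypothesis tokens.
score = sentence_bleu(
    [ref.split() for ref in references],
    hypothesis.split(),
    smoothing_function=SmoothingFunction().method1,
)
print(round(score, 3))  # float between 0 and 1; higher means more lexical overlap
```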
What if BLEU Score is Low?
- It could be that the model generates the correct meaning but with different word ordering, which reduces higher-order n-gram precision. Try reducing `max_n_gram` in such cases (see the sketch after the smoothing table below).
- Or the generated text may be shorter than the reference text, causing BP to reduce the score.
- Aggregate the BLEU score with semantic evals like Embedding Similarity using Aggregated Metric for a more holistic comparison of the generated text against the reference text.
- Use a different smoothing method, such as `smooth="method2"` or `smooth="method4"`, to mitigate the impact of zero matches at higher n-gram levels and obtain more stable scores for short or diverse outputs. Below is a guideline on which smoothing method to use in different scenarios (an NLTK-based sketch follows the table):

| Scenario | Suggested Smoothing Method |
|---|---|
| Short outputs | method1 or method2 |
| High variance in phrasing | method4 or method5 |
| Very strict evaluation | method0 (no smoothing) |
| General use | method1 (default) or method2 (balanced smoothing) |
| Sparse references or low match rate (e.g., summaries) | method3 |
| Mixed-length outputs with partial n-gram match | method6 |
| When you want strictness early on, but flexibility only after the first break in match continuity | method7 |
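As a rough illustration of both levers discussed above, the snippet below uses NLTK, whose `weights` and `SmoothingFunction` correspond only conceptually to the SDK's `max_n_gram` and `smooth` parameters:

```python
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction

reference = "the quick brown fox jumps over a lazy dog".split()
# A paraphrase with reordered words: unigrams mostly match, higher orders suffer.
reordered = "a lazy dog is jumped over by the quick brown fox".split()
# A much shorter output: no bigram or higher-order matches at all.
short = "the dog is lazy".split()

smooth = SmoothingFunction()

# Lowering the maximum n-gram order (unigrams + bigrams only) is conceptually
# what reducing max_n_gram does; it softens the word-ordering penalty.
print(sentence_bleu([reference], reordered, weights=(0.5, 0.5)))

# Smoothing keeps the score non-zero when higher-order n-grams never match.
print(sentence_bleu([reference], short, smoothing_function=smooth.method2))
print(sentence_bleu([reference], short, smoothing_function=smooth.method4))
```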