URL: https://www.geeksforgeeks.org/nlp/understanding-bleu-and-rouge-score-for-nlp-evaluation/

BLEU and ROUGE score for NLP evaluation

Last Updated : 17 Mar, 2026

In Natural Language Processing, evaluating generated text is essential to understand how well a model performs. Metrics such as BLEU and ROUGE are commonly used to compare machine-generated output with human-written reference text. These metrics quantify how closely the generated content matches the reference, measured by the words and phrases the two texts share.

  • Both metrics compare the model's output (candidate text) with one or more human reference texts.
  • They measure similarity based on overlapping words and phrases.
  • The scores help in comparing different models and improving performance.
  • The final output is a numerical score, making the evaluation objective and easy to scale.

Understanding BLEU Score

The BLEU (Bilingual Evaluation Understudy) score is a metric mainly used to evaluate machine translation systems. It measures how closely a machine-generated translation matches one or more human-written reference translations. The basic idea is that the more similar the candidate text is to the reference text, the better the translation quality.

This metric works by:

  • Comparing n-grams (contiguous word sequences such as unigrams, bigrams and trigrams) between the candidate and reference text.
  • Calculating precision at each n-gram level to check how many of the candidate's word sequences also appear in the reference.
  • Combining these precision values into a single overall score.
  • Including a brevity penalty so that overly short translations do not receive artificially high scores.
  • Producing a final score between 0 and 1, where values closer to 1 indicate higher similarity to the reference translation.
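
The first two steps can be sketched in a few lines of plain Python. This is a minimal illustration rather than a reference implementation: the helper names `ngrams` and `clipped_precision` are made up for this example, only a single reference is handled, and tokens come from a simple whitespace split.

```python
from collections import Counter


def ngrams(tokens, n):
    """Return all contiguous n-grams of a token list as tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def clipped_precision(candidate, reference, n):
    """Modified n-gram precision against a single reference (illustrative helper)."""
    cand_counts = Counter(ngrams(candidate, n))
    ref_counts = Counter(ngrams(reference, n))
    # Each candidate n-gram is credited at most as often as it occurs in the reference.
    clipped = sum(min(count, ref_counts[gram]) for gram, count in cand_counts.items())
    total = sum(cand_counts.values())
    return clipped / total if total else 0.0


candidate = "the the the the the the the".split()
reference = "the cat is on the mat".split()

# Only 2 of the 7 candidate unigrams are credited, because "the" occurs twice in the reference.
print(clipped_precision(candidate, reference, 1))
```

Without clipping, the repeated word would give this degenerate candidate a unigram precision of 7/7, which is why BLEU uses the modified count.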

Working of BLEU

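BLEU is based on modified n-gram precision combined with a brevity penalty. First, the modified n-gram precision $p_n$ is calculated as:

$$p_n = \frac{\sum_{C \in \text{Candidates}} \sum_{\text{n-gram} \in C} \text{Count}_{\text{clip}}(\text{n-gram})}{\sum_{C' \in \text{Candidates}} \sum_{\text{n-gram}' \in C'} \text{Count}(\text{n-gram}')}$$

Here $\text{Count}_{\text{clip}}$ caps the count of each candidate n-gram at the maximum number of times it appears in any single reference.

To prevent very short translations from receiving high precision scores, BLEU applies a brevity penalty (BP):

$$BP = \begin{cases} 1 & \text{if } c > r \\ e^{\,1 - r/c} & \text{if } c \le r \end{cases}$$

Where $c$ is the candidate length and $r$ is the reference length.

The final BLEU score combines the geometric mean of n-gram precisions with the brevity penalty:

$$\text{BLEU} = BP \cdot \exp\left(\sum_{n=1}^{N} w_n \log p_n\right)$$

Here, $w_n$ are weights (usually equal, $w_n = 1/N$) and $p_n$ represents the modified n-gram precision.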
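
To make the three pieces concrete, here is a small sentence-level sketch that computes clipped precisions, the brevity penalty and their combination. It assumes a single reference, uniform weights, N = 4 and no smoothing; `simple_bleu` and `ngram_counts` are hypothetical names, and library implementations add details (multiple references, smoothing, corpus-level aggregation) that are left out here.

```python
import math
from collections import Counter


def ngram_counts(tokens, n):
    """Count the contiguous n-grams in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def simple_bleu(candidate, reference, max_n=4):
    """Unsmoothed sentence-level BLEU against one reference (illustrative only)."""
    precisions = []
    for n in range(1, max_n + 1):
        cand = ngram_counts(candidate, n)
        ref = ngram_counts(reference, n)
        total = sum(cand.values())
        clipped = sum(min(count, ref[gram]) for gram, count in cand.items())
        precisions.append(clipped / total if total else 0.0)

    # Without smoothing, a single zero precision makes the geometric mean zero.
    if min(precisions) == 0.0:
        return 0.0

    c, r = len(candidate), len(reference)
    bp = 1.0 if c > r else math.exp(1 - r / c)  # brevity penalty
    log_mean = sum(math.log(p) for p in precisions) / max_n  # uniform weights w_n = 1/N
    return bp * math.exp(log_mean)


reference = "the cat sat on the mat".split()
truncated = "the cat sat on the".split()

# Every n-gram of the truncated candidate appears in the reference,
# so all precisions are 1 and only the brevity penalty lowers the score.
print(round(simple_bleu(truncated, reference), 4))
```

In this example the score equals the brevity penalty alone, exp(1 - 6/5), which shows how BP stops a short but perfectly precise candidate from scoring 1.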

Understanding ROUGE Score

The ROUGE (Recall-Oriented Understudy for Gisting Evaluation) score is mainly used to evaluate text summarization and other text generation tasks. It measures how much of the content of the reference text is captured in the generated output. The final ROUGE score ranges from 0 to 1, where higher values indicate better content coverage and similarity to the reference text.

Unlike BLEU, which focuses more on precision, ROUGE emphasizes recall, meaning it checks how much relevant content is covered.

ROUGE works through different variants:

  • ROUGE-N: Measures overlap of n-grams (contiguous word sequences) between candidate and reference text.
  • ROUGE-L: Uses the longest common subsequence (LCS) to evaluate sentence-level similarity.
  • ROUGE-S: Measures skip-bigram overlap, allowing gaps between paired words.
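
As a rough illustration of the ROUGE-N variant, the sketch below computes n-gram recall against a single reference. The function name `rouge_n_recall` is chosen for this example only, and tokenization is a plain whitespace split with no stemming.

```python
from collections import Counter


def rouge_n_recall(candidate, reference, n=1):
    """Fraction of reference n-grams that also occur in the candidate (illustrative)."""
    cand = Counter(tuple(candidate[i:i + n]) for i in range(len(candidate) - n + 1))
    ref = Counter(tuple(reference[i:i + n]) for i in range(len(reference) - n + 1))
    # A reference n-gram is matched at most as often as the candidate contains it.
    matched = sum(min(count, cand[gram]) for gram, count in ref.items())
    total = sum(ref.values())
    return matched / total if total else 0.0


reference = "the cat sat on the mat".split()
candidate = "the cat lay on the mat".split()

print(rouge_n_recall(candidate, reference, n=1))  # 5 of 6 reference unigrams are covered
print(rouge_n_recall(candidate, reference, n=2))  # 3 of 5 reference bigrams are covered
```

Note that the denominator counts n-grams in the reference, not the candidate, which is what makes this a recall measure.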

Working of ROUGE

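ROUGE focuses on recall rather than precision. For ROUGE-N, the formula is:

$$\text{ROUGE-N} = \frac{\sum_{S \in \text{References}} \sum_{\text{gram}_n \in S} \text{Count}_{\text{match}}(\text{gram}_n)}{\sum_{S \in \text{References}} \sum_{\text{gram}_n \in S} \text{Count}(\text{gram}_n)}$$

ROUGE-L is based on the Longest Common Subsequence (LCS). Its recall version is:

$$R_{lcs} = \frac{LCS(X, Y)}{m}$$

Where $X$ is the reference of length $m$, $Y$ is the candidate and $LCS(X, Y)$ is the length of their longest common subsequence.

ROUGE-S is based on skip-bigrams, which are pairs of words that appear in the same order in a sentence, but not necessarily consecutively. This allows the metric to capture flexible word ordering while preserving sequence structure. The recall-based ROUGE-S formula is:

$$R_{skip2} = \frac{SKIP2(X, Y)}{C(m, 2)}$$

Where $SKIP2(X, Y)$ is the number of skip-bigrams shared by the reference $X$ and the candidate $Y$, and $C(m, 2) = m(m-1)/2$ is the number of skip-bigrams in a reference of length $m$.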
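
The recall forms of ROUGE-L and ROUGE-S can be sketched as follows. This is an illustrative version under simple assumptions: one reference, whitespace tokens, no limit on the skip distance, and helper names (`lcs_length`, `rouge_l_recall`, `rouge_s_recall`) invented for this example.

```python
from collections import Counter
from itertools import combinations


def lcs_length(x, y):
    """Length of the longest common subsequence, via dynamic programming."""
    table = [[0] * (len(y) + 1) for _ in range(len(x) + 1)]
    for i, xi in enumerate(x, start=1):
        for j, yj in enumerate(y, start=1):
            if xi == yj:
                table[i][j] = table[i - 1][j - 1] + 1
            else:
                table[i][j] = max(table[i - 1][j], table[i][j - 1])
    return table[len(x)][len(y)]


def rouge_l_recall(candidate, reference):
    """LCS length divided by the reference length."""
    return lcs_length(reference, candidate) / len(reference) if reference else 0.0


def rouge_s_recall(candidate, reference):
    """Shared skip-bigrams divided by the number of skip-bigrams in the reference."""
    # combinations() keeps the original word order, so each pair is an in-order skip-bigram.
    ref = Counter(combinations(reference, 2))
    cand = Counter(combinations(candidate, 2))
    total = sum(ref.values())
    matched = sum(min(count, cand[pair]) for pair, count in ref.items())
    return matched / total if total else 0.0


reference = "police killed the gunman".split()
candidate = "police kill the gunman".split()

print(rouge_l_recall(candidate, reference))  # LCS "police the gunman" has length 3 of 4
print(rouge_s_recall(candidate, reference))  # 3 of the 6 reference skip-bigrams match
```

Both measures give partial credit here even though one word differs, because the remaining words still appear in the same relative order.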

BLEU vs ROUGE

Both BLEU and ROUGE are automated evaluation metrics, but they measure text quality from different perspectives. The comparison below highlights their primary differences in focus, usage and evaluation strategy.

| Aspect | BLEU | ROUGE |
|---|---|---|
| Main Focus | Precision (how much generated text matches reference) | Recall (how much reference content is covered) |
| Primary Use Case | Machine Translation | Text Summarization |
| Matching Method | n-gram overlap with precision calculation | n-gram, LCS and skip-bigram overlap with recall emphasis |
| Length Handling | Uses brevity penalty for short outputs | No strict brevity penalty mechanism |
| Score Range | 0 to 1 (higher is better) | 0 to 1 (higher is better) |

When to Use Which Metric

  1. Use BLEU for evaluating machine translation and tasks where exact phrase precision matters.
  2. Use ROUGE for summarization tasks where coverage of key concepts is important.

Model Evaluation using BLEU and ROUGE

In this section, we evaluate the output of a real pretrained language model using BLEU and ROUGE. Instead of comparing dummy strings, we generate text from a model and measure how closely it matches a human-written reference.

Step 1: Install Required Libraries

Run the following command in your terminal or command prompt:

pip install transformers torch nltk rouge-score

Step 2: Import Required Libraries

  • PyTorch (torch) runs the model and handles tensor operations.
  • Transformers loads the pretrained model and generates output from it.
  • nltk computes the BLEU score.
  • rouge-score computes the ROUGE metrics.
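
Before running the later steps, it can help to confirm that these packages are importable. The sketch below is one possible check, not part of the original walkthrough; `missing_modules` is a hypothetical helper. It uses the module names under which the packages are imported: the `rouge-score` pip package is imported as `rouge_score`.

```python
import importlib.util

# Import names of the packages installed in Step 1.
REQUIRED_MODULES = ["torch", "transformers", "nltk", "rouge_score"]


def missing_modules(names):
    """Return the module names that cannot be found in the current environment."""
    return [name for name in names if importlib.util.find_spec(name) is None]


missing = missing_modules(REQUIRED_MODULES)
print("Missing packages:", missing or "none")
```

If the list is not empty, rerun the install command from Step 1 in the same environment.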

Step 3: Load a Pretrained Language Model

This code loads the FLAN-T5 base model for sequence-to-sequence text generation. The tokenizer converts text into model-ready tokens and the model loads its pretrained weights to generate outputs for evaluation.
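
A minimal sketch of this step is shown below. It assumes the Hugging Face Hub identifier `google/flan-t5-base` and wraps the loading in a hypothetical helper, `load_flan_t5`; the import sits inside the function so that defining it does not require the library, and calling it downloads the weights on first use.

```python
def load_flan_t5(model_name="google/flan-t5-base"):
    """Load the tokenizer and seq2seq model (illustrative helper, downloads on first call)."""
    from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(model_name)
    model = AutoModelForSeq2SeqLM.from_pretrained(model_name)
    model.eval()  # inference mode: disables dropout
    return tokenizer, model


# Example call (commented out because it downloads the model weights):
# tokenizer, model = load_flan_t5()
```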

Step 4: Generate Text from the Model

This code generates text from the pretrained model using a given prompt.

  • tokenizer converts the prompt into tensors suitable for the model.
  • torch.no_grad() disables gradient computation since we are only performing inference.
  • model.generate() produces output tokens, limited to 80 new tokens.
  • tokenizer.decode() converts the generated tokens back into readable text.
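
The four bullets above could be put together roughly as follows. This is a sketch, not the article's exact code: `generate_text` is a hypothetical helper, the prompt in the comment is an assumed example, and `tokenizer` and `model` are expected to be the objects loaded in Step 3.

```python
def generate_text(tokenizer, model, prompt, max_new_tokens=80):
    """Generate a continuation for `prompt` and return it as plain text (illustrative)."""
    import torch

    # Convert the prompt into PyTorch tensors the model can consume.
    inputs = tokenizer(prompt, return_tensors="pt")
    # Inference only, so gradient tracking is switched off.
    with torch.no_grad():
        output_ids = model.generate(**inputs, max_new_tokens=max_new_tokens)
    # Decode the first (and only) generated sequence back into text.
    return tokenizer.decode(output_ids[0], skip_special_tokens=True)


# Example call with an assumed prompt:
# candidate_text = generate_text(tokenizer, model, "Explain machine learning in one sentence.")
# print("Generated Text:", candidate_text)
```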

Output:

Generated Text: Machine learning is a technique for detecting patterns in data.

Step 5: Prepare Reference and Candidate Text

This code prepares the human-written reference and the model-generated output for evaluation.

  • reference_text represents the ground truth sentence.
  • candidate_text contains the text generated by the model.
  • Both texts are split into tokens (words) because BLEU requires tokenized input.
  • The reference is wrapped inside a list since BLEU expects one or more reference sentences.
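
The preparation described above might look like the snippet below. The candidate is the generated sentence shown in Step 4; the reference sentence is a made-up placeholder for illustration, so substitute your own human-written reference.

```python
# Hypothetical reference sentence, used only for illustration.
reference_text = "Machine learning is a method for finding patterns in data."
# Model output reported in Step 4.
candidate_text = "Machine learning is a technique for detecting patterns in data."

# BLEU works on token lists; a plain whitespace split keeps punctuation attached to words.
reference_tokens = [reference_text.split()]  # list of references, each a list of tokens
candidate_tokens = candidate_text.split()

print(reference_tokens)
print(candidate_tokens)
```

A proper tokenizer would separate the final period from "data"; the whitespace split is kept here for simplicity.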

Step 6: Compute BLEU Score

This code calculates the BLEU score between the reference text and the model-generated output.

  • SmoothingFunction().method1 is applied to avoid zero scores when higher-order n-grams do not match.
  • sentence_bleu() compares the tokenized candidate text against the reference tokens.
  • The final score reflects how closely the generated output matches the reference in terms of n-gram precision.
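
A sketch of this computation with NLTK is shown below. `bleu_with_smoothing` is a hypothetical wrapper around `sentence_bleu` and `SmoothingFunction().method1`; the import sits inside the function so that defining it does not require NLTK. The value it returns depends on the reference sentence you supply.

```python
def bleu_with_smoothing(reference_tokens, candidate_tokens):
    """Sentence-level BLEU with NLTK's method1 smoothing (illustrative wrapper)."""
    from nltk.translate.bleu_score import SmoothingFunction, sentence_bleu

    smoothing = SmoothingFunction().method1
    # reference_tokens is a list of tokenized references; candidate_tokens is one token list.
    return sentence_bleu(reference_tokens, candidate_tokens, smoothing_function=smoothing)


# Example call, assuming the token lists prepared in Step 5:
# print(f"BLEU Score: {bleu_with_smoothing(reference_tokens, candidate_tokens):.3f}")
```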

Output:

BLEU Score: 0.124

Step 7: Compute ROUGE Score

This code evaluates the generated text using ROUGE metrics.

  • rouge1 measures unigram overlap.
  • rouge2 measures bigram overlap.
  • rougeL measures the longest common subsequence similarity.
  • use_stemmer=True improves matching by reducing words to their root forms.
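
One way to compute these metrics with the `rouge-score` package is sketched below. `rouge_scores` is a hypothetical wrapper; `RougeScorer.score` takes the reference (target) first and the candidate (prediction) second, both as raw strings rather than token lists.

```python
def rouge_scores(reference_text, candidate_text):
    """Return precision, recall and F-measure for rouge1, rouge2 and rougeL (illustrative)."""
    from rouge_score import rouge_scorer

    scorer = rouge_scorer.RougeScorer(["rouge1", "rouge2", "rougeL"], use_stemmer=True)
    scores = scorer.score(reference_text, candidate_text)  # (target, prediction)
    return {
        name: {"precision": s.precision, "recall": s.recall, "fmeasure": s.fmeasure}
        for name, s in scores.items()
    }


# Example call, assuming the strings prepared in Step 5:
# for name, values in rouge_scores(reference_text, candidate_text).items():
#     print(name, values)
```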

Limitations of BLEU and ROUGE

Although BLEU and ROUGE are widely used for automatic evaluation, they have inherent limitations.

  1. Dependence on n-gram Overlap: Both metrics rely on surface-level word or phrase matching, which may not fully capture fluency, coherence or semantic meaning.
  2. Limited Semantic Understanding: BLEU measures precision of matching phrases but may fail to recognize correct translations that use different wording.
  3. Recall Bias in ROUGE: ROUGE emphasizes recall and may reward longer or repetitive outputs that overlap more with the reference.
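
These limitations are easy to reproduce with a toy example. The sketch below uses a hand-rolled unigram recall (a simplified stand-in for ROUGE-1 recall, with a hypothetical helper name and made-up sentences) to compare a faithful paraphrase, a negated sentence and a padded output against the same reference.

```python
from collections import Counter


def unigram_recall(candidate, reference):
    """Share of reference words that also appear in the candidate (illustrative)."""
    overlap = sum((Counter(candidate) & Counter(reference)).values())
    return overlap / len(reference) if reference else 0.0


reference = "the film was excellent".split()
candidates = {
    "paraphrase": "that movie is great".split(),  # same meaning, no shared words
    "negation": "the film was not excellent".split(),  # opposite meaning, full overlap
    "padded": "the film was excellent and long and also very loud".split(),
}

for label, tokens in candidates.items():
    print(label, unigram_recall(tokens, reference))
```

The paraphrase scores 0 while the negated and padded sentences both reach full recall, which is why overlap scores are usually read alongside human judgment or semantic metrics.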