ROUGE Score

Definition

ROUGE (Recall-Oriented Understudy for Gisting Evaluation) is a set of metrics that evaluate text summarization quality by measuring the overlap between generated summaries and reference summaries, with an emphasis on recall of important content.

Why It Matters

ROUGE is the standard metric for text summarization. When research papers evaluate summarization systems, they report ROUGE scores. If you’re building any system that condenses text (document summarizers, meeting notes, article abstracts), you’ll measure quality with ROUGE.

While BLEU focuses on precision (how much of the output is correct), ROUGE emphasizes recall (how much of the reference content appears in the output). This matters for summarization: you want the summary to capture key information from the source, not just generate valid-sounding text.
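To make the distinction concrete, here is a toy unigram overlap calculation; the example strings and the simple whitespace split are assumptions for illustration, not a full ROUGE implementation.

reference = "the cat sat on the mat".split()
candidate = "the cat sat".split()

overlap = sum(1 for token in candidate if token in reference)
precision = overlap / len(candidate)  # 3 / 3 = 1.00: every generated word appears in the reference
recall = overlap / len(reference)     # 3 / 6 = 0.50: only half of the reference content is covered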

For AI engineers building summarization or RAG systems, ROUGE provides automated evaluation during development. You can compare different prompts, models, or chunking strategies by measuring ROUGE on a test set. It’s not perfect, but it’s fast and correlates with human judgment.
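In practice this means scoring many (reference, output) pairs and averaging. A minimal sketch with the rouge-score package is shown below; the test_set pairs are made-up placeholders you would replace with your own evaluation data.

from rouge_score import rouge_scorer

# Hypothetical test set of (reference summary, model output) pairs
test_set = [
    ("the economy grew 3 percent last quarter", "gdp rose 3 percent in the last quarter"),
    ("heavy rain caused flooding downtown", "downtown streets flooded after heavy rain"),
]

scorer = rouge_scorer.RougeScorer(["rouge1", "rouge2", "rougeL"], use_stemmer=True)
f1 = [scorer.score(ref, gen)["rougeL"].fmeasure for ref, gen in test_set]
print(sum(f1) / len(f1))  # mean ROUGE-L F1 across the test set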

Implementation Basics

ROUGE Variants

  • ROUGE-1: Unigram overlap (individual words)
  • ROUGE-2: Bigram overlap (two-word sequences)
  • ROUGE-L: Longest common subsequence (captures sentence structure; see the sketch after this list)
  • ROUGE-Lsum: ROUGE-L applied to summary-level evaluation
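
ROUGE-L rests on the longest common subsequence between candidate and reference. The sketch below is a generic dynamic-programming LCS over word lists, shown only to illustrate the idea; it is not the exact implementation inside any ROUGE package.

def lcs_length(a, b):
    # Classic dynamic-programming longest common subsequence over token lists
    table = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i, x in enumerate(a, 1):
        for j, y in enumerate(b, 1):
            table[i][j] = table[i - 1][j - 1] + 1 if x == y else max(table[i - 1][j], table[i][j - 1])
    return table[-1][-1]

reference = "the cat sat on the mat".split()
candidate = "the cat lay on a mat".split()
lcs = lcs_length(reference, candidate)  # "the cat ... on ... mat" -> length 4
recall = lcs / len(reference)           # ROUGE-L recall component: 4 / 6
precision = lcs / len(candidate)        # ROUGE-L precision component: 4 / 6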

Score Interpretation ROUGE reports precision, recall, and F1; F1 is the number most commonly reported. Scores range from 0 to 1 (often scaled to 0-100):

  • ROUGE-1 F1 > 0.45 is generally good for news summarization
  • ROUGE-2 F1 > 0.20 indicates reasonable phrase-level overlap
  • Scores vary by domain; benchmark against the published baselines for your dataset

Implementation Use the rouge-score or evaluate libraries:

from rouge_score import rouge_scorer

# use_stemmer=True reduces sensitivity to inflection ("runs" vs. "running")
scorer = rouge_scorer.RougeScorer(['rouge1', 'rouge2', 'rougeL'], use_stemmer=True)
scores = scorer.score(reference, generated)  # both arguments are plain strings
print(scores['rouge1'].fmeasure)             # each entry holds precision, recall, fmeasure
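
The Hugging Face evaluate library wraps the same metric behind a different interface. A brief sketch is below; exact output keys and defaults depend on the installed version, so treat the printed dict as indicative.

import evaluate

rouge = evaluate.load("rouge")
results = rouge.compute(
    predictions=["a cat was sitting on the mat"],
    references=["the cat sat on the mat"],
)
print(results)  # typically a dict with rouge1, rouge2, rougeL, rougeLsum scores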

Limitations ROUGE measures word overlap, not semantic quality. It misses:

  • Paraphrased content (same meaning, different words)
  • Factual correctness
  • Coherence and readability

A summary could have high ROUGE by copying key phrases but still be a poor summary. Combine with human evaluation for production systems.
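
The paraphrase gap is easy to demonstrate: the two sentences below mean the same thing but share almost no surface vocabulary, so ROUGE-1 lands near zero. The example strings are assumptions chosen for illustration.

from rouge_score import rouge_scorer

scorer = rouge_scorer.RougeScorer(["rouge1"], use_stemmer=True)
reference = "profits increased sharply in the fourth quarter"
paraphrase = "earnings rose steeply during q4"
print(scorer.score(reference, paraphrase)["rouge1"].fmeasure)  # near zero despite identical meaning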

Source

Lin, Chin-Yew. 2004. ROUGE: A Package for Automatic Evaluation of Summaries. The paper shows that ROUGE metrics correlate well with human judgments of summary quality and provide automatic evaluation for text summarization.

https://aclanthology.org/W04-1013/