A Structured Review of the Validity of BLEU

Research output: Contribution to journal › Article

Abstract

The BLEU metric has been widely used in NLP for over 15 years to evaluate NLP systems, especially in machine translation and natural language generation. I present a structured review of the evidence on whether BLEU is a valid evaluation technique, in other words whether BLEU scores correlate with real-world utility and user-satisfaction of NLP systems; this review covers 284 correlations reported in 34 papers. Overall, the evidence supports using BLEU for diagnostic evaluation of MT systems (which is what it was originally proposed for), but does not support using BLEU outwith MT, for evaluation of individual texts, or for scientific hypothesis testing.
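
To make the abstract's central question concrete, the sketch below (not taken from the paper) illustrates the kind of correlation the reviewed studies report: system-level BLEU scores for several MT systems compared against human quality ratings. It assumes NLTK's corpus_bleu and SciPy's pearsonr; every system, output, reference, and rating is invented purely for illustration.

# A minimal, hypothetical sketch of the kind of validity check the review
# surveys: does system-level BLEU agree with human ratings? All systems,
# outputs, references and human scores below are invented for illustration.
from nltk.translate.bleu_score import corpus_bleu
from scipy.stats import pearsonr

# Hypothetical outputs of three MT systems on the same two source segments.
systems = {
    "sys_a": [["the", "cat", "sat", "on", "the", "mat"],
              ["he", "read", "the", "report", "carefully"]],
    "sys_b": [["the", "cat", "sat", "the", "mat"],
              ["he", "reads", "report"]],
    "sys_c": [["the", "cat", "sat"],
              ["he", "read", "it"]],
}

# One reference translation per segment (BLEU also accepts several).
references = [[["the", "cat", "sat", "on", "the", "mat"]],
              [["he", "read", "the", "report", "carefully"]]]

# Hypothetical mean human adequacy ratings (1-5 scale) for each system.
human_scores = {"sys_a": 4.6, "sys_b": 3.1, "sys_c": 1.8}

# System-level BLEU (up to bigrams, to keep this toy example non-zero).
bleu_scores = {name: corpus_bleu(references, outputs, weights=(0.5, 0.5))
               for name, outputs in systems.items()}

# Correlate metric scores with human ratings, as the reviewed studies do.
names = sorted(systems)
r, p = pearsonr([bleu_scores[n] for n in names],
                [human_scores[n] for n in names])
print({n: round(bleu_scores[n], 3) for n in names})
print(f"system-level Pearson r = {r:.2f} (p = {p:.2f})")
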
Original language: English
Pages (from-to): 393-401
Number of pages: 9
Journal: Computational Linguistics
Volume: 44
Issue number: 3
Early online date: 21 Sep 2018
DOIs: 10.1162/COLI_a_00322
Publication status: Published - Sep 2018

Cite this

A Structured Review of the Validity of BLEU. / Reiter, Ehud.

In: Computational Linguistics, Vol. 44, No. 3, 09.2018, p. 393-401.

Research output: Contribution to journal › Article

@article{d72886c4bc0e4a2fabe64b173120beaf,
  title     = "A Structured Review of the Validity of BLEU",
  abstract  = "The BLEU metric has been widely used in NLP for over 15 years to evaluate NLP systems, especially in machine translation and natural language generation. I present a structured review of the evidence on whether BLEU is a valid evaluation technique, in other words whether BLEU scores correlate with real-world utility and user-satisfaction of NLP systems; this review covers 284 correlations reported in 34 papers. Overall, the evidence supports using BLEU for diagnostic evaluation of MT systems (which is what it was originally proposed for), but does not support using BLEU outwith MT, for evaluation of individual texts, or for scientific hypothesis testing.",
  author    = "Ehud Reiter",
  year      = "2018",
  month     = sep,
  doi       = "10.1162/COLI_a_00322",
  language  = "English",
  volume    = "44",
  pages     = "393--401",
  journal   = "Computational Linguistics",
  issn      = "0891-2017",
  publisher = "MIT Press Journals",
  number    = "3",
}

TY  - JOUR
T1  - A Structured Review of the Validity of BLEU
AU  - Reiter, Ehud
PY  - 2018/9
Y1  - 2018/9
N2  - The BLEU metric has been widely used in NLP for over 15 years to evaluate NLP systems, especially in machine translation and natural language generation. I present a structured review of the evidence on whether BLEU is a valid evaluation technique, in other words whether BLEU scores correlate with real-world utility and user-satisfaction of NLP systems; this review covers 284 correlations reported in 34 papers. Overall, the evidence supports using BLEU for diagnostic evaluation of MT systems (which is what it was originally proposed for), but does not support using BLEU outwith MT, for evaluation of individual texts, or for scientific hypothesis testing.
AB  - The BLEU metric has been widely used in NLP for over 15 years to evaluate NLP systems, especially in machine translation and natural language generation. I present a structured review of the evidence on whether BLEU is a valid evaluation technique, in other words whether BLEU scores correlate with real-world utility and user-satisfaction of NLP systems; this review covers 284 correlations reported in 34 papers. Overall, the evidence supports using BLEU for diagnostic evaluation of MT systems (which is what it was originally proposed for), but does not support using BLEU outwith MT, for evaluation of individual texts, or for scientific hypothesis testing.
U2  - 10.1162/COLI_a_00322
DO  - 10.1162/COLI_a_00322
M3  - Article
VL  - 44
SP  - 393
EP  - 401
JO  - Computational Linguistics
JF  - Computational Linguistics
SN  - 0891-2017
IS  - 3
ER  -