Abstract
The BLEU metric has been widely used for over 15 years to evaluate NLP systems, especially in machine translation and natural language generation. I present a structured review of the evidence on whether BLEU is a valid evaluation technique; in other words, whether BLEU scores correlate with the real-world utility and user satisfaction of NLP systems. The review covers 284 correlations reported in 34 papers. Overall, the evidence supports using BLEU for diagnostic evaluation of MT systems (which is what it was originally proposed for), but does not support using BLEU outwith MT, for evaluation of individual texts, or for scientific hypothesis testing.
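For context on the metric under review: BLEU scores a hypothesis translation against one or more references by combining clipped n-gram precisions with a brevity penalty. The sketch below is an illustrative, minimal implementation of that standard formulation (plain BLEU without smoothing), not code from the paper or its dataset; the function and variable names are my own.

```python
from collections import Counter
import math

def ngrams(tokens, n):
    """All contiguous n-grams of a token list, as tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def bleu(hyp, refs, max_n=4):
    """Sentence-level BLEU of `hyp` (token list) against `refs`
    (list of token lists), following the standard definition:
    geometric mean of clipped n-gram precisions times a brevity penalty."""
    precisions = []
    for n in range(1, max_n + 1):
        hyp_counts = Counter(ngrams(hyp, n))
        # Clip each n-gram count by its maximum count in any single reference.
        max_ref = Counter()
        for ref in refs:
            for gram, count in Counter(ngrams(ref, n)).items():
                max_ref[gram] = max(max_ref[gram], count)
        clipped = sum(min(c, max_ref[g]) for g, c in hyp_counts.items())
        total = max(1, sum(hyp_counts.values()))
        precisions.append(clipped / total)
    # Unsmoothed BLEU: any zero precision makes the geometric mean zero.
    if min(precisions) == 0:
        return 0.0
    geo_mean = math.exp(sum(math.log(p) for p in precisions) / max_n)
    # Brevity penalty uses the reference length closest to the hypothesis.
    ref_len = min((abs(len(r) - len(hyp)), len(r)) for r in refs)[1]
    bp = 1.0 if len(hyp) >= ref_len else math.exp(1 - ref_len / len(hyp))
    return bp * geo_mean
```

A hypothesis identical to a reference scores 1.0; a hypothesis sharing no 4-grams with any reference scores 0.0 under this unsmoothed variant, which is one reason BLEU is unreliable for evaluating individual texts.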
| Field | Value |
|---|---|
| Original language | English |
| Pages (from-to) | 393-401 |
| Number of pages | 9 |
| Journal | Computational Linguistics |
| Volume | 44 |
| Issue number | 3 |
| Early online date | 21 Sep 2018 |
| DOIs | |
| Publication status | Published - Sep 2018 |
Datasets
- Structured Review of the Validity of BLEU
  Reiter, E. (Creator), University of Aberdeen, 2018
  DOI: 10.20392/766c9dd8-75a7-4761-915d-856c0f7cc3c4
Profiles
- Ehud Reiter
  Computational Linguistics at Aberdeen
  School of Natural & Computing Sciences, Computing Science: Chair in Computing Science