The ReproGen Shared Task on Reproducibility of Human Evaluations in NLG: Overview and Results

Anya Belz, Anastasia Shimorina, Shubham Agarwal, Ehud Reiter

Research output: Chapter in Book/Report/Conference proceeding > Published conference contribution

17 Citations (Scopus)
12 Downloads (Pure)

Abstract

The NLP field has recently seen a substantial increase in work related to reproducibility of results, and more generally in recognition of the importance of having shared definitions and practices relating to evaluation. Much of the work on reproducibility has so far focused on metric scores, with reproducibility of human evaluation results receiving far less attention. As part of a research programme designed to develop theory and practice of reproducibility assessment in NLP, we organised the first shared task on reproducibility of human evaluations, ReproGen 2021. This paper describes the shared task in detail, summarises results from each of the reproduction studies submitted, and provides further comparative analysis of the results. Out of nine initial team registrations, we received submissions from four teams. Meta-analysis of the four reproduction studies revealed varying degrees of reproducibility, and allowed very tentative first conclusions about what types of evaluation tend to have better reproducibility.
Original language: English
Title of host publication: The 14th International Conference on Natural Language Generation
Subtitle of host publication: Proceedings of the Conference
Pages: 249–258
Number of pages: 10
Publication status: Published - 31 Aug 2021
Event: The 14th International Conference on Natural Language Generation - Virtual, Aberdeen, United Kingdom
Duration: 20 Sept 2021 – 24 Sept 2021
Conference number: 14
https://inlg2021.github.io/index.html

