TY - CPAPER
T1 - Quantified Reproducibility Assessment of NLP Results
AU - Belz, Anya
AU - Popovic, Maja
AU - Mille, Simon
N1 - Anya Belz, Maja Popovic, and Simon Mille. 2022. Quantified Reproducibility Assessment of NLP Results. In Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 16–28, Dublin, Ireland. Association for Computational Linguistics.
PY - 2022/5
Y1 - 2022/5
AB - This paper describes and tests a method for carrying out quantified reproducibility assessment (QRA) that is based on concepts and definitions from metrology. QRA produces a single score estimating the degree of reproducibility of a given system and evaluation measure, on the basis of the scores from, and differences between, different reproductions. We test QRA on 18 different system and evaluation measure combinations (involving diverse NLP tasks and types of evaluation), for each of which we have the original results and one to seven reproduction results. The proposed QRA method produces degree-of-reproducibility scores that are comparable across multiple reproductions not only of the same, but also of different, original studies. We find that the proposed method facilitates insights into causes of variation between reproductions, and as a result, allows conclusions to be drawn about what aspects of system and/or evaluation design need to be changed in order to improve reproducibility.
DO - 10.18653/v1/2022.acl-long.2
M3 - Published conference contribution
VL - 1
SP - 16
EP - 28
BT - Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
PB - Association for Computational Linguistics
CY - Dublin, Ireland
ER -