TY - CPAPER
T1 - Quantified Reproducibility Assessment of NLP Results
AU - Belz, Anya
AU - Popovic, Maja
AU - Mille, Simon
N1 - Anya Belz, Maja Popovic, and Simon Mille. 2022. Quantified Reproducibility Assessment of NLP Results. In Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 16–28, Dublin, Ireland. Association for Computational Linguistics.
PY - 2022/5
Y1 - 2022/5
AB - This paper describes and tests a method for carrying out quantified reproducibility assessment (QRA) that is based on concepts and definitions from metrology. QRA produces a single score estimating the degree of reproducibility of a given system and evaluation measure, on the basis of the scores from, and differences between, different reproductions. We test QRA on 18 different system and evaluation measure combinations (involving diverse NLP tasks and types of evaluation), for each of which we have the original results and one to seven reproduction results. The proposed QRA method produces degree-of-reproducibility scores that are comparable across multiple reproductions not only of the same, but also of different, original studies. We find that the proposed method facilitates insights into causes of variation between reproductions, and as a result, allows conclusions to be drawn about what aspects of system and/or evaluation design need to be changed in order to improve reproducibility.
DO - 10.18653/v1/2022.acl-long.2
M3 - Published conference contribution
VL - 1
SP - 16
EP - 28
BT - Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
PB - Association for Computational Linguistics
CY - Dublin, Ireland
ER -