Human Evaluation and Correlation with Automatic Metrics in Consultation Note Generation

Francesco Moramarco, Alex Papadopoulos Korfiatis, Mark Perera, Damir Juric, Jack Flann, Ehud Reiter, Aleksandar Savkov, Anja Belz

Research output: Chapter in Book/Report/Conference proceeding › Published conference contribution

16 Citations (Scopus)

Abstract

In recent years, machine learning models have rapidly become better at generating clinical consultation notes; yet, there is little work on how to properly evaluate the generated consultation notes and understand the impact they may have on both the clinician using them and the patient's clinical safety. To address this, we present an extensive human evaluation study of consultation notes in which 5 clinicians (i) listen to 57 mock consultations, (ii) write their own notes, (iii) post-edit a number of automatically generated notes, and (iv) extract all the errors, both quantitative and qualitative. We then carry out a correlation study between 18 automatic quality metrics and the human judgements. We find that a simple, character-based Levenshtein distance metric performs on par with, if not better than, common model-based metrics like BertScore. All our findings and annotations are open-sourced.
Original language: English
Title of host publication: Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
Editors: Smaranda Muresan, Preslav Nakov, Aline Villavicencio
Place of Publication: Dublin
Publisher: Association for Computational Linguistics
Pages: 5739–5754
Number of pages: 16
Volume: 1
ISBN (Electronic): 978-1-955917-21-6
DOIs
Publication status: Published - 1 May 2022
Event: ACL 2022: 60th Annual Meeting of the Association for Computational Linguistics - The Convention Centre Dublin, Dublin, Ireland
Duration: 22 May 2022 – 27 May 2022
Conference number: 60
https://www.2022.aclweb.org/

Conference

Conference: ACL 2022
Abbreviated title: ACL
Country/Territory: Ireland
City: Dublin
Period: 22/05/22 – 27/05/22

Bibliographical note

The authors would like to thank Rachel Young and Tom Knoll for supporting the team and hiring the evaluators, Vitalii Zhelezniak for his advice on revising the paper, and Kristian Boda for helping to set up the Stanza+Snomed fact-extraction system.

Keywords

  • Computation and Language (cs.CL)
  • FOS: Computer and information sciences

