Data Quality Assessment and Anomaly Detection Via Map/Reduce and Linked Data: A Case Study in the Medical Domain

Stephen Bonner; Andrew Stephen McGough; Ibad Kureshi; John Brennan; Georgios Theodoropoulos; Laura Moss; David Corsar; Grigoris Antoniou

Corresponding author: Stephen Bonner

Research output: Chapter in Book/Report/Conference proceeding › Published conference contribution

14 Citations (Scopus)

Abstract

Recent technological advances in modern healthcare have led to the ability to collect a vast wealth of patient monitoring data. This data can be utilised for patient diagnosis, but it also holds potential for use within medical research. However, these datasets often contain errors which limit their value to medical research, with one study finding error rates ranging from 2.3% to 26.9% in a selection of medical databases.

Previous methods for automatically assessing data quality normally rely on threshold rules, which are often unable to correctly identify errors because more complex domain knowledge is required. To combat this, a semantic web-based framework has previously been developed to assess the quality of medical data. However, early work, based solely on traditional semantic web technologies, revealed that they either cannot scale to the vast volumes of medical data or do so inefficiently.

In this paper we present a new method for storing and querying medical RDF datasets using Hadoop Map/Reduce. This approach exploits the inherent parallelism found within RDF datasets and queries, allowing us to scale with both dataset and system size. Unlike previous solutions, this framework uses highly optimised SPARQL joining strategies, intelligent data caching and a super-query to complete eight distinct SPARQL lookups, comprising over eighty distinct joins, in only two Map/Reduce iterations. Results are presented comparing the new method with both Jena and a previous Hadoop implementation, demonstrating its superior performance. The new method is shown to be five times faster than Jena and twice as fast as the previous approach.
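As a rough illustration of the kind of join the abstract refers to, the following sketch simulates a generic reduce-side join of two triple patterns on a shared subject, in plain Python rather than Hadoop. It is not the authors' implementation: the triples, the predicate names (`hasValue`, `recordedFor`) and all function names are hypothetical, and the optimised joining, caching and super-query strategies described in the paper are not reproduced here.

```python
from collections import defaultdict
from itertools import product

# Hypothetical (subject, predicate, object) triples; names are illustrative only.
TRIPLES = [
    ("reading1", "hasValue", "72"),
    ("reading1", "recordedFor", "patientA"),
    ("reading2", "hasValue", "310"),
    ("reading2", "recordedFor", "patientB"),
    ("reading3", "hasValue", "65"),  # no patient link, so it drops out of the join
]


def map_phase(triple, left_pred, right_pred):
    """Emit (join key, tagged object) for triples matching either pattern."""
    subject, predicate, obj = triple
    if predicate == left_pred:
        yield subject, ("L", obj)
    elif predicate == right_pred:
        yield subject, ("R", obj)


def shuffle(pairs):
    """Group mapper output by key, as the framework would between phases."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Pair every left-pattern binding with every right-pattern binding."""
    left = [obj for tag, obj in values if tag == "L"]
    right = [obj for tag, obj in values if tag == "R"]
    for left_obj, right_obj in product(left, right):
        yield key, left_obj, right_obj


def join_on_subject(triples, left_pred, right_pred):
    """Run one simulated map, shuffle and reduce pass over the triples."""
    pairs = (kv for t in triples for kv in map_phase(t, left_pred, right_pred))
    groups = shuffle(pairs)
    return sorted(row for key, vals in groups.items() for row in reduce_phase(key, vals))


RESULT = join_on_subject(TRIPLES, "hasValue", "recordedFor")
```

Each such join normally costs one Map/Reduce pass, which is why answering many joins in only two iterations, as the abstract reports, requires more than this naive one-join-per-pass scheme.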

Original language	English
Title of host publication	Proceedings 2015 IEEE International Conference On Big Data
Editors	H Ho, BC Ooi, MJ Zaki, XH Hu, L Haas, Kumar, S Rachuri, SP Yu, MHI Hsiao, J Li, F Luo, S Pyne, K Ogan
Publisher	IEEE Press
Pages	737-746
Number of pages	10
ISBN (Electronic)	978-1-4799-9926-2
ISBN (Print)	978-1-4799-9927-9
Publication status	Published - 2015
Event	IEEE International Conference on Big Data - Santa Clara, United States. Duration: 29 Oct 2015 → 1 Nov 2015

Conference

Conference	IEEE International Conference on Big Data
Country/Territory	United States
City	Santa Clara
Period	29/10/15 → 1/11/15

Bibliographical note

The authors would like to acknowledge the use of the University of Huddersfield Queensgate Grid in carrying out this work. We would also like to thank EPSRC for continued funding. In addition we would like to acknowledge the clinical input into our earlier work from Prof. John Kinsella (Glasgow Royal Infirmary) and Dr. Ian Piper (Institute of Neurological Sciences, New South Glasgow Hospital).

Keywords

RDF
Medical Data
Map / Reduce
Joins

Access to Document

http://ieeexplore.ieee.org/stamp/stamp.jsp?arnumber=7363818
Licence: Unspecified

Cite this

Bonner, S., McGough, A. S., Kureshi, I., Brennan, J., Theodoropoulos, G., Moss, L., Corsar, D., & Antoniou, G. (2015). Data Quality Assessment and Anomaly Detection Via Map/Reduce and Linked Data: A Case Study in the Medical Domain. In H. Ho, BC. Ooi, MJ. Zaki, XH. Hu, L. Haas, Kumar, S. Rachuri, SP. Yu, MHI. Hsiao, J. Li, F. Luo, S. Pyne, & K. Ogan (Eds.), Proceedings 2015 IEEE International Conference On Big Data (pp. 737-746). IEEE Press. http://ieeexplore.ieee.org/stamp/stamp.jsp?arnumber=7363818

Data Quality Assessment and Anomaly Detection Via Map/Reduce and Linked Data: A Case Study in the Medical Domain. / Bonner, Stephen; McGough, Andrew Stephen; Kureshi, Ibad et al.
Proceedings 2015 IEEE International Conference On Big Data. ed. / H Ho; BC Ooi; MJ Zaki; XH Hu; L Haas; Kumar; S Rachuri; SP Yu; MHI Hsiao; J Li; F Luo; S Pyne; K Ogan. IEEE Press, 2015. p. 737-746.

Bonner, S, McGough, AS, Kureshi, I, Brennan, J, Theodoropoulos, G, Moss, L, Corsar, D & Antoniou, G 2015, Data Quality Assessment and Anomaly Detection Via Map/Reduce and Linked Data: A Case Study in the Medical Domain. in H Ho, BC Ooi, MJ Zaki, XH Hu, L Haas, Kumar, S Rachuri, SP Yu, MHI Hsiao, J Li, F Luo, S Pyne & K Ogan (eds), Proceedings 2015 IEEE International Conference On Big Data. IEEE Press, pp. 737-746, IEEE International Conference on Big Data, Santa Clara, United States, 29/10/15. <http://ieeexplore.ieee.org/stamp/stamp.jsp?arnumber=7363818>

Bonner S, McGough AS, Kureshi I, Brennan J, Theodoropoulos G, Moss L et al. Data Quality Assessment and Anomaly Detection Via Map/Reduce and Linked Data: A Case Study in the Medical Domain. In Ho H, Ooi BC, Zaki MJ, Hu XH, Haas L, Kumar, Rachuri S, Yu SP, Hsiao MHI, Li J, Luo F, Pyne S, Ogan K, editors, Proceedings 2015 IEEE International Conference On Big Data. IEEE Press. 2015. p. 737-746

Bonner, Stephen ; McGough, Andrew Stephen ; Kureshi, Ibad et al. / Data Quality Assessment and Anomaly Detection Via Map/Reduce and Linked Data : A Case Study in the Medical Domain. Proceedings 2015 IEEE International Conference On Big Data. editor / H Ho ; BC Ooi ; MJ Zaki ; XH Hu ; L Haas ; Kumar ; S Rachuri ; SP Yu ; MHI Hsiao ; J Li ; F Luo ; S Pyne ; K Ogan. IEEE Press, 2015. pp. 737-746

@inproceedings{f4c9c880381c42d78ccf1c9b31cf7f9d,

title = "Data Quality Assessment and Anomaly Detection Via Map/Reduce and Linked Data: A Case Study in the Medical Domain",

abstract = "Recent technological advances in modern healthcare have led to the ability to collect a vast wealth of patient monitoring data. This data can be utilised for patient diagnosis, but it also holds potential for use within medical research. However, these datasets often contain errors which limit their value to medical research, with one study finding error rates ranging from 2.3\% to 26.9\% in a selection of medical databases. Previous methods for automatically assessing data quality normally rely on threshold rules, which are often unable to correctly identify errors because more complex domain knowledge is required. To combat this, a semantic web-based framework has previously been developed to assess the quality of medical data. However, early work, based solely on traditional semantic web technologies, revealed that they either cannot scale to the vast volumes of medical data or do so inefficiently. In this paper we present a new method for storing and querying medical RDF datasets using Hadoop Map/Reduce. This approach exploits the inherent parallelism found within RDF datasets and queries, allowing us to scale with both dataset and system size. Unlike previous solutions, this framework uses highly optimised SPARQL joining strategies, intelligent data caching and a super-query to complete eight distinct SPARQL lookups, comprising over eighty distinct joins, in only two Map/Reduce iterations. Results are presented comparing the new method with both Jena and a previous Hadoop implementation, demonstrating its superior performance. The new method is shown to be five times faster than Jena and twice as fast as the previous approach.",

keywords = "RDF, Medical Data, Map / Reduce, Joins",

author = "Stephen Bonner and McGough, {Andrew Stephen} and Ibad Kureshi and John Brennan and Georgios Theodoropoulos and Laura Moss and David Corsar and Grigoris Antoniou",

note = "The authors would like to acknowledge the use of the University of Huddersfield Queensgate Grid in carrying out this work. We would also like to thank EPSRC for continued funding. In addition we would like to acknowledge the clinical input into our earlier work from Prof. John Kinsella (Glasgow Royal Infirmary) and Dr. Ian Piper (Institute of Neurological Sciences, New South Glasgow Hospital).; IEEE International Conference on Big Data ; Conference date: 29-10-2015 Through 01-11-2015",

year = "2015",

language = "English",

isbn = "978-1-4799-9927-9",

pages = "737--746",

editor = "H Ho and BC Ooi and MJ Zaki and XH Hu and L Haas and Kumar and S Rachuri and SP Yu and MHI Hsiao and J Li and F Luo and S Pyne and K Ogan",

booktitle = "Proceedings 2015 IEEE International Conference On Big Data",

publisher = "IEEE Press",

}

TY - GEN

T1 - Data Quality Assessment and Anomaly Detection Via Map/Reduce and Linked Data

T2 - IEEE International Conference on Big Data

AU - Bonner, Stephen

AU - McGough, Andrew Stephen

AU - Kureshi, Ibad

AU - Brennan, John

AU - Theodoropoulos, Georgios

AU - Moss, Laura

AU - Corsar, David

AU - Antoniou, Grigoris

N1 - The authors would like to acknowledge the use of the University of Huddersfield Queensgate Grid in carrying out this work. We would also like to thank EPSRC for continued funding. In addition we would like to acknowledge the clinical input into our earlier work from Prof. John Kinsella (Glasgow Royal Infirmary) and Dr. Ian Piper (Institute of Neurological Sciences, New South Glasgow Hospital).

PY - 2015

Y1 - 2015

AB - Recent technological advances in modern healthcare have led to the ability to collect a vast wealth of patient monitoring data. This data can be utilised for patient diagnosis, but it also holds potential for use within medical research. However, these datasets often contain errors which limit their value to medical research, with one study finding error rates ranging from 2.3% to 26.9% in a selection of medical databases. Previous methods for automatically assessing data quality normally rely on threshold rules, which are often unable to correctly identify errors because more complex domain knowledge is required. To combat this, a semantic web-based framework has previously been developed to assess the quality of medical data. However, early work, based solely on traditional semantic web technologies, revealed that they either cannot scale to the vast volumes of medical data or do so inefficiently. In this paper we present a new method for storing and querying medical RDF datasets using Hadoop Map/Reduce. This approach exploits the inherent parallelism found within RDF datasets and queries, allowing us to scale with both dataset and system size. Unlike previous solutions, this framework uses highly optimised SPARQL joining strategies, intelligent data caching and a super-query to complete eight distinct SPARQL lookups, comprising over eighty distinct joins, in only two Map/Reduce iterations. Results are presented comparing the new method with both Jena and a previous Hadoop implementation, demonstrating its superior performance. The new method is shown to be five times faster than Jena and twice as fast as the previous approach.

KW - RDF

KW - Medical Data

KW - Map / Reduce

KW - Joins

M3 - Published conference contribution

SN - 978-1-4799-9927-9

SP - 737

EP - 746

BT - Proceedings 2015 IEEE International Conference On Big Data

A2 - Ho, H

A2 - Ooi, BC

A2 - Zaki, MJ

A2 - Hu, XH

A2 - Haas, L

A2 - Kumar

A2 - Rachuri, S

A2 - Yu, SP

A2 - Hsiao, MHI

A2 - Li, J

A2 - Luo, F

A2 - Pyne, S

A2 - Ogan, K

PB - IEEE Press

Y2 - 29 October 2015 through 1 November 2015

ER -
