A topological data analysis based classification method for multiple measurements

Henri Riihimäki; Wojciech Chachólski; Jakob Theorell; Jan Hillert; Ryan Ramanujam

doi:10.1186/s12859-020-03659-3

Corresponding author: Ryan Ramanujam

Mathematical Science

Research output: Contribution to journal › Article › peer-review

Abstract

Background: Machine learning models for repeated measurements are limited. Using topological data analysis (TDA), we present a classifier for repeated measurements which samples from the data space and builds a network graph based on the data topology. A machine learning model with cross-validation is then applied for classification. When tested on three case studies, accuracy exceeds that of an alternative support vector machine (SVM) voting model in most situations tested, with additional benefits such as reporting data subsets with high purity along with feature values.

Results: For 100 examples of 3 different tree species, the model reached 80% classification accuracy after 30 datapoints, which improved to 90% after sampling was increased to 400 datapoints. The alternative SVM classifier achieved a maximum accuracy of 68.7%. Using data from 100 examples from each class of 6 different random point processes, the classifier achieved 96.8% accuracy, vastly outperforming the SVM. Using two outcomes in neuron spiking data, the TDA classifier was similarly accurate to the SVM in one case (both converged to 97.8% accuracy), but was outperformed in the other (accuracies of 79.8% and 92.2%, respectively).

Conclusions: This algorithm and software can be beneficial for repeated measurement data common in biological sciences, as both an accurate classifier and a feature selection tool.
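The abstract compares against a "voting model" for repeated measurements without describing it in this record. As a rough illustration only, the sketch below shows one common way such a baseline can work: each measurement of a subject is classified on its own, and the subject receives the majority label. The per-measurement classifier here is a nearest-centroid stand-in chosen to keep the example dependency-free; it is not the SVM used in the paper, and the function names and toy data are hypothetical.

```python
from collections import Counter
from math import dist


def fit_centroids(points, labels):
    """Mean feature vector per class (stand-in for a trained per-measurement model)."""
    sums, counts = {}, Counter()
    for p, y in zip(points, labels):
        acc = sums.setdefault(y, [0.0] * len(p))
        for i, v in enumerate(p):
            acc[i] += v
        counts[y] += 1
    return {y: [v / counts[y] for v in acc] for y, acc in sums.items()}


def predict_point(centroids, p):
    """Label of the class centroid nearest to a single measurement."""
    return min(centroids, key=lambda y: dist(centroids[y], p))


def predict_subject(centroids, measurements):
    """Majority vote over the per-measurement predictions of one subject."""
    votes = Counter(predict_point(centroids, m) for m in measurements)
    return votes.most_common(1)[0][0]


# Toy training data: every measurement inherits the label of its subject.
train_points = [(0.0, 0.1), (0.2, 0.0), (0.1, 0.3), (1.0, 0.9), (0.8, 1.1), (1.2, 1.0)]
train_labels = ["A", "A", "A", "B", "B", "B"]
centroids = fit_centroids(train_points, train_labels)

# One unseen subject with five repeated measurements; one lies near the other class.
subject = [(0.1, 0.2), (0.0, 0.0), (0.9, 1.0), (0.3, 0.1), (0.2, 0.4)]
predicted = predict_subject(centroids, subject)
```

In this toy case a single stray measurement does not change the subject-level label, which is the main appeal of voting over repeated measurements. Replacing `predict_point` with any trained per-measurement classifier leaves the voting step unchanged.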

Original language	English
Article number	336
Number of pages	18
Journal	BMC Bioinformatics
Volume	21
DOIs	https://doi.org/10.1186/s12859-020-03659-3
Publication status	Published - 29 Jul 2020

Bibliographical note

HR was partly supported by a collaboration agreement between the University of Aberdeen and EPFL. WC was partially supported by VR 2014-04770 and Wallenberg AI, Autonomous Systems and Software Program (WASP) funded by Knut and Alice Wallenberg Foundation, Göran Gustafsson Stiftelse. JT is fully funded by the Wenner-Gren Foundation. JH is partially supported by VR K825930053. RR is partially supported by MultipleMS. The collaboration agreement between EPFL and University of Aberdeen played a role in the design of the neuron spiking analysis and in providing the data required, i.e. the neuronal network and the spiking activity. Open access funding provided by Karolinska Institute.

Keywords

Topological data analysis
Machine learning
Multiple measurement analysis
Trees/anatomy & histology
Humans
Rats
Support Vector Machine
Algorithms
Animals
Lasers
Computer Simulation
Data Analysis

Access to Document

10.1186/s12859-020-03659-3 (Licence: CC BY)

Riihimaki_etal_BMCbio_Topological_data_VOR
This article is licensed under a Creative Commons Attribution 4.0 International License. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated in a credit line to the data.
Final published version, 2.37 MB (Licence: CC BY)

Cite this

@article{62e79482c4fe4b999c0ace4d8e55b305,

title = "A topological data analysis based classification method for multiple measurements",

author = "Henri Riihim{\"a}ki and Wojciech Chach{\'o}lski and Jakob Theorell and Jan Hillert and Ryan Ramanujam",

year = "2020",

month = jul,

day = "29",

doi = "10.1186/s12859-020-03659-3",

language = "English",

volume = "21",

journal = "BMC Bioinformatics",

issn = "1471-2105",

publisher = "BioMed Central",

}

TY - JOUR

T1 - A topological data analysis based classification method for multiple measurements

AU - Riihimäki, Henri

AU - Chachólski, Wojciech

AU - Theorell, Jakob

AU - Hillert, Jan

AU - Ramanujam, Ryan

PY - 2020/7/29

Y1 - 2020/7/29

UR - http://www.scopus.com/inward/record.url?scp=85088852643&partnerID=8YFLogxK

U2 - 10.1186/s12859-020-03659-3

DO - 10.1186/s12859-020-03659-3

M3 - Article

C2 - 32727348

SN - 1471-2105

VL - 21

JO - BMC Bioinformatics

JF - BMC Bioinformatics

M1 - 336

ER -
