ARK

Aggregation of Reads by K-Means for Estimation of Bacterial Community Composition

David Koslicki, Saikat Chatterjee, Damon Shahrivar, Alan W Walker, Suzanna C Francis, Louise J Fraser, Mikko Vehkaperä, Yueheng Lan, Jukka Corander

Research output: Contribution to journal › Article

Abstract

MOTIVATION: Estimation of bacterial community composition from high-throughput sequenced 16S rRNA gene amplicons is a key task in microbial ecology. Since the sequence data from each sample typically consist of a large number of reads and are adversely impacted by different levels of biological and technical noise, accurate analysis of such large datasets is challenging.

RESULTS: There has been a recent surge of interest in using compressed-sensing-inspired and convex-optimization-based methods to solve the estimation problem for bacterial community composition. These methods typically rely on summarizing the sequence data by frequencies of low-order k-mers and matching this information statistically with a taxonomically structured database. Here we show that the accuracy of the resulting community composition estimates can be substantially improved by aggregating the reads from a sample with an unsupervised machine learning approach prior to the estimation phase. The aggregation of reads is a pre-processing approach where we use a standard K-means clustering algorithm that partitions a large set of reads into subsets with reasonable computational cost to provide several vectors of first-order statistics instead of only a single statistical summarization in terms of k-mer frequencies. The output of the clustering is then processed further to obtain the final estimate for each sample. The resulting method is called Aggregation of Reads by K-means (ARK), and it is based on a statistical argument via a mixture density formulation. ARK is found to improve the fidelity and robustness of several recently introduced methods, with only a modest increase in computational complexity.

AVAILABILITY: An open source, platform-independent implementation of the method in the Julia programming language is freely available at https://github.com/dkoslicki/ARK. A Matlab implementation is available at http://www.ee.kth.se/ctsoftware.
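
The aggregation step described in the abstract can be illustrated with a minimal Python sketch. This is not the authors' Julia or Matlab implementation: the function names (kmer_frequencies, aggregate_reads, combine), the choice k = 2, and the size-weighted combination of per-cluster results are assumptions made for illustration only. The sketch computes one k-mer frequency vector per read, clusters the vectors with a plain K-means (Lloyd's algorithm), and returns one centroid and one weight per cluster. In a full pipeline each centroid would be handed to a k-mer-based composition estimator, and the per-cluster estimates would then be merged; here the merge is shown as a weighted sum.

```python
import itertools
import random


def kmer_frequencies(read, k=2, alphabet="ACGT"):
    """Return the normalized k-mer frequency vector of a single read."""
    kmers = ["".join(p) for p in itertools.product(alphabet, repeat=k)]
    index = {kmer: i for i, kmer in enumerate(kmers)}
    counts = [0.0] * len(kmers)
    for i in range(len(read) - k + 1):
        j = index.get(read[i:i + k])  # k-mers with ambiguous bases are skipped
        if j is not None:
            counts[j] += 1.0
    total = sum(counts)
    return [c / total for c in counts] if total else counts


def mean_vector(vectors):
    """Component-wise mean of a non-empty list of equal-length vectors."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]


def squared_distance(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))


def aggregate_reads(vectors, n_clusters, n_iter=50, seed=0):
    """Cluster read vectors with Lloyd's K-means.

    Returns (centroids, weights), where each weight is the fraction of
    reads assigned to the corresponding cluster. Empty clusters are dropped.
    """
    rng = random.Random(seed)
    centroids = rng.sample(vectors, n_clusters)
    clusters = []
    for _ in range(n_iter):
        clusters = [[] for _ in range(n_clusters)]
        for v in vectors:
            nearest = min(range(n_clusters),
                          key=lambda c: squared_distance(v, centroids[c]))
            clusters[nearest].append(v)
        updated = [mean_vector(members) if members else centroids[c]
                   for c, members in enumerate(clusters)]
        converged = updated == centroids
        centroids = updated
        if converged:
            break
    kept = [(c, m) for c, m in zip(centroids, clusters) if m]
    return ([c for c, _ in kept],
            [len(m) / len(vectors) for _, m in kept])


def combine(estimates, weights):
    """Size-weighted sum of per-cluster estimates (an assumed merge rule)."""
    return [sum(w * e[i] for w, e in zip(weights, estimates))
            for i in range(len(estimates[0]))]


if __name__ == "__main__":
    reads = ["ACGTACGTAC", "ACGTACGTTT", "GGGGCCCCGG", "GGGCCCGGGC"]
    vectors = [kmer_frequencies(r, k=2) for r in reads]
    centroids, weights = aggregate_reads(vectors, n_clusters=2)
    print(len(centroids), weights)
```

One property worth noting: because each centroid is the mean of its cluster and each weight is the cluster's share of the reads, the weighted sum of the centroids equals the k-mer frequency vector of the whole sample. Clustering therefore splits the single summary used by earlier methods into several summaries without discarding information.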

Original language: English
Article number: e0140644
Journal: PloS ONE
Volume: 10
Issue number: 10
DOI: https://doi.org/10.1371/journal.pone.0140644
Publication status: Published - 23 Oct 2015

Fingerprint

bacterial communities
Agglomeration
Chemical analysis
Cluster Analysis
Compressed sensing
Convex optimization
Programming Languages
Ecology
methodology
Clustering algorithms
Computer programming languages
artificial intelligence
microbial ecology
Learning systems
Computational complexity
rRNA Genes
sampling
Genes
Throughput
Statistics

Cite this

Koslicki, D., Chatterjee, S., Shahrivar, D., Walker, A. W., Francis, S. C., Fraser, L. J., ... Corander, J. (2015). ARK: Aggregation of Reads by K-Means for Estimation of Bacterial Community Composition. PloS ONE, 10(10), [e0140644]. https://doi.org/10.1371/journal.pone.0140644

ARK: Aggregation of Reads by K-Means for Estimation of Bacterial Community Composition. / Koslicki, David; Chatterjee, Saikat; Shahrivar, Damon; Walker, Alan W; Francis, Suzanna C; Fraser, Louise J; Vehkaperä, Mikko; Lan, Yueheng; Corander, Jukka.

In: PloS ONE, Vol. 10, No. 10, e0140644, 23.10.2015.

Koslicki, D, Chatterjee, S, Shahrivar, D, Walker, AW, Francis, SC, Fraser, LJ, Vehkaperä, M, Lan, Y & Corander, J 2015, 'ARK: Aggregation of Reads by K-Means for Estimation of Bacterial Community Composition', PloS ONE, vol. 10, no. 10, e0140644. https://doi.org/10.1371/journal.pone.0140644
Koslicki D, Chatterjee S, Shahrivar D, Walker AW, Francis SC, Fraser LJ et al. ARK: Aggregation of Reads by K-Means for Estimation of Bacterial Community Composition. PloS ONE. 2015 Oct 23;10(10). e0140644. https://doi.org/10.1371/journal.pone.0140644
Koslicki, David; Chatterjee, Saikat; Shahrivar, Damon; Walker, Alan W; Francis, Suzanna C; Fraser, Louise J; Vehkaperä, Mikko; Lan, Yueheng; Corander, Jukka. / ARK: Aggregation of Reads by K-Means for Estimation of Bacterial Community Composition. In: PloS ONE. 2015; Vol. 10, No. 10.
@article{4140534ac4cc4c4c9e69a21f92139b39,
title = "ARK: Aggregation of Reads by K-Means for Estimation of Bacterial Community Composition",
abstract = "MOTIVATION: Estimation of bacterial community composition from high-throughput sequenced 16S rRNA gene amplicons is a key task in microbial ecology. Since the sequence data from each sample typically consist of a large number of reads and are adversely impacted by different levels of biological and technical noise, accurate analysis of such large datasets is challenging. RESULTS: There has been a recent surge of interest in using compressed-sensing-inspired and convex-optimization-based methods to solve the estimation problem for bacterial community composition. These methods typically rely on summarizing the sequence data by frequencies of low-order k-mers and matching this information statistically with a taxonomically structured database. Here we show that the accuracy of the resulting community composition estimates can be substantially improved by aggregating the reads from a sample with an unsupervised machine learning approach prior to the estimation phase. The aggregation of reads is a pre-processing approach where we use a standard K-means clustering algorithm that partitions a large set of reads into subsets with reasonable computational cost to provide several vectors of first-order statistics instead of only a single statistical summarization in terms of k-mer frequencies. The output of the clustering is then processed further to obtain the final estimate for each sample. The resulting method is called Aggregation of Reads by K-means (ARK), and it is based on a statistical argument via a mixture density formulation. ARK is found to improve the fidelity and robustness of several recently introduced methods, with only a modest increase in computational complexity. AVAILABILITY: An open source, platform-independent implementation of the method in the Julia programming language is freely available at https://github.com/dkoslicki/ARK. A Matlab implementation is available at http://www.ee.kth.se/ctsoftware.",
author = "David Koslicki and Saikat Chatterjee and Damon Shahrivar and Walker, {Alan W} and Francis, {Suzanna C} and Fraser, {Louise J} and Mikko Vehkaper{\"a} and Yueheng Lan and Jukka Corander",
note = "Funding: This work was supported by the Swedish Research Council Linnaeus Centre ACCESS (S.C.), ERC grant 239784 (J.C.), the Academy of Finland Center of Excellence COIN (J.C.), the Academy of Finland (M.V.), the Scottish Government’s Rural and Environment Science and Analytical Services Division (RESAS) (A.W.W.), and the UK MRC/DFID grant G1002369 (S.C.F.). L.J.F. received funding in the form of salary from Illumina Cambridge Ltd. The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.",
year = "2015",
month = "10",
day = "23",
doi = "10.1371/journal.pone.0140644",
language = "English",
volume = "10",
journal = "PloS ONE",
issn = "1932-6203",
publisher = "Public Library of Science",
number = "10",

}

TY - JOUR

T1 - ARK

T2 - Aggregation of Reads by K-Means for Estimation of Bacterial Community Composition

AU - Koslicki, David

AU - Chatterjee, Saikat

AU - Shahrivar, Damon

AU - Walker, Alan W

AU - Francis, Suzanna C

AU - Fraser, Louise J

AU - Vehkaperä, Mikko

AU - Lan, Yueheng

AU - Corander, Jukka

N1 - Funding: This work was supported by the Swedish Research Council Linnaeus Centre ACCESS (S.C.), ERC grant 239784 (J.C.), the Academy of Finland Center of Excellence COIN (J.C.), the Academy of Finland (M.V.), the Scottish Government’s Rural and Environment Science and Analytical Services Division (RESAS) (A.W.W.), and the UK MRC/DFID grant G1002369 (S.C.F.). L.J.F. received funding in the form of salary from Illumina Cambridge Ltd. The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.

PY - 2015/10/23

Y1 - 2015/10/23

AB - MOTIVATION: Estimation of bacterial community composition from high-throughput sequenced 16S rRNA gene amplicons is a key task in microbial ecology. Since the sequence data from each sample typically consist of a large number of reads and are adversely impacted by different levels of biological and technical noise, accurate analysis of such large datasets is challenging. RESULTS: There has been a recent surge of interest in using compressed-sensing-inspired and convex-optimization-based methods to solve the estimation problem for bacterial community composition. These methods typically rely on summarizing the sequence data by frequencies of low-order k-mers and matching this information statistically with a taxonomically structured database. Here we show that the accuracy of the resulting community composition estimates can be substantially improved by aggregating the reads from a sample with an unsupervised machine learning approach prior to the estimation phase. The aggregation of reads is a pre-processing approach where we use a standard K-means clustering algorithm that partitions a large set of reads into subsets with reasonable computational cost to provide several vectors of first-order statistics instead of only a single statistical summarization in terms of k-mer frequencies. The output of the clustering is then processed further to obtain the final estimate for each sample. The resulting method is called Aggregation of Reads by K-means (ARK), and it is based on a statistical argument via a mixture density formulation. ARK is found to improve the fidelity and robustness of several recently introduced methods, with only a modest increase in computational complexity. AVAILABILITY: An open source, platform-independent implementation of the method in the Julia programming language is freely available at https://github.com/dkoslicki/ARK. A Matlab implementation is available at http://www.ee.kth.se/ctsoftware.

U2 - 10.1371/journal.pone.0140644

DO - 10.1371/journal.pone.0140644

M3 - Article

VL - 10

JO - PloS ONE

JF - PloS ONE

SN - 1932-6203

IS - 10

M1 - e0140644

ER -