Seqenv: linking sequences to environments through text mining

Lucas Sinclair, Umer Z. Ijaz, Lars Juhl Jensen, Marco J L Coolen, Cecile Gubry-Rangin, Alica Chronakova, Anastasis Oulas, Christina Pavloudi, Julia Schnetzer, Aaron Weimann, Ali Ijaz, Alexander Eiler, Christopher Quince, Evangelos Pafilis

Research output: Contribution to journal › Article

Abstract

Understanding the distribution of taxa and associated traits across different environments is one of the central questions in microbial ecology. High-throughput sequencing (HTS) studies are presently generating huge volumes of data to address this biogeographical topic. However, these studies are often focused on specific environment types or processes leading to the production of individual, unconnected datasets. The large amounts of legacy sequence data with associated metadata that exist can be harnessed to better place the genetic information found in these surveys into a wider environmental context. Here we introduce a software program, seqenv, to carry out precisely such a task. It automatically performs similarity searches of short sequences against the "nt" nucleotide database provided by NCBI and, out of every hit, extracts – if it is available – the textual metadata field. After collecting all the isolation sources from all the search results, we run a text mining algorithm to identify and parse words that are associated with the Environmental Ontology (EnvO) controlled vocabulary. This, in turn, enables us to determine both in which environments individual sequences or taxa have previously been observed and, by weighted summation of those results, to summarize complete samples. We present two demonstrative applications of seqenv to a survey of ammonia oxidizing archaea as well as to a plankton paleome dataset from the Black Sea. These demonstrate the ability of the tool to reveal novel patterns in HTS and its utility in the fields of environmental source tracking, paleontology, and studies of microbial biogeography. To install, go to: https://github.com/xapple/seqenv
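The last two steps described in the abstract (matching isolation-source text against a controlled vocabulary, then summarizing a sample by weighted summation) can be illustrated with a minimal sketch. This is not the seqenv implementation: the vocabulary `TOY_TERMS`, the function names, the naive substring matching, and the example data are all assumptions made for illustration, and the real tool works from similarity-search hits and the full EnvO vocabulary.

```python
from collections import Counter

# Toy vocabulary: an assumption for illustration, not the real EnvO term list.
TOY_TERMS = ("soil", "sediment", "lake", "sea water")


def term_frequencies(isolation_sources):
    """Count toy-vocabulary terms across the isolation-source strings of one
    sequence's hits, then normalise the counts so they sum to 1."""
    counts = Counter()
    for text in isolation_sources:
        lowered = text.lower()
        for term in TOY_TERMS:
            # Naive substring match; a real text miner is far more careful.
            if term in lowered:
                counts[term] += 1
    total = sum(counts.values())
    if total == 0:
        return {}
    return {term: n / total for term, n in counts.items()}


def sample_profile(abundances, per_sequence):
    """Summarise one sample: sum each sequence's term frequencies,
    weighted by that sequence's abundance in the sample."""
    profile = Counter()
    for seq_id, abundance in abundances.items():
        for term, freq in per_sequence.get(seq_id, {}).items():
            profile[term] += abundance * freq
    return dict(profile)


# Hypothetical isolation sources gathered from the hits of two sequences.
hits = {
    "otu_1": ["forest soil", "agricultural soil", "lake sediment"],
    "otu_2": ["Black Sea water column", "sea water"],
}
# Hypothetical read counts of each sequence in one sample.
abundances = {"otu_1": 30, "otu_2": 10}

per_sequence = {seq_id: term_frequencies(texts) for seq_id, texts in hits.items()}
profile = sample_profile(abundances, per_sequence)
print(profile)
```

Normalising per sequence before weighting keeps a sequence with many database hits from dominating the sample profile purely because it is well represented in the database; whether seqenv normalises in this way is not stated in the abstract.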
Original language: English
Article number: 2690
Journal: PeerJ
Volume: 4
Early online date: 29 Jul 2016
DOI: https://doi.org/10.7717/peerj.2690
Publication status: Published - 20 Dec 2016

Fingerprint

Data Mining
Metadata
Throughput
Plankton
Thesauri
Paleontology
Ecology
Ammonia
Controlled Vocabulary
Ontology
Nucleotides
Microbial Ecology
Archaea
Software
Biogeography
Databases

Cite this

Sinclair, L., Ijaz, U. Z., Jensen, L. J., Coolen, M. J. L., Gubry-Rangin, C., Chronakova, A., ... Pafilis, E. (2016). Seqenv: linking sequences to environments through text mining. PeerJ, 4, [2690]. https://doi.org/10.7717/peerj.2690

@article{4e18519011a34ec3bc9afad380323538,
title = "Seqenv: linking sequences to environments through text mining",
author = "Lucas Sinclair and Ijaz, {Umer Z.} and Jensen, {Lars Juhl} and Coolen, {Marco J L} and Cecile Gubry-Rangin and Alica Chronakova and Anastasis Oulas and Christina Pavloudi and Julia Schnetzer and Aaron Weimann and Ali Ijaz and Alexander Eiler and Christopher Quince and Evangelos Pafilis",
note = "Funding Lucas Sinclair and Alexander Eiler were funded by the Swedish Foundation for strategic research (ICA10-0015). Umer Zeeshan Ijaz was funded by NERC IRF (NE/L011956/1). Lars Juhl Jensen was funded by the Novo Nordisk Foundation (NNF14CC0001). Evangelos Pafilis was supported by the European Commission FP7-REGPOT project MARBIGEN (grant agreement #264089) and the LifeWatchGreece Research Infrastructure (384676-94/GSRT/NSRF C&E). Christopher Quince is funded through the MRC Cloud Infrastructure for Microbial Bioinformatics (CLIMB) project (MR/L015080/1) through fellowship (MR/M50161X/1). Cecile Gubry was funded by the Environment Research Council Fellowship (NE/J019151/1). The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.",
year = "2016",
month = "12",
day = "20",
doi = "10.7717/peerj.2690",
language = "English",
volume = "4",
journal = "PeerJ",
issn = "2167-8359",
publisher = "PEERJ INC",

}
