Genetic classification of populations using supervised learning

The International Schizophrenia Consortium (ISC)

Research output: Contribution to journal › Article


Abstract

There are many instances in genetics in which we wish to determine whether two candidate populations are distinguishable on the basis of their genetic structure. Examples include populations which are geographically separated, case-control studies and quality control (when participants in a study have been genotyped at different laboratories). This latter application is of particular importance in the era of large-scale genome-wide association studies, when collections of individuals genotyped at different locations are being merged to provide increased power. The traditional method for detecting structure within a population is some form of exploratory technique such as principal components analysis. Such methods, which do not utilise our prior knowledge of the membership of the candidate populations, are termed unsupervised. Supervised methods, on the other hand, are able to utilise this prior knowledge when it is available. In this paper we demonstrate that in such cases modern supervised approaches are a more appropriate tool for detecting genetic differences between populations. We apply two such methods (neural networks and support vector machines) to the classification of three populations (two from Scotland and one from Bulgaria). The sensitivity exhibited by both these methods is considerably higher than that attained by principal components analysis and in fact comfortably exceeds a recently conjectured theoretical limit on the sensitivity of unsupervised methods. In particular, our methods can distinguish between the two Scottish populations, where principal components analysis cannot. We suggest, on the basis of our results, that a supervised learning approach should be the method of choice when classifying individuals into pre-defined populations, particularly in quality control for large-scale genome-wide association studies.
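
As a rough illustration of the supervised setting described in the abstract, the sketch below trains a linear support vector machine on simulated genotypes from two labelled populations and reports its accuracy on held-out individuals. It is not the authors' pipeline: the allele frequencies, sample sizes, function names and the Pegasos-style training loop are all assumptions chosen for a small, dependency-free example, and the frequency differences are deliberately exaggerated so that the toy problem is easy.

```python
import random


def simulate_genotypes(freqs, n_individuals, rng):
    """Draw one genotype (0, 1 or 2 allele copies) per SNP for each individual."""
    return [[int(rng.random() < p) + int(rng.random() < p) for p in freqs]
            for _ in range(n_individuals)]


def center(rows, means):
    """Subtract per-SNP means so that no separate bias term is needed."""
    return [[x - m for x, m in zip(row, means)] for row in rows]


def train_linear_svm(X, y, lam=1.0, epochs=10, rng=None):
    """Pegasos-style stochastic sub-gradient descent on the regularised hinge loss."""
    rng = rng or random.Random(0)
    w = [0.0] * len(X[0])
    order = list(range(len(X)))
    t = 0
    for _ in range(epochs):
        rng.shuffle(order)
        for i in order:
            t += 1
            eta = 1.0 / (lam * t)  # decaying step size
            margin = y[i] * sum(wj * xj for wj, xj in zip(w, X[i]))
            w = [(1.0 - eta * lam) * wj for wj in w]  # shrink (regularisation)
            if margin < 1.0:  # point is inside the margin or misclassified
                w = [wj + eta * y[i] * xj for wj, xj in zip(w, X[i])]
    return w


def predict(w, row):
    """Assign population +1 or -1 from the sign of the linear score."""
    return 1 if sum(wj * xj for wj, xj in zip(w, row)) >= 0 else -1


def run_demo(seed=0, n_snps=300, n_train=100, n_test=100):
    rng = random.Random(seed)
    # Hypothetical allele frequencies: population B drifts slightly from A.
    freq_a = [rng.uniform(0.2, 0.8) for _ in range(n_snps)]
    freq_b = [min(0.95, max(0.05, p + rng.gauss(0.0, 0.1))) for p in freq_a]

    def sample(n):
        rows = simulate_genotypes(freq_a, n, rng) + simulate_genotypes(freq_b, n, rng)
        labels = [-1] * n + [1] * n  # known population membership
        return rows, labels

    X_train, y_train = sample(n_train)
    X_test, y_test = sample(n_test)

    # Centre with training-set means only, then reuse them for the test set.
    means = [sum(col) / len(col) for col in zip(*X_train)]
    w = train_linear_svm(center(X_train, means), y_train, rng=rng)

    hits = sum(predict(w, row) == label
               for row, label in zip(center(X_test, means), y_test))
    return hits / len(y_test)


accuracy = run_demo()
print(f"held-out accuracy: {accuracy:.2f}")
```

The key design point is that the population labels are used during training and accuracy is measured only on individuals the classifier has not seen, which is what distinguishes this from an exploratory method such as principal components analysis.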

Original language: English
Article number: e14802
Pages (from-to): 1-12
Number of pages: 12
Journal: PloS ONE
Volume: 6
Issue number: 5
DOI: https://doi.org/10.1371/journal.pone.0014802
Publication status: Published - 12 May 2011

Fingerprint

Supervised learning
Population Genetics
Principal component analysis
Learning
Quality control
Genes
Population
Support vector machines
Genome-Wide Association Study
Methodology
Neural networks
Bulgaria
Genetic Structures
Scotland
Case-control studies

ASJC Scopus subject areas

  • Biochemistry, Genetics and Molecular Biology(all)
  • Agricultural and Biological Sciences(all)

Cite this

The International Schizophrenia Consortium (ISC) (2011). Genetic classification of populations using supervised learning. PloS ONE, 6(5), 1-12. [e14802]. https://doi.org/10.1371/journal.pone.0014802

@article{1b10b3bb5ffb4cfcb6ddd91bc479d3d9,
title = "Genetic classification of populations using supervised learning",
author = "Michael Bridges and Heron, {Elizabeth A.} and Colm O'Dushlaine and Ricardo Segurado and Derek Morris and Aiden Corvin and Michael Gill and Carlos Pinto and Morris, {Derek W.} and Colm O'Dushlaine and Elaine Kenny and Quinn, {Emma M.} and Michael Gill and Aiden Corvin and O'Donovan, {Michael C.} and Kirov, {George K.} and Craddock, {Nick J.} and Holmans, {Peter A.} and Williams, {Nigel M.} and Lucy Georgieva and Ivan Nikolov and N. Norton and H. Williams and Draga Toncheva and Vihra Milanova and Owen, {Michael J.} and Hultman, {Christina M.} and Paul Lichtenstein and Thelander, {Emma F.} and Patrick Sullivan and Andrew McQuillin and Khalid Choudhury and Susmita Datta and Jonathan Pimm and Srinivasa Thirumalai and Vinay Puri and Robert Krasucki and Jacob Lawrence and Digby Quested and Nicholas Bass and Hugh Gurling and Caroline Crombie and Gillian Fraser and Kuan, {Soh Leh} and Nicholas Walker and {St Clair}, David and Blackwood, {Douglas H.R.} and Muir, {Walter J.} and McGhee, {Kevin A.} and Ben Pickard and {The International Schizophrenia Consortium (ISC)}",
note = "Funding: This project has not been directly funded by any agency. The authors employed on research contracts are supported by the Wellcome Trust (http://www.wellcome.ac.uk), Science Foundation Ireland (http://www.sfi.ie), and the UK Science and Technology Research Council (http://www.stfc.ac.uk). The funders had no role in study design, data collection and analysis, decision to publish or preparation of the manuscript.",
year = "2011",
month = "5",
day = "12",
doi = "10.1371/journal.pone.0014802",
language = "English",
volume = "6",
pages = "1--12",
journal = "PloS ONE",
issn = "1932-6203",
publisher = "Public Library of Science",
number = "5"
}
