Variable selection and risk prediction using a penalised modelling framework for high-dimensional data in a nested matched case-control design

Mintu Nath, Simon PR Romaine, Andrea Koekemoer, Adriaan A Voors, James A. Timmons, Nilesh J Samani

Research output: Contribution to conference › Oral Presentation/Invited Talk


In a matched case-control (MCC) study, each subject with the outcome of interest (case) is matched with m subjects without the outcome (controls) from the same cohort, a design often called a 1:m nested MCC. This design is efficient for generating genome-wide transcript or proteome profiles that are prohibitively expensive to measure in the complete cohort. Modelling high-dimensional data in an MCC setting is nevertheless challenging. Researchers typically have two objectives: to select variables in order to identify biomarkers, and then to develop a risk prediction model for a future patient. We discuss an integrated approach to meet both objectives. First, a penalised conditional logistic regression (CLR) model is fitted using the lasso and elastic net to select variables in a high-dimensional space. Second, a risk prediction model based on the fitted CLR incorporates data on the matching criteria from the parent cohort together with the MCC data. This adjustment reflects the sampling probabilities of the cases and controls in the MCC cohort, conditional on the matching variables, relative to the full cohort. It enables estimation of the model intercept for each case-control stratum defined by the matching criteria using the entire cohort, and estimation of the area under the receiver operating characteristic (ROC) curve (AUC) for a set of pre-defined covariates. We applied this approach to over 35,000 transcripts and clinical biomarkers to predict the risk of cardiovascular mortality in 944 heart failure patients recruited in a nested MCC design (matched on age and sex). The proposed integrated methodology achieves the desired objectives: the combined model with clinical and transcriptomic profiles selected 88 transcripts, with an estimated AUC of 0.738 on a held-out sample. The approach allowed further biological investigation of putative transcripts and prediction of the risk of cardiovascular mortality conditional on the selected predictors.
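The variable-selection step can be illustrated with a minimal sketch. It is not the authors' implementation: the function names, the toy data, the convention that the case is the first row of each stratum, and the use of proximal gradient descent as the solver are all assumptions made here for illustration. The sketch maximises the conditional log-likelihood of 1:m matched sets under an elastic net penalty, where `alpha=1.0` gives the lasso.

```python
import math


def soft_threshold(value, threshold):
    """Proximal operator of the L1 penalty: shrink towards zero."""
    if value > threshold:
        return value - threshold
    if value < -threshold:
        return value + threshold
    return 0.0


def stratum_gradient(stratum, beta):
    """Gradient of one matched set's conditional log-likelihood.

    Assumed layout: `stratum` is a list of covariate vectors, with the
    case first and its m matched controls after it.
    """
    scores = [sum(b * x for b, x in zip(beta, row)) for row in stratum]
    top = max(scores)  # subtract the maximum for numerical stability
    weights = [math.exp(s - top) for s in scores]
    total = sum(weights)
    case = stratum[0]
    grad = []
    for k in range(len(beta)):
        # Weighted mean of covariate k over the matched set
        expected = sum(w * row[k] for w, row in zip(weights, stratum)) / total
        grad.append(case[k] - expected)
    return grad


def fit_enet_clr(strata, lam, alpha=1.0, step=0.1, n_iter=500):
    """Elastic net penalised CLR fitted by proximal gradient ascent.

    Penalty: lam * (alpha * |beta|_1 + (1 - alpha) / 2 * |beta|_2^2).
    """
    n_features = len(strata[0][0])
    n_strata = len(strata)
    beta = [0.0] * n_features
    for _ in range(n_iter):
        grad = [0.0] * n_features
        for stratum in strata:
            g = stratum_gradient(stratum, beta)
            for k in range(n_features):
                grad[k] += g[k] / n_strata
        # Gradient step on the likelihood, then the elastic net proximal map
        beta = [
            soft_threshold(b + step * g, step * lam * alpha)
            / (1.0 + step * lam * (1.0 - alpha))
            for b, g in zip(beta, grad)
        ]
    return beta


# Hypothetical toy data: four 1:2 matched sets, two covariates each.
toy_strata = [
    [[2.0, 0.5], [1.0, 0.4], [0.5, 0.6]],
    [[1.5, 0.1], [0.2, 0.2], [1.0, 0.0]],
    [[0.8, 0.9], [1.0, 1.0], [0.1, 0.8]],
    [[1.2, 0.3], [0.3, 0.3], [0.6, 0.3]],
]
print(fit_enet_clr(toy_strata, lam=0.05))
```

Covariates whose coefficients are shrunk exactly to zero are dropped from the model, which is how the penalty performs selection. No intercept appears because conditioning on each matched set removes the stratum-specific intercepts from the likelihood; this is why the risk prediction step needs information from the parent cohort to recover them.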
Original language: English
Publication status: E-pub ahead of print - 12 Sep 2022
Event: RSS International Conference 2022 - P&J Live, Aberdeen, United Kingdom
Duration: 12 Sep 2022 to 15 Sep 2022


Conference: RSS International Conference 2022
Country/Territory: United Kingdom


  • matched case-control design
  • high-dimensional data
  • penalised conditional logistic regression model
  • risk prediction
  • biomarkers

