Imputation of missing sub-hourly precipitation data in a large sensor network: a machine learning approach

Benedict Chivers; John Wallbank; Steven Cole; Ondrej Sebek; Simon Stanley; Matthew Fry; Georgios Leontidis

doi:10.1016/j.jhydrol.2020.125126

Benedict Chivers, John Wallbank, Steven Cole, Ondrej Sebek, Simon Stanley, Matthew Fry, Georgios Leontidis*

*Corresponding author for this work

Research output: Contribution to journal › Article › peer-review

Abstract

Precipitation data collected at sub-hourly resolution presents specific challenges for missing data recovery by being largely stochastic in nature and highly unbalanced in the duration of rain vs non-rain. Here we present a two-step analysis utilising current machine learning techniques for imputing precipitation data sampled at 30-minute intervals by devolving the task into (a) the classification of rain or non-rain samples, and (b) regressing the absolute values of predicted rain samples. Across 37 weather stations in the UK, this machine learning process produces more accurate predictions for recovering precipitation data than an established surface fitting technique utilising neighbouring rain gauges. Increasing available features for the training of machine learning algorithms increases performance, with the integration of weather data at the target site and externally sourced rain gauges providing the highest performance. This method informs machine learning models by utilising information in concurrently collected environmental data to make accurate predictions of missing rain data. Capturing complex non-linear relationships from weakly correlated variables is critical for data recovery at sub-hourly resolutions. Such pipelines for data recovery can be developed and deployed for highly automated and near-instantaneous imputation of missing values in ongoing datasets at high temporal resolutions.
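The two-step structure described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: a single-feature threshold classifier and a least-squares line stand in for the gradient boosted trees named in the keywords, and the function names, the single neighbouring-gauge feature, and the sample values are all hypothetical.

```python
# Toy two-step imputation: (a) classify rain vs non-rain, (b) regress the rain amount.
# Assumption: one neighbouring gauge is the only feature; all data below are made up.

def fit_rain_threshold(neighbour, target):
    """Step (a): pick the neighbour-gauge threshold that best separates rain from non-rain."""
    is_rain = [t > 0.0 for t in target]
    best_thr, best_acc = None, -1.0
    for thr in sorted(set(neighbour)):
        hits = sum((x >= thr) == y for x, y in zip(neighbour, is_rain))
        acc = hits / len(is_rain)
        if acc > best_acc:  # strict '>' keeps the lowest threshold among ties
            best_thr, best_acc = thr, acc
    return best_thr

def fit_rain_amount(neighbour, target):
    """Step (b): least-squares line fitted on rain samples only."""
    pairs = [(x, y) for x, y in zip(neighbour, target) if y > 0.0]
    mx = sum(x for x, _ in pairs) / len(pairs)
    my = sum(y for _, y in pairs) / len(pairs)
    sxx = sum((x - mx) ** 2 for x, _ in pairs)
    sxy = sum((x - mx) * (y - my) for x, y in pairs)
    slope = sxy / sxx if sxx else 0.0
    return slope, my - slope * mx

def impute(x, thr, slope, intercept):
    """Return 0 for predicted non-rain, otherwise the non-negative regressed amount."""
    if x < thr:
        return 0.0
    return max(0.0, intercept + slope * x)

# Hypothetical 30-minute totals (mm) at a neighbouring gauge and at the target site.
neighbour = [0.0, 0.0, 0.2, 0.0, 1.0, 2.0, 0.0, 3.0, 0.1, 4.0]
target = [0.0, 0.0, 0.0, 0.0, 1.2, 2.2, 0.0, 3.2, 0.0, 4.2]

thr = fit_rain_threshold(neighbour, target)
slope, intercept = fit_rain_amount(neighbour, target)
filled = [impute(x, thr, slope, intercept) for x in (0.1, 2.5)]
```

Fitting the regression on rain samples only keeps the many zero-rain intervals from dominating the amount model, which is one way to read the class imbalance the abstract mentions.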

Original language	English
Article number	125126
Number of pages	12
Journal	Journal of Hydrology
Volume	588
Early online date	30 May 2020
DOIs	https://doi.org/10.1016/j.jhydrol.2020.125126
Publication status	Published - Sept 2020

Bibliographical note

This research was supported by a UKRI-NERC Constructing a Digital Environment Strategic Priority grant “Engineering Transformation for the Integration of Sensor Networks: A Feasibility Study” [NE/S016236/1 & NE/S016244/1].

Keywords

Machine learning
Data imputation
Environmental sensor networks
Precipitation
Soil moisture
Gradient boosted trees

Access to Document

10.1016/j.jhydrol.2020.125126 (Licence: Unspecified)

Civers_etal_JournalHydrology_Imputation_AAM
© 2020. This manuscript version is made available under the CC-BY-NC-ND 4.0 license http://creativecommons.org/licenses/by-nc-nd/4.0/
Accepted author manuscript, 1.23 MB (Licence: CC BY-NC-ND)

Cite this

@article{3562c87ed3f34fefaddb3b7e2ef9a20f,

title = "Imputation of missing sub-hourly precipitation data in a large sensor network: a machine learning approach",

abstract = "Precipitation data collected at sub-hourly resolution represents specific challenges for missing data recovery by being largely stochastic in nature and highly unbalanced in the duration of rain vs non-rain. Here we present a two-step analysis utilising current machine learning techniques for imputing precipitation data sampled at 30-minute intervals by devolving the task into (a) the classification of rain or non-rain samples, and (b) regressing the absolute values of predicted rain samples. Investigating 37 weather stations in the UK, this machine learning process produces more accurate predictions for recovering precipitation data than an established surface fitting technique utilising neighbouring rain gauges. Increasing available features for the training of machine learning algorithms increases performance with the integration of weather data at the target site with externally sourced rain gauges providing the highest performance. This method informs machine learning models by utilising information in concurrently collected environmental data to make accurate predictions of missing rain data. Capturing complex non-linear relationships from weakly correlated variables is critical for data recovery at sub-hourly resolutions. Such pipelines for data recovery can be developed and deployed for highly automated and near instantaneous imputation of missing values in ongoing datasets at high temporal resolutions.",

keywords = "Machine learning, Data imputation, Environmental sensor networks, Precipitation, Soil moisture, Gradient boosted trees",

author = "Benedict Chivers and John Wallbank and Steven Cole and Ondrej Sebek and Simon Stanley and Matthew Fry and Georgios Leontidis",

note = "This research was supported by a UKRI-NERC Constructing a Digital Environment Strategic Priority grant ``Engineering Transformation for the Integration of Sensor Networks: A Feasibility Study'' [NE/S016236/1 \& NE/S016244/1].",

year = "2020",

month = sep,

doi = "10.1016/j.jhydrol.2020.125126",

language = "English",

volume = "588",

journal = "Journal of Hydrology",

issn = "0022-1694",

publisher = "Elsevier Science B. V.",

}

TY - JOUR

T1 - Imputation of missing sub-hourly precipitation data in a large sensor network

T2 - a machine learning approach

AU - Chivers, Benedict

AU - Wallbank, John

AU - Cole, Steven

AU - Sebek, Ondrej

AU - Stanley, Simon

AU - Fry, Matthew

AU - Leontidis, Georgios

N1 - This research was supported by a UKRI-NERC Constructing a Digital Environment Strategic Priority grant “Engineering Transformation for the Integration of Sensor Networks: A Feasibility Study” [NE/S016236/1 & NE/S016244/1].

PY - 2020/9

Y1 - 2020/9

N2 - Precipitation data collected at sub-hourly resolution represents specific challenges for missing data recovery by being largely stochastic in nature and highly unbalanced in the duration of rain vs non-rain. Here we present a two-step analysis utilising current machine learning techniques for imputing precipitation data sampled at 30-minute intervals by devolving the task into (a) the classification of rain or non-rain samples, and (b) regressing the absolute values of predicted rain samples. Investigating 37 weather stations in the UK, this machine learning process produces more accurate predictions for recovering precipitation data than an established surface fitting technique utilising neighbouring rain gauges. Increasing available features for the training of machine learning algorithms increases performance with the integration of weather data at the target site with externally sourced rain gauges providing the highest performance. This method informs machine learning models by utilising information in concurrently collected environmental data to make accurate predictions of missing rain data. Capturing complex non-linear relationships from weakly correlated variables is critical for data recovery at sub-hourly resolutions. Such pipelines for data recovery can be developed and deployed for highly automated and near instantaneous imputation of missing values in ongoing datasets at high temporal resolutions.

KW - Machine learning

KW - Data imputation

KW - Environmental sensor networks

KW - Precipitation

KW - Soil moisture

KW - Gradient boosted trees

UR - http://www.scopus.com/inward/record.url?scp=85085739845&partnerID=8YFLogxK

U2 - 10.1016/j.jhydrol.2020.125126

DO - 10.1016/j.jhydrol.2020.125126

M3 - Article

SN - 0022-1694

VL - 588

JO - Journal of Hydrology

JF - Journal of Hydrology

M1 - 125126

ER -
