2024-25 Project (Bartlett & Keogh)

Machine learning for missing data



Professor Jonathan Bartlett at LSHTM
Email: Jonathan.Bartlett1@lshtm.ac.uk


Professor Ruth Keogh at LSHTM
Email: ruth.keogh@lshtm.ac.uk


Project Summary

Missing data are a common issue complicating the analysis of datasets from a variety of fields, in particular epidemiological studies and clinical trials. Missing data make analyses less efficient and can lead to biased results when the missing values differ systematically from the observed ones. To mitigate these issues, imputation methods are often used, which aim to replace the missing values with plausible values based on models. Unfortunately, these models are often relatively simple and are unlikely to be correctly specified, which may again lead to biased estimates.

Machine learning (ML) techniques have in recent years gained huge attention because of their often excellent performance in prediction tasks. This PhD will investigate how ML methods could be used as a route to handling missing data in statistical analyses, by predicting the values of missing data points. There are a number of methodological challenges to doing this. This project will build upon recent exciting developments in so-called ‘double ML’ or ‘debiased ML’ methods, which use ML methods in such a way that statistical results can be unbiased and estimation of uncertainty is relatively straightforward.

The project will involve applying and extending the debiased ML approach to the task of handling missing values in one or more variables involved in a statistical analysis. This will involve gaining a deep understanding of ML methods, advanced statistical theory, missing data methods, and statistical programming. A student completing the project will thus emerge extremely well equipped for careers in a broad range of data science areas, including both further academic research and commercial/industrial opportunities.

Project Key Words

Machine learning, missing data, imputation, statistics

MRC LID Themes

  • Global Health = No
  • Health Data Science = Yes
  • Infectious Disease = No
  • Translational and Implementation Research = No


MRC Core Skills

  • Quantitative skills = Yes
  • Interdisciplinary skills = Yes
  • Whole organism physiology = No

Skills we expect a student to develop/acquire whilst pursuing this project

The project will involve learning about both advanced ML techniques and rigorous statistical theory. It will also encompass statistical programming, and may involve the development of open source packages for the statistical software R.


Which route/s is this project available for?

  • 1+4 = Yes
  • +4 = Yes

Possible Master’s programme options identified by supervisory team for 1+4 applicants:

  • LSHTM – MSc Medical Statistics

Full-time/Part-time Study

Is this project available for full-time study? Yes
Is this project available for part-time study? Yes


Particular prior educational requirements for a student undertaking this project

  • LSHTM’s standard institutional eligibility criteria for doctoral study.
  • The project will require the student to have a first class or upper second class undergraduate degree in a subject with a substantial quantitative component (e.g. Mathematics, Statistics), and an MSc (or equivalent) in Statistics, or a related programme such as Biostatistics, Medical Statistics, or Statistical Science.

Other useful information

  • Potential CASE conversion? = No


Scientific description of this research project

Project objectives 
The overall aim of this PhD project is to develop, evaluate and implement machine learning (ML) methods for handling missing data in statistical analyses. Missing data are a common issue in clinical trials, designed observational epidemiological studies, analyses of electronic health records, and more broadly in data science. Imputation methods are often viewed as the state of the art for handling such missingness, but they typically assume relatively simple parametric imputation models. In practice such models are almost certainly misspecified to varying extents, leading to bias in the resulting estimates.

ML algorithms have been shown to often perform well in prediction problems. In the context of handling missing data in statistical analysis, one may thus contemplate using machine learning, rather than simple parametric models, to predict the missing values. Unfortunately such an approach has two major drawbacks. First, the resulting estimates suffer from a so-called regularization bias, which occurs because ML methods use regularization to avoid overfitting. Second, valid standard error estimation is difficult, because ML methods do not often provide valid estimates of uncertainty.
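
To make the first drawback concrete, consider the simplest target, the mean of a partially observed variable. The notation below is an illustrative assumption and does not appear in the project description: $Y$ is the variable of interest, $R$ indicates whether $Y$ is observed, $X$ denotes fully observed covariates, $m(X) = E[Y \mid X, R = 1]$, and $\hat{m}$ is an ML estimate of $m$. Assuming the data are missing at random given $X$, a naive plug-in imputation estimator and its error relative to the version using the true $m$ can be sketched as:

```latex
\hat{\mu}_{\text{plug-in}}
  = \frac{1}{n}\sum_{i=1}^{n}\bigl\{ R_i Y_i + (1 - R_i)\,\hat{m}(X_i) \bigr\},
\qquad
\hat{\mu}_{\text{plug-in}}
  - \frac{1}{n}\sum_{i=1}^{n}\bigl\{ R_i Y_i + (1 - R_i)\,m(X_i) \bigr\}
  = \frac{1}{n}\sum_{i=1}^{n} (1 - R_i)\bigl\{ \hat{m}(X_i) - m(X_i) \bigr\}.
```

The right-hand side is first order in the error of $\hat{m}$. Because regularized learners typically converge more slowly than $n^{-1/2}$, this term need not be negligible relative to sampling variability, which is one way to see where the regularization bias comes from.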

This project will build upon recent foundational developments in so-called debiased ML (https://doi.org/10.1111/ectj.12097). These developments show how ML methods can be exploited in such a way that both of the aforementioned issues are dealt with, and they have so far been applied mainly to the problem of estimating causal effects of treatments or exposures. This PhD project will apply and further develop these methods for the problem of handling missing data in an analysis. Specifically, the student will translate debiased ML methods for causal inference to the problem of estimating the mean of a variable subject to missingness, and evaluate their performance. They will then extend this methodology to the problem of handling missingness in covariates in a regression model. The performance of the methods will be evaluated on both simulated and real epidemiological datasets suffering from missing values.
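
As a rough illustration of the first step (estimating the mean of a variable subject to missingness), the sketch below combines an augmented inverse probability weighted (AIPW) estimator with cross-fitting, in the spirit of debiased ML. It is not the supervisors' method. Everything in it is an assumption made for illustration: the simulated data-generating process, the function names `simulate`, `fit_binned` and `aipw_mean`, and the binned estimator, which stands in for a flexible ML learner so that the example needs only the Python standard library. The project itself targets R; Python is used here purely for illustration. With $m(X)$ the outcome regression among observed cases and $\pi(X)$ the probability of being observed, each unit contributes $\hat{m}(X_i) + R_i\{Y_i - \hat{m}(X_i)\}/\hat{\pi}(X_i)$, where both nuisance estimates are fitted on folds that exclude unit $i$.

```python
import math
import random
import statistics


def simulate(n, seed=1):
    """Hypothetical data: X ~ U(0,1), Y = 2X + noise, so the true mean of Y is 1.
    Y is more likely to be observed when X is large (missing at random given X)."""
    rng = random.Random(seed)
    data = []
    for _ in range(n):
        x = rng.random()
        y = 2.0 * x + rng.gauss(0.0, 0.5)
        r = 1 if rng.random() < 0.2 + 0.7 * x else 0
        data.append((x, y if r else None, r))  # Y is hidden when r == 0
    return data


def fit_binned(train, n_bins):
    """Stand-in for an ML learner: per-bin mean of observed Y (outcome model)
    and per-bin fraction of observed cases (missingness model)."""
    obs_y = [[] for _ in range(n_bins)]
    counts = [0] * n_bins
    for x, y, r in train:
        b = min(int(x * n_bins), n_bins - 1)
        counts[b] += 1
        if r:
            obs_y[b].append(y)
    all_obs = [y for _, y, r in train if r]
    overall_m = statistics.fmean(all_obs)
    overall_pi = len(all_obs) / len(train)
    # Fall back to overall estimates for empty bins; clip pi away from zero.
    m = [statistics.fmean(v) if v else overall_m for v in obs_y]
    pi = [max(len(v) / c, 0.05) if c else overall_pi
          for v, c in zip(obs_y, counts)]
    return m, pi


def aipw_mean(data, n_folds=5, n_bins=10):
    """Cross-fitted AIPW estimate of E[Y] and its estimated standard error."""
    n = len(data)
    phi = [0.0] * n
    for k in range(n_folds):
        # Fit nuisance models on all folds except fold k.
        train = [d for i, d in enumerate(data) if i % n_folds != k]
        m, pi = fit_binned(train, n_bins)
        # Evaluate the estimating function on the held-out fold k.
        for i in range(k, n, n_folds):
            x, y, r = data[i]
            b = min(int(x * n_bins), n_bins - 1)
            correction = (y - m[b]) / pi[b] if r else 0.0
            phi[i] = m[b] + correction
    est = statistics.fmean(phi)
    se = statistics.stdev(phi) / math.sqrt(n)
    return est, se


data = simulate(4000)
cc_mean = statistics.fmean([y for _, y, r in data if r])  # complete-case mean
est, se = aipw_mean(data)
print(f"complete-case mean: {cc_mean:.3f}")
print(f"cross-fitted AIPW:  {est:.3f} (SE {se:.3f})")
```

In this simulated setting the complete-case mean overstates the true mean of 1, because larger values are more likely to be observed, while the cross-fitted estimate should lie close to it. The standard error comes directly from the sample standard deviation of the per-unit contributions, which is what makes uncertainty estimation comparatively simple in this framework.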

Techniques to be used 
This project will involve the student developing a deep understanding of machine learning techniques, non- and semiparametric statistical methods, and programming/software skills. These will be needed in order to develop, evaluate, and implement (in open source R software) the methods.

Confirmed availability of databases or materials
The datasets involved in this project will be a combination of simulated datasets and epidemiological and clinical datasets which are publicly available.

Potential risks and mitigations 
Since the student will use a combination of simulation and analysis of publicly available datasets, there are no risks with regard to data access. The main risk is that the theoretical extensions to the problem of handling missing data in regression models do not prove successful. Nevertheless, the supervisors have an outline plan of how this extension will work, and even if it proves in the end not to be feasible, such an outcome would still make a valuable contribution to this field, and thus also to the student’s PhD thesis.

Further reading

(Relevant preprints and/or open access articles)

Additional information from the supervisory team

  • The supervisory team has provided a recording for prospective applicants who are interested in their project. This recording should be watched before any discussions begin with the supervisory team.
    Bartlett-Keogh Recording

