Research

My research interests are in missing data and data confidentiality, with broader interests in Bayesian methods. A brief overview of my research projects is provided below, along with some selected publications. For a complete list of publications, please see my Google Scholar page. Importantly, I enjoy working collaboratively and, in particular, mentoring early career researchers in my research teams. Details of the research team members involved in the relevant projects are also provided below.

Missing data

Missing values are a pervasive problem affecting almost all areas of applied scientific research. Their presence inhibits the ability to undertake the necessary analysis and achieve the required objectives. My research seeks to develop methods that can best deal with the missing data problem and adequately account for the additional uncertainty that arises in any subsequent inferences and conclusions drawn from the incomplete data. Some challenges I am currently working on involve problems arising from Structured Missingness and Missing Not At Random (MNAR) mechanisms.
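To illustrate why MNAR is so troublesome, here is a minimal toy sketch (the variable names and numbers are invented for illustration): when the probability that a value is missing depends on the value itself, the observed sample is systematically biased.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy MNAR mechanism: larger (e.g. income-like) values are more likely to
# be missing, so missingness depends on the unobserved value itself.
income = rng.lognormal(mean=10, sigma=0.5, size=10_000)

# Probability of missingness increases with the value itself.
p_missing = 1 / (1 + np.exp(-(np.log(income) - 10)))
observed = income[rng.random(income.size) > p_missing]

# A complete-case analysis of the observed values is biased downwards
# relative to the full sample.
print(income.mean(), observed.mean())
```

Crucially, an analyst holding only `observed` cannot detect this bias from the incomplete sample alone, which is what makes MNAR an untestable assumption.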

Structured Missingness

As demand increases to process and analyse ever larger and more complex data sources, often combined at scale across a range of different modalities, missing values are likely to present significant challenges and impede effective use of the data. In particular, the phenomenon of Structured Missingness (SM), where the missing values themselves exhibit an underlying association, is increasingly encountered. As Structured Missingness Research Theme Lead at the Alan Turing Institute (ATI) I lead a programme of research to investigate and develop solutions for this problem. We hosted a series of community workshops to advance our understanding, and subsequently authored a Nature Machine Intelligence paper drawing upon a breadth of expertise to conceptualise SM, as well as describe key challenges that need to be addressed.
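A common instance of SM arises when data are combined across modalities, so that whole blocks of variables are missing together. The following minimal sketch (with invented dimensions and data) shows how the missingness indicators of such variables become perfectly associated rather than occurring as isolated missing cells:

```python
import numpy as np

rng = np.random.default_rng(1)

# Sketch of structured missingness: suppose columns 0-2 come from one data
# modality and columns 3-5 from a second modality that was never collected
# for a subset of subjects.
n, p = 8, 6
X = rng.normal(size=(n, p))

# Half the subjects are missing the whole second modality at once, so the
# missingness indicators of columns 3-5 are perfectly correlated.
missing_rows = rng.choice(n, size=n // 2, replace=False)
X[missing_rows, 3:] = np.nan

M = np.isnan(X).astype(int)  # missingness indicator matrix
print(M)
```

Standard imputation methods that treat missing cells as independent ignore exactly this block structure, which is the association that SM-aware modelling and imputation strategies aim to exploit.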

I also lead an ATI grant on developing modelling and imputation strategies that account for SM. The core research team involved with this component of the research is:

  • James Jackson
  • Elrozy Andrinopoulou (Co-investigator)
  • Ana Basiri (Co-investigator)
  • Niels Hagenbuch (Roche collaborator)
  • Sarah McGough (Roche collaborator)
  • Chris Harbron (Roche collaborator and partnership lead)

Importantly, we can view SM both as a problem to be addressed and as a source of information to be leveraged in its own right. As a first step, we have authored a paper providing a complete characterisation of SM; a pre-print is available here.

Addressing the missing data problem at the design stage

Research team:

  • Kim Lee
  • Stefanie Biedermann (Co-supervisor)

When possible, approaching the missing data problem from a design perspective offers some important advantages, essentially mitigating the problem at conception, before any data have been collected. For some of my work in this area, see the papers published in Statistica Sinica here and here, as well as in Statistics in Medicine here.

Optimal recovery sampling to test for MNAR

Research team:

  • Jack Noonan
  • Stefanie Biedermann (Co-investigator)

Missing data arising under MNAR is a very challenging problem and often results in biased inferences. Crucially, MNAR is an untestable assumption, i.e. one cannot determine whether it is present based solely on the original incomplete sample. We recently received funding from EPSRC to develop methodology to optimise tests for the presence of MNAR based on recovery sampling and experimental design methods. Details about the project are available here, and for a preprint on our work see here.

Data confidentiality

We are increasingly living in a world where access to data is essential to drive our decisions, from formulating public policy to developing state-of-the-art scientific research methods. However, much of these data comprise confidential information and are collected by statistical agencies that are obliged to protect the privacy of respondents. A common approach is to apply a statistical disclosure control method to the data prior to release, such as recoding/coarsening or data swapping. While this may limit confidentiality risks, it often reduces the utility of the data. There is thus typically a natural tension between the need to provide useful data to users while also protecting the confidentiality of the data, which motivates my research interests in this area.

I am particularly interested in exploring synthetic data methods, which offer some appealing features over traditional approaches. In particular, I explore developing statistical models that can be used to simulate synthetic representatives of the original data. In theory, provided the model captures the relationships present in the original data, these relationships should be reflected in the synthetic data, leading to data with high utility. Additionally, as the synthetic data comprise simulated values, confidentiality risks should be minimised. In practice, developing an optimal implementation is challenging!

Developing synthetic data methods for large confidential administrative databases

Research team:

  • James Jackson
  • Brian Francis (Co-supervisor)

I received an ESRC CASE studentship (2019-2022) to investigate methods to synthesise large administrative databases, with the Office for National Statistics as industrial partner. We developed an innovative series of methods based on saturated models, which typically serve little practical purpose in traditional modelling but have some unique advantages for synthetic data generation. In particular, we are able to easily synthesise large tabular administrative databases, often in just a few seconds, with just a few tuning parameters (facilitating implementation by synthesisers), and the methods come with well-understood mathematical and statistical properties, allowing properties of the synthesis to be established a priori (i.e. before performing any synthesis). For details on the methods, see our paper published in the Journal of the Royal Statistical Society Series A here, as well as further details published in the Lecture Notes in Computer Science here.
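To convey the core idea, here is a hedged, minimal sketch (the cell counts are invented, and this is only a simplification of the published method): under a saturated count model each cell of the contingency table has its own mean, estimated by the observed count, so synthesis reduces to drawing each synthetic cell count from a count distribution centred on the original count.

```python
import numpy as np

rng = np.random.default_rng(2)

# Invented cell counts of a small contingency table built from the
# categorical variables of an administrative database.
original_counts = np.array([120, 45, 3, 0, 980, 61])

# Under a saturated Poisson model, each synthetic cell count is drawn
# independently with mean equal to the observed cell count. This is why
# synthesis is fast even for very large tables: it is one draw per cell.
synthetic_counts = rng.poisson(original_counts)
print(synthetic_counts)
```

Swapping the Poisson for a more dispersed count distribution would inject additional noise into the counts, illustrating how a small number of tuning parameters can trade utility against confidentiality risk.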

Bayesian methods

I often approach problems from a Bayesian perspective: it allows intuitive inferences to be made, making it easy to communicate results to practitioners and readily address the applied scientific research questions.

Bayesian model-based clustering for multiple network data

Research team:

  • Anastasia Mantziou
  • Simon Lunagomez (Co-supervisor)

In a recent PhD project we developed methods to make inferences with multiple network data, i.e. where each data point is a network. A measurement error formulation allowed us to incorporate heterogeneity in the data by clustering the networks and, importantly, to characterise each cluster with a representative network to facilitate cluster inference. A paper on this has recently been accepted in the Annals of Applied Statistics and a pre-print can be obtained here.

Propriety of Cauchy priors in Bayesian logistic regression

In another collaborative project we determined posterior propriety of Cauchy priors in logistic regression when the problem of separation is likely, a setting that has received a lot of attention. In particular, Professor Andrew Gelman commented on it in his blog. The paper was published in Bayesian Analysis and can be obtained here.
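The separation problem can be illustrated with a tiny invented dataset: when a covariate perfectly predicts the outcome, the logistic log-likelihood keeps improving as the slope grows without bound, so the maximum likelihood estimate does not exist and a proper prior (such as a Cauchy prior) is needed to obtain a proper posterior.

```python
import numpy as np

# Completely separated data: all y = 0 fall at x < 0 and all y = 1 at x > 0.
x = np.array([-2.0, -1.0, 1.0, 2.0])
y = np.array([0, 0, 1, 1])

def loglik(beta):
    """Logistic regression log-likelihood (no intercept) at slope beta."""
    eta = beta * x
    return np.sum(y * eta - np.log1p(np.exp(eta)))

# The log-likelihood increases monotonically as beta grows, approaching 0
# from below, so no finite maximiser exists.
print([round(loglik(b), 4) for b in (1.0, 5.0, 25.0)])
```

This is why posterior propriety under heavy-tailed priors such as the Cauchy is a genuine question rather than a formality: the likelihood alone does not decay in the separated direction.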