Novel statistical methodology for the analysis of medical data

Unit Head
Professor John Hopper

Project Details

The aim of the project is to develop novel statistical methodology and theory for the improvement of regression and association analysis, with particular application to genetic, twin and health data.

The project has two primary objectives:

To develop novel statistical techniques for the analysis and interpretation of regression and association models, to explore the theoretical properties of these methods, and to make these methods available to researchers through the creation of appropriate software tools.
To apply these techniques to the analysis of medical data from a wide range of diseases and in a wide range of settings, e.g., genome-wide association studies and twin studies.

Project Summary

The aim of this project is to support the development of novel statistical methodology to improve the analysis of medical data, and to introduce modern cross-disciplinary ideas, such as those from machine learning, to the medical setting. As such, there are four broad sub-projects being undertaken:

(i) Development of novel approaches to variable selection in potentially high-dimensional regression problems. This sub-project explores novel resampling and information-theoretic techniques to help select important exposure variables when the number of exposures is very large. This work, which is applicable to penalised regression approaches as well as to non-parametric methods such as decision trees, has formed the basis of the DEPTH procedure for the analysis of genome-wide association studies being undertaken at the Centre for Epidemiology and Biostatistics.
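
As a rough illustration of resampling-based variable selection in general, the sketch below draws bootstrap resamples, ranks every exposure by its absolute correlation with the outcome in each resample, and reports how often each exposure lands in the top few. This is not the DEPTH procedure, whose details are not given here; the marginal-correlation ranking, the simulated data, and all names (`selection_frequencies`, `top_k`, `n_resamples`) are assumptions made for the example.

```python
import random


def correlation(a, b):
    """Pearson correlation between two equal-length lists."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    s_ab = sum((u - mean_a) * (v - mean_b) for u, v in zip(a, b))
    s_aa = sum((u - mean_a) ** 2 for u in a)
    s_bb = sum((v - mean_b) ** 2 for v in b)
    return s_ab / (s_aa * s_bb) ** 0.5


def selection_frequencies(rows, outcome, top_k, n_resamples, seed=0):
    """Fraction of bootstrap resamples in which each exposure ranks in the top_k."""
    rng = random.Random(seed)
    n, p = len(rows), len(rows[0])
    counts = [0] * p
    for _ in range(n_resamples):
        # Bootstrap: sample individuals with replacement.
        idx = [rng.randrange(n) for _ in range(n)]
        y = [outcome[i] for i in idx]
        scores = [abs(correlation([rows[i][j] for i in idx], y)) for j in range(p)]
        ranked = sorted(range(p), key=lambda j: scores[j], reverse=True)
        for j in ranked[:top_k]:
            counts[j] += 1
    return [c / n_resamples for c in counts]


# Simulated data (an assumption): 30 exposures, only the first two affect the outcome.
rng = random.Random(1)
n_individuals, n_exposures = 200, 30
rows = [[rng.gauss(0.0, 1.0) for _ in range(n_exposures)] for _ in range(n_individuals)]
outcome = [r[0] + r[1] + rng.gauss(0.0, 1.0) for r in rows]

freq = selection_frequencies(rows, outcome, top_k=3, n_resamples=50)
for j in sorted(range(n_exposures), key=lambda j: freq[j], reverse=True)[:5]:
    print(f"exposure {j}: selected in {freq[j]:.0%} of resamples")
```

Exposures that are selected in nearly every resample are the stable candidates; those that appear only occasionally are more likely to reflect sampling noise. A penalised regression or tree-based ranking could replace the correlation score without changing the resampling loop.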

(ii) Cross-disciplinary application of machine learning methods to the analysis of medical data. In this sub-project, modern machine learning methods, such as decision trees/forests and mixture modelling, are being applied to extract more information from costly medical data. As part of this sub-project, software has been developed to enable medical researchers in the School to make use of these novel approaches, and several collaborations within the School are currently under way.

(iii) Inference on Causality by Examination of Familial Confounding (ICE-FALCON), a regression approach to inferring causation using twin or family data. Risk factors for disease and determinants of health-related outcomes are usually correlated, at least to some extent, within pairs of relatives, especially twin pairs. If a predictor (risk factor or determinant) is causal for the outcome, this familial association induces a cross-trait cross-pair correlation between the predictor of one relative and the outcome of the other relative. Under such a causal relationship, however, once a given relative's own predictor status is known, the predictor status of their relative is no longer predictive of their own outcome. We are using this concept, together with data on pairs of relatives such as twins and siblings, to look for evidence consistent with measured predictors being causal.
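
The attenuation idea described above can be illustrated with a small simulation. The sketch below is a simplified illustration, not the ICE-FALCON implementation: it uses ordinary least squares on one member of each simulated pair, whereas an analysis of real pair data would need to account for within-pair clustering. The data-generating model, the parameter values (`rho`, `beta`), and all function names are assumptions made for the example.

```python
import random


def simulate_pairs(n_pairs, rho, beta, seed=0):
    """Simulate pairs whose predictors correlate at rho within the pair.

    Each outcome depends only on that person's own predictor (a causal effect
    of size beta); there is no other familial confounding in this model.
    """
    rng = random.Random(seed)
    pairs = []
    for _ in range(n_pairs):
        x_self = rng.gauss(0.0, 1.0)
        x_co = rho * x_self + (1.0 - rho ** 2) ** 0.5 * rng.gauss(0.0, 1.0)
        y_self = beta * x_self + rng.gauss(0.0, 1.0)
        pairs.append((x_self, x_co, y_self))
    return pairs


def cov(a, b):
    """Sample covariance (dividing by n) between two equal-length lists."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    return sum((u - mean_a) * (v - mean_b) for u, v in zip(a, b)) / n


def slope_one(y, x):
    """Least-squares slope of y on a single predictor x (with intercept)."""
    return cov(x, y) / cov(x, x)


def slopes_two(y, x1, x2):
    """Least-squares slopes of y on x1 and x2 jointly (with intercept)."""
    s11, s22, s12 = cov(x1, x1), cov(x2, x2), cov(x1, x2)
    s1y, s2y = cov(x1, y), cov(x2, y)
    det = s11 * s22 - s12 ** 2
    return (s22 * s1y - s12 * s2y) / det, (s11 * s2y - s12 * s1y) / det


pairs = simulate_pairs(n_pairs=20000, rho=0.6, beta=0.5)
x_self = [p[0] for p in pairs]
x_co = [p[1] for p in pairs]
y_self = [p[2] for p in pairs]

# Outcome regressed on the relative's predictor alone.
marginal_co = slope_one(y_self, x_co)
# Outcome regressed on own predictor and the relative's predictor together.
adj_self, adj_co = slopes_two(y_self, x_self, x_co)

print(f"relative's predictor, unadjusted:         {marginal_co:.3f}")
print(f"relative's predictor, adjusted for own:   {adj_co:.3f}")
print(f"own predictor, adjusted for relative's:   {adj_self:.3f}")
```

In this simulated causal setting, the coefficient for the relative's predictor is clearly non-zero on its own and shrinks towards zero once the person's own predictor is included, which is the pattern the paragraph above describes.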