Classification and prediction of missing ethnicity in diabetic patients from electronic health record data

Date of ISAC Approval: 
Lay Summary: 
There are known ethnic disparities in both the proportion of people affected by type 2 diabetes (prevalence) and the risk of major cardiovascular outcomes and death. As type 2 diabetes is largely managed in primary care, the recording of ethnicity in primary care Electronic Health Records is vital for identifying ethnic patterns in the overall number of people affected, the number of new cases of type 2 diabetes, its management, and its outcomes. Across the UK National Health Service, ethnicity is typically grouped into the 6 main categories of the UK Census: white, south Asian, black African/Caribbean, mixed, other or unknown. Previous work in the CPRD has identified that approximately 30% of all individuals with type 2 diabetes have unknown (missing) ethnicity. When trying to assess ethnic differences in diabetes management, it is possible that such missingness may bias the results. The aim of this project is to characterise the unknown ethnic group, i.e. to establish how the measured variables differ or are similar between ethnic groups. In practice this will be pursued by classifying and predicting ethnic groups from baseline variables such as blood measurements, demographics, and comorbidities. Determining whether, and which, variables influence the classification/prediction of unknown ethnicity will be crucial in informing decisions about how to deal with unknown ethnicity when answering substantive questions about ethnic differences in diabetes management in the future.
Technical Summary: 
The aim of this project is to investigate how to characterise patients of different ethnic groups, including unknown ethnicity, using multivariate statistical methods such as principal component and cluster analyses on different sets of baseline variables (biomarkers, demographics, comorbidities), separately and combined. Such multivariate techniques will allow us to explore whether, and which, sets of variables identify groupings corresponding to the different ethnic groups. Subsequently, discriminant analysis techniques will be applied, depending on the results of the clustering, with the aim of predicting ethnicity/missing ethnicity. Such techniques can be parametric (e.g. linear discriminant analysis, logistic regression) or non-parametric (e.g. classification trees, random forests). Considerations about the assumptions required and interpretative issues will lead us to use either type of technique and/or to compare the performance of different prediction techniques. Which variables contribute to the classification of missing vs non-missing ethnicity will also be an important question addressed with these multivariate techniques. The impact of using general practice level information as a proxy for patient level information (for example, using the deprivation score of the general practice postcode instead of the patient's home postcode) will also be assessed when using such variables for clustering and classification. This work will help us assess how to deal with missing ethnicity, will inform decisions about the use of imputation methods for the unknown ethnic group in future causal analyses on the effect of ethnicity, and will provide insight into the impact of coarser measurement of the exposure.
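As an illustration of the missing vs non-missing question only, the sketch below fits a logistic regression of a missingness indicator on standardised baseline variables and prints the coefficients. It is not the investigators' analysis: the data are simulated, the variable names (age, hba1c, imd_quintile) and the assumed missingness mechanism are hypothetical, and plain gradient descent stands in for the fitting routine of a statistical package.

```python
import math
import random


def sigmoid(z):
    # Numerically stable logistic function.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)


def standardise(rows):
    # Centre and scale each column to mean 0 and SD 1 so coefficients are comparable.
    n, p = len(rows), len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(p)]
    sds = [
        math.sqrt(sum((r[j] - means[j]) ** 2 for r in rows) / n) or 1.0
        for j in range(p)
    ]
    return [[(r[j] - means[j]) / sds[j] for j in range(p)] for r in rows]


def fit_logistic(X, y, lr=0.5, epochs=500):
    # Batch gradient descent on the mean logistic log-loss.
    n, p = len(X), len(X[0])
    w, b = [0.0] * p, 0.0
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * p, 0.0
        for xi, yi in zip(X, y):
            err = sigmoid(b + sum(wj * xj for wj, xj in zip(w, xi))) - yi
            grad_b += err
            for j in range(p):
                grad_w[j] += err * xi[j]
        b -= lr * grad_b / n
        w = [wj - lr * g / n for wj, g in zip(w, grad_w)]
    return w, b


def simulate(n=400, seed=0):
    # Synthetic patients (hypothetical): missingness is made to depend on age only.
    rng = random.Random(seed)
    rows, missing = [], []
    for _ in range(n):
        age = rng.gauss(60, 12)
        hba1c = rng.gauss(7.5, 1.2)
        imd_quintile = float(rng.randint(1, 5))
        p_missing = sigmoid(-0.85 + 0.08 * (age - 60))
        rows.append([age, hba1c, imd_quintile])
        missing.append(1 if rng.random() < p_missing else 0)
    return rows, missing


names = ["age", "hba1c", "imd_quintile"]
rows, missing = simulate()
X = standardise(rows)
w, b = fit_logistic(X, missing)
for name, coef in zip(names, w):
    print(f"{name}: {coef:+.2f} (log-odds of missing ethnicity per SD)")
```

Because the simulated missingness depends on age alone, the age coefficient should dominate. With real records, the same comparison of standardised coefficients would indicate which baseline variables are associated with ethnicity being unrecorded.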
Health Outcomes to be Measured: 

Dr Luigi Palla - Chief Investigator - London School of Hygiene & Tropical Medicine (LSHTM)
Professor Liam Smeeth - Collaborator - London School of Hygiene & Tropical Medicine (LSHTM)
Dr Luigi Palla - Corresponding Applicant - London School of Hygiene & Tropical Medicine (LSHTM)
Dr Rohini Mathur - Collaborator - London School of Hygiene & Tropical Medicine (LSHTM)
Ms Samantha Kwong - Collaborator - London School of Hygiene & Tropical Medicine (LSHTM)

Patient IMD; Practice IMD (Standard)