Providing more accurate predictions of colorectal cancer prognosis: development and application of novel methodology when using electronic health record data for disease prognosis

Study type
Protocol
Date of Approval
Study reference ID
24_003720
Lay Summary

Risk prediction models use personal traits (like age or disease stage) to work out how likely someone is to survive after being diagnosed with a disease, such as cancer. Doctors use these models to decide how best to care for a given patient. It is important for these models to be developed using data which reflects society. For example, if certain ethnicities aren’t considered, the model may predict incorrect outcomes for people from those groups. It is important that these predictions are reliable, and many factors can affect how well these models perform, such as: being developed on data that doesn’t represent the target population, missing or outdated data, or risk factors (like cancer stage) changing over time.

Having complete data is important for the accuracy of these models. When data is missing, the models become less reliable. Missing data is common in healthcare settings. Data is collected when people visit the doctor or go to hospital. If someone is healthier and doesn’t visit the doctor as much, their records will have less data. This data will also be less up to date. This can affect how well prediction models work.

There are 3 main aims to this project:

1. Assess how missing data, and the way it’s missing, affect model outcomes

2. Understand how having other health conditions alongside cancer alters a person's outcomes

3. Improve how disease survival gets reported, and ensure these models serve everyone equally, including minority groups

Technical Summary

Prognostic models help healthcare providers estimate disease risk and inform clinical decision making. Model-based survival estimates often fail to account for time-dependent effects, leading to invalid predictions. Large-scale electronic health records (EHRs) allow these models to be developed in populations that are theoretically representative of the whole population, whilst avoiding under-sampling. However, EHRs are vulnerable to data quality issues, including informative missingness, which affect model estimates. Some population subgroups are underserved by these predictive models. Ensuring equitable healthcare for all is vital.

Aims:
i. Develop and validate a prognostic survival model for bowel cancer, accounting for variation in ethnicity and multimorbidity status
ii. Utilize flexible parametric modelling to create more accurate survival predictions following a diagnosis of bowel cancer, incorporating time-dependent effects
iii. Develop a risk prediction webtool to communicate bowel cancer prognosis in a patient-focused manner

Flexible parametric survival modelling will be used to develop a prognostic survival model for bowel cancer. This allows for more accurate predictions because the underlying hazard function is modelled flexibly rather than being forced into a simple parametric shape. Measures of relative survival, crude probability of death, cumulative incidence functions, and all-cause survival will be estimated across a range of time points. Conditional measures will also be used to demonstrate clinically relevant applications of the derived model. Predictor selection will combine clinical expert opinion and statistical selection methods.
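
As an illustration of two of the measures named above, the following minimal Python sketch computes relative survival (observed survival divided by expected survival) and a conditional survival probability. The function names and all numbers are hypothetical and chosen purely for illustration; they are not study results, and the sketch does not reproduce the flexible parametric model itself.

```python
import numpy as np

def relative_survival(observed, expected):
    """Ratio of observed (all-cause) survival in the patient group to the
    expected survival of a comparable general-population group."""
    observed = np.asarray(observed, dtype=float)
    expected = np.asarray(expected, dtype=float)
    return observed / expected

def conditional_survival(surv_at_t, surv_at_s):
    """Probability of surviving to time t given survival to an earlier time s."""
    return surv_at_t / surv_at_s

# Hypothetical survival proportions at 1, 3, 5 and 10 years (illustrative only).
years = [1, 3, 5, 10]
observed = np.array([0.80, 0.65, 0.57, 0.45])
expected = np.array([0.97, 0.91, 0.85, 0.70])

rs = relative_survival(observed, expected)
for year, value in zip(years, rs):
    print(f"{year}-year relative survival: {value:.3f}")

# 5-year survival conditional on having already survived 1 year.
print(f"5-year | 1-year: {conditional_survival(observed[2], observed[0]):.4f}")
```

In practice the expected survival would come from population life tables matched on characteristics such as age, sex and calendar year; the fixed array here is only a stand-in.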

Significant attention will be paid to handling informative missingness in electronic health records. In doing so, the developed model will provide novel predictive capacity which is more equitable for population subgroups which more frequently have systematically missing primary care data.

A risk communication tool will be developed using the model, aiming to improve public awareness of disease prognosis and encourage uptake of early detection interventions such as the national bowel cancer screening program.

Health Outcomes to be Measured

1. 1-, 3-, 5-, and 10-year relative survival (stratified by deprivation, age group, sex). Relative survival compares the observed survival of a group of patients diagnosed with a specific disease, typically cancer, with the expected survival of a similar group from the general population without the disease, isolating the impact of the disease on survival;

2. 1-, 3-, 5-, and 10-year all-cause survival (stratified by deprivation, age group, sex, cancer treatment, BMI, smoking status, cancer treatment options, ethnicity, comorbidity status);

3. 1-, 3-, 5-, and 10-year crude probabilities of death (stratified by deprivation, age group, sex, cancer treatment, BMI, smoking status, cancer treatment options, ethnicity, comorbidity status);

4. Prognostic model performance metrics:
a. Brier Score
b. Calibration Slope
c. Calibration-in-the-large
d. Discrimination
e. Index of Predictive Accuracy

5. Hazard ratios of model predictors;

6. Optimal lookback window for inclusion of multimorbidity as a prognostic factor.
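
Some of the performance metrics listed under item 4 can be sketched as follows for predicted risks at a single time horizon. This is a simplified illustration that assumes every patient's status at the horizon is known (no censoring); with real survival data, censoring would need to be handled, for example with inverse probability of censoring weights. The function names and the toy data are hypothetical, and calibration-in-the-large is shown here in its simplest form, as mean observed minus mean predicted risk.

```python
import numpy as np

def brier_score(event, predicted_risk):
    """Mean squared difference between predicted risk and observed outcome (0/1)."""
    event = np.asarray(event, dtype=float)
    predicted_risk = np.asarray(predicted_risk, dtype=float)
    return float(np.mean((predicted_risk - event) ** 2))

def observed_minus_expected(event, predicted_risk):
    """Simple calibration-in-the-large: mean observed outcome minus mean predicted risk."""
    event = np.asarray(event, dtype=float)
    predicted_risk = np.asarray(predicted_risk, dtype=float)
    return float(np.mean(event) - np.mean(predicted_risk))

def c_statistic(event, predicted_risk):
    """Discrimination: proportion of (event, non-event) pairs in which the
    patient with the event had the higher predicted risk (ties count half)."""
    event = np.asarray(event).astype(bool)
    risk = np.asarray(predicted_risk, dtype=float)
    cases, controls = risk[event], risk[~event]
    diff = cases[:, None] - controls[None, :]
    return float(((diff > 0).sum() + 0.5 * (diff == 0).sum()) / diff.size)

# Toy data: 1 = died before the horizon, 0 = alive at the horizon (illustrative only).
event = [1, 0, 1, 0]
risk = [0.9, 0.2, 0.6, 0.4]

print("Brier score:", round(brier_score(event, risk), 4))
print("Observed minus expected:", round(observed_minus_expected(event, risk), 4))
print("C-statistic:", c_statistic(event, risk))
```

A lower Brier score indicates better overall accuracy, an observed-minus-expected value near zero indicates good average calibration, and a C-statistic of 0.5 corresponds to no discrimination.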

Collaborators

Aiden Smith - Chief Investigator - University of Leicester
Aiden Smith - Corresponding Applicant - University of Leicester
Claire Lawson - Collaborator - University of Leicester
Farah Khasawneh - Collaborator - University of Leicester
Karen Brown - Collaborator - University of Leicester
Mark Rutherford - Collaborator - University of Leicester
Paul Lambert - Collaborator - University of Leicester
Timothy Morris - Collaborator - University College London (UCL)

Linkages

NCRAS Cancer Registration Data; NCRAS Systemic Anti-Cancer Treatment (SACT) data; ONS Death Registration Data; Patient Level Index of Multiple Deprivation; CPRD GOLD Ethnicity Record