CPRD linked data

Anonymised primary care patient data can be individually linked to secondary care and other health and area-based datasets. This linkage enables CPRD to provide a fuller picture of the patient care record to support vital public health research, informing advances in patient safety and delivery of care. CPRD is expanding its healthcare data and research services to increase both the coverage of primary care data and the number of datasets that are linked and made available on a routine basis to the research community.

Data linkage in England is carried out by the Trusted Third Party NHS Digital. For further information please contact CPRD Enquiries at enquiries@cprd.com.

Linked datasets currently available include:

Source data

Publication: Padmanabhan S, Carty L, Cameron E, Ghosh RE, Williams R, Strongman H. Approach to record linkage of primary care data from Clinical Practice Research Datalink to other health-related patient data: overview and implications. Eur J Epidemiol, 2018.

Information about the non-standard linkage service

Information about the Mother-baby link and Pregnancy Register

Availability of linked data 

Linkage of CPRD primary care data with other patient level datasets is available for English practices that have consented to participate in the linkage scheme. Each GP practice participating in CPRD's collection of primary care data can revoke its consent for data collection at any point.

CPRD respects all patient opt-outs. Patients who have registered an opt-out will not be extracted for CPRD research or for data linkage.

We are working to release quarterly updates of priority linkages to support COVID-19 research. These priority linkages comprise the Public Health England (PHE) Second Generation Surveillance System (SGSS) COVID-19 virology test data, PHE COVID-19 Hospitalisation in England Surveillance System (CHESS), Hospital Episodes Statistics Admitted Patient Care, Office for National Statistics mortality data, and small area deprivation data.

The latest linkage set (set 21) contains ONS deaths data (to 16/11/2020), HES APC/OP/DID (to 30/10/2020), HES A&E (to 31/03/2020), SGSS and CHESS (to 29/09/2020), NCRAS cancer registrations/SACT/RTDS (to 2018) and small area data. In the CPRD GOLD January 2021 build, 9,268,968 acceptable patients are eligible for at least one linkage; in the CPRD Aurum January 2021 build, 37,714,624 acceptable patients are eligible for at least one linkage.

The availability of linked data by linkage set is summarised in the ‘Which linkage set should I use’ document below. If you are unsure about which linked dataset and/or source file should be used in your study, please contact us on enquiries@cprd.com.

Download:

(PDF, 152KB, 1 page)

Access to linked data 

Access to patient level data is dependent on approval of a study protocol via the Research Data Governance (RDG) process. All required linked data sources must be requested on the application form. Additionally, researchers who are first-time users of a linked dataset must contact the CPRD Observational Research Team to discuss their requirements before submitting their application. Linked data are provided by CPRD only as part of a data extract linked to CPRD primary care data.

COVID-19 data

CPRD-linked COVID-19 datasets comprise:

1. Public Health England (PHE) Second Generation Surveillance System (SGSS) COVID-19 virology test data

2. PHE COVID-19 Hospitalisation in England Surveillance System (CHESS)

We are also working to provide CPRD primary care data linked to:

3. Intensive Care National Audit and Research Centre (ICNARC) data on COVID-19 intensive care admissions. 

Second Generation Surveillance System (SGSS)

SGSS is the national laboratory reporting system used in England to capture routine laboratory data on infectious diseases and antimicrobial resistance. SARS-CoV-2 testing started in UK laboratories on 24/02/2020, with the SGSS data reflecting testing (swab samples, PCR test method) offered to those in hospital and NHS key workers (i.e. Pillar 1). The CPRD-SGSS linked data currently contain positive test results only.

Access to linked SGSS data is subject to prior approval. This dataset is not covered by existing licences, and data can only be released to organisations within the UK/EU/EEA.

The latest release of CPRD-SGSS data covers the period 01/03/2020 to 29/09/2020.

Please click on the link below to download the documentation relating to CPRD-SGSS data.

COVID-19 Hospitalisation in England Surveillance System (CHESS)

PHE established CHESS across all NHS Trusts in England on 15/03/2020 to collect epidemiological data on COVID-19 infection in persons requiring hospitalisation and ICU/HDU admission. Trends in hospital and critical care admission rates need to be interpreted in the context of testing recommendations, which changed over time.

Access to linked CHESS data is subject to prior approval. This dataset is not covered by existing licences, and data can only be released to organisations within the UK/EU/EEA.

The latest release of CPRD-CHESS data covers admissions to 29/09/2020.

Please click on the link below to download the documentation relating to CPRD-CHESS data.

Further information about these COVID-19-specific datasets is available in the summary document below. 

Small area level data

Classifications based on the population characteristics of small areas or neighbourhoods (and the individuals who live there) are available for linkage to CPRD primary care data. CPRD has linked GP practice postcodes and eligible patient residence postcodes for both CPRD GOLD and CPRD Aurum to some of the most commonly requested area level data. These include several measures of area level deprivation, a rural-urban classification, and a Clinical Commissioning Group (CCG) pseudonym (practice level, England only). These measures can be used as a proxy for socio-demographic and socio-economic data, which are generally poorly recorded in primary care data because they do not directly relate to a patient's care.

For each measure the postcode of the practice or patient residence is mapped to lower layer Super Output Area (LSOA), SOA in Northern Ireland or datazone (DZ) in Scotland using a postcode lookup file.
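The mapping step described above can be sketched as a simple dictionary lookup. This is only an illustration under stated assumptions: the lookup contents, the area codes and the function names are hypothetical and are not taken from any CPRD lookup file.

```python
from typing import Dict, Optional

# Hypothetical lookup table: normalised postcode -> small area code.
# The postcodes and area codes here are made up for illustration only.
POSTCODE_TO_AREA: Dict[str, str] = {
    "AB12CD": "E01000001",
    "EF34GH": "E01000002",
}


def normalise_postcode(postcode: str) -> str:
    """Upper-case a postcode and strip all whitespace so lookups are consistent."""
    return "".join(postcode.split()).upper()


def area_for_postcode(postcode: str, lookup: Dict[str, str]) -> Optional[str]:
    """Return the small area code for a postcode, or None if it is not in the lookup."""
    return lookup.get(normalise_postcode(postcode))


matched = area_for_postcode("ab1 2cd", POSTCODE_TO_AREA)
unmatched = area_for_postcode("ZZ9 9ZZ", POSTCODE_TO_AREA)
```

Normalising before the lookup avoids missed matches caused by spacing or letter case; postcodes that do not appear in the lookup simply return no area.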

Patient postcode linked deprivation measures

Patient postcode linked measures are available for patients in English practices that have consented to participate in the linkage scheme. The latest available patient postcode of residence is mapped to an LSOA boundary. The LSOA of residence then allows linkage to the following LSOA-level deprivation measures:

  • 2004 English Index of Multiple Deprivation
  • 2007 English Index of Multiple Deprivation
  • 2010 English Index of Multiple Deprivation
  • 2015 English Index of Multiple Deprivation (composite and individual domains)
  • Townsend Deprivation Index: calculated using unadjusted 2001 census data
  • Carstairs Index using 2011 census data

Data are provided as quintiles, deciles or twentiles of the deprivation score to prevent disclosure of patient location. To prevent the possibility of deductive disclosure of a patient's area of residence, researchers will only be provided with one of the above linked datasets for any one study. Access is provided by CPRD subject to approval.
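To illustrate how a continuous deprivation score can be coarsened into quintiles, deciles or twentiles, here is a minimal sketch. It is not CPRD's method: the function name, the rank-based banding rule and the direction of the bands (band 1 holding the lowest scores) are all assumptions made for the example, and the area codes and scores are invented.

```python
from typing import Dict


def assign_bands(scores: Dict[str, float], n_bands: int) -> Dict[str, int]:
    """Rank areas by score and split the ranking into n_bands equal-sized groups.

    Band 1 holds the lowest scores and band n_bands the highest (an assumption
    for this sketch; check the dataset documentation for the real direction).
    """
    ranked = sorted(scores, key=lambda area: scores[area])
    total = len(ranked)
    return {
        area: (rank * n_bands) // total + 1
        for rank, area in enumerate(ranked)
    }


# Ten made-up areas with made-up scores 1.0 to 10.0.
example_scores = {f"AREA{i:02d}": float(i) for i in range(1, 11)}

quintiles = assign_bands(example_scores, 5)   # 5 bands
deciles = assign_bands(example_scores, 10)    # 10 bands
```

Because only the band is released rather than the score itself, many areas share each value, which is what makes it harder to work back to an individual location.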

Practice postcode linked deprivation measures

The general practice postcode linkages are available for all practices in CPRD GOLD and CPRD Aurum and use the general practice postcode which is linked via LSOA, SOA in Northern Ireland and datazone (DZ) in Scotland.

The general practice postcode linkage includes a Clinical Commissioning Group (CCG) pseudonym (England only) and several well-known area-based measures of deprivation:

  • 2015 English Index of Multiple Deprivation (composite and individual domains)
  • 2016 Scottish Index of Multiple Deprivation (composite and individual domains)
  • 2017 Northern Ireland Index of Multiple Deprivation (composite and individual domains)
  • 2014 Welsh Index of Multiple Deprivation (composite and individual domains)
  • Carstairs Index: England, Wales and Scotland calculated using 2011 census data

As standard, the most recent national Indices of Deprivation are provided for each country. It is important to note that the IMD indices are not comparable between countries in the UK. Data are provided as quintiles or deciles of the deprivation score to prevent disclosure of patient location. To prevent the possibility of deductive disclosure of the location of a practice, researchers will only be provided with one practice level linkage for any one study. Access is provided by CPRD subject to approval.

Rural-Urban classification

It may be important to distinguish between rural and urban areas when investigating differences in social and economic characteristics of small areas. Populations can vary in their composition between urban and rural areas, as can access to services, employment and educational opportunities, and quality of life. The measures available for patient (England only) and practice postcode are:

  • 2011 England and Wales Rural-Urban classification
  • 2015 Northern Ireland Rural-Urban classification
  • 2016 Scottish Rural-Urban classification

Access is provided by CPRD subject to approval.

For more information about data linkage and prices, please contact CPRD Enquiries at enquiries@cprd.com.

Data from NHS Digital

NHS Digital has responsibility for standardising, collecting and publishing data and information from across the health and social care system in England.

CPRD linked data from NHS Digital include Hospital Episode Statistics (HES), a database containing details of all admissions, Accident and Emergency attendances and outpatient appointments at NHS hospitals in England; ONS mortality data; and Mental Health Datasets.

HES Admitted Patient Care data

HES Admitted Patient Care (HES APC) data contains details of all admissions to, or attendances at English NHS healthcare providers. It includes private patients treated in NHS hospitals, patients resident outside of England and care delivered by treatment centres (including those in the independent sector) funded by the NHS. All NHS healthcare providers in England, including acute hospital trusts, primary care trusts and mental health trusts provide data.

HES APC data include the complete set of hospital episode information for each linked patient with a hospitalisation record: admission and discharge dates, diagnoses (with the primary diagnosis identified), the specialists under whom the patient was seen, and procedures undertaken. In addition, augmented care data (intensive and/or high dependency levels of care) and maternity data are available.

Diagnostic data recorded in HES are coded using the International Classification of Diseases version 10 (ICD-10) coding frame; procedure information is coded using the UK Office of Population Censuses and Surveys classification (OPCS) version 4.6.

Requests for HES APC data access are subject to prior approval.

The latest release of HES APC data (set 21) covers the period April 1997 to October 2020. 

Please click on the link below to download the documentation which provides an overview of the HES APC data linked to CPRD primary care patients.

More information about HES APC data can be found in the data resource profile below, and from a number of recent concordance and validation studies.

Publication: Herbert A, Wijlaars L, Zylbersztejn A, Cromwell D, Hardelid P. Data Resource Profile: Hospital Episode Statistics Admitted Patient Care (HES APC). International Journal of Epidemiology, Volume 46, Issue 4, August 2017, Pages 1093–1093i.

Publication: Thorn JC, Turner EL, Hounsome L, the CAP trial group, et al. Validating the use of Hospital Episode Statistics data and comparison of costing methodologies for economic evaluation: an end-of-life case study from the Cluster randomised triAl of PSA testing for Prostate cancer (CAP). BMJ Open 2016;6:e011063.

Publication: Saine, ME et al. (2019). Concordance of hospitalizations between Clinical Practice Research Datalink and linked Hospital Episode Statistics among patients treated with oral antidiabetic therapies. Pharmacoepidemiol Drug Saf. issn: 1053-8569. doi: 10.1002/pds.4853

Publication: McDonald, L, CJ Sammon, et al. (2018). Under-recording of hospital bleeding events in UK primary care: a linked Clinical Practice Research Datalink and Hospital Episode Statistics study. Clin Epidemiol 10, pp. 1155–1168. issn: 1179-1349. doi: 10.2147/clep.s170304.

Publication: Williams, R et al. (2018). Cancer recording in patients with and without type 2 diabetes in the Clinical Practice Research Datalink primary care data and linked hospital admission data: a cohort study. BMJ Open 8.5, e020827. issn: 2044-6055. doi: 10.1136/bmjopen-2017-020827.

HES Outpatient data

HES Outpatient (HES OP) data are a collection of individual records of outpatient appointments occurring in England only. The data include information on the type of outpatient consultation, appointment dates, the main specialty and treatment specialty under which the patient was treated, referral source, waiting times, clinical diagnosis and procedures performed. HES OP data can be used to support health resource utilisation studies, clarify clinical health care pathways and enable variations in the uptake of services to be evaluated, for example by gender and age.

Access to linked HES OP data is subject to prior approval.

The latest release of HES OP data (set 21) covers the period April 2003 to October 2020. 

Please click on the link below to download the documentation relating to HES Outpatient data.

Useful information can be found in the following validation study on the coverage of HES OP resource-use data in comparison to medical records from a cluster randomised trial:

Publication: Thorn JC, Turner E, Hounsome L, Walsh E , Donovan JL, Verne J, Neal DE , Hamdy FC, Martin RM, Noble SM. Validation of the Hospital Episode Statistics Outpatient Dataset in England. Pharmacoeconomics, 34 (2), 161-8, Feb 2016.

HES Accident and Emergency data

HES Accident and Emergency (HES A&E) data consist of individual records of patient care administered in the accident and emergency setting in England. These data are a subset of national A&E data collected by NHS England to monitor the national standard that 95% of patients attending A&E should wait no longer than 4 hours from arrival to admission, transfer or discharge. A&E data are submitted by A&E providers of all types in England. Data collected include details about patients’ attendance, outcomes of attendance, waiting times, referral source, A&E diagnosis, A&E treatment (drugs prescribed are not recorded), A&E investigations and Health Resource Group. HES A&E may be used to clarify the health care pathway, to quantify health resource use and costs in the emergency setting, and to assess variations in the uptake of emergency services over time.

Access to HES A&E data is subject to prior approval.

The latest release of HES A&E data (set 21) covers the period April 2007 to March 2020. 

Note: The Emergency Care Data Set (ECDS) is a new national dataset for urgent and emergency care that replaced the HES A&E dataset across England from the 2019-20 financial year. ECDS will enable more detailed analysis and enhanced understanding of emergency services, and linkage to CPRD primary care data is in progress.

Please click on the link below to download the documentation relating to HES Accident & Emergency data.

HES Diagnostic Imaging Dataset

The Diagnostic Imaging Dataset (DID) is a collection of detailed information about diagnostic imaging tests, such as x-rays and MRI scans, taken from NHS providers' radiological information systems. The DID includes information on imaging tests carried out from 1 April 2012 on NHS patients in England. It does not include the images that are produced as a result of these tests. The DID captures information about referral source and patient type, details of the test (type of test and body site), plus items about waiting times for each diagnostic imaging event, from time of test request through to time of reporting. The DID enables analysis of demographic and geographic variation in access to different test types and different providers.

The DID is routinely linked to Hospital Episode Statistics (HES) through NHS Digital. This existing HES DID dataset has now been linked to CPRD primary care data enabling users to analyse patient care pathways. Access to HES DID data is subject to prior approval.

The latest release of HES DID data (set 21) covers the period April 2012 to October 2020.  

Please click on the link below to download the documentation relating to the HES Diagnostic Imaging Dataset.

Death Registration data

Death Registration data contains data from the Office for National Statistics (ONS) and includes information on the official date and causes of death (using ICD codes).

Access to ONS Death Registration data is subject to prior approval.

The latest release of ONS Death Registration Data (set 21) covers the period 2 January 1998 to 16 November 2020. 

Please note that late registration of some deaths means that the proportion of deaths captured is lower for the last year of the coverage period, and this proportion is likely to differ by age at death and cause of death. The effect is especially pronounced for the last 1-2 weeks of available death data, which show an undercount of the total number of deaths because they do not capture deaths whose registration has been delayed (e.g. deaths referred to coroners in England, Wales and Northern Ireland, which cannot be registered until investigations have concluded, and which can be delayed by months or years).

Please click on the link below to download the documentation relating to ONS death registration data.

For more information please refer to the ONS User guide to mortality statistics, the ONS analysis exploring the impact of registration delays on mortality statistics and the associated dataset used for this report.

Further details can be found in three studies investigating the impact of the choice of data source in estimating mortality.

Publication: Gallagher, AM et al. (2019). The accuracy of date of death recording in the Clinical Practice Research Datalink GOLD database in England compared with the Office for National Statistics death registrations. Pharmacoepidemiol Drug Saf. issn: 1053-8569. doi: 10.1002/pds.4747.

Publication: Harshfield, A et al. (2018). Do GPs accurately record date of death? A UK observational analysis. BMJ Support Palliat Care. issn: 2045-435x. doi: 10.1136/bmjspcare-2018-001514.

Publication: Gallagher, AM et al. (2016). The Impact of the Choice of Data Source in Record Linkage Studies Estimating Mortality in Venous Thromboembolism. PLoS One 11.2, e0148349. issn: 1932-6203. doi: 10.1371/journal.pone.0148349.

Mental Health Dataset (MHDS)

The Mental Health Dataset (MHDS) is a collection of patient records of individuals who accessed secondary care adult mental health services and who are thought to be suffering from a mental illness. The data include information about the type and location of care received, different episodes of care received within a spell of illness and the events that occurred such as recording of Health of the Nation Outcome Scales (HoNOS) scores, Patient Health Questionnaire (PHQ-9) scores or diagnoses. MHDS data can be used to support research into resource utilisation and provide information about patient access to secondary mental health care services. This can be useful to understand patient pathways and consider associations between primary care and access to and outcomes recorded in secondary mental health care services.

Access to linked MHDS data is subject to prior approval via the RDG process.

The latest release of MHDS data (set 18) covers the period April 2007 to November 2015. Due to a number of changes in the structure and variables recorded in the MHDS the data are provided in two formats. Data collected between April 2007 and March 2011 are provided in a first format and data collected between April 2011 and November 2015 are provided in a second, slightly different, format. 

Please click on the link below to download the documentation relating to the Mental Health Dataset.

Cancer data from Public Health England (PHE)

Cancer data contain data provided by Public Health England (PHE) via the National Cancer Registration and Analysis Service (NCRAS). Linked NCRAS CPRD datasets include Cancer Registration data, the Systemic Anti-Cancer Treatment (SACT) Dataset and the National Radiotherapy Dataset (RTDS).

Access to cancer data is subject to prior approval. 

Cancer registration data

The data contain a record for each registrable tumour diagnosed or treated in England of which the NCRAS has been notified. Cancers are coded using the International Classification of Diseases for Oncology, revision 3, 2011. They are also back-mapped to the tenth revision of the International Classification of Diseases (ICD-10).

The latest release of PHE cancer registration data (set 21) covers the period January 1990 – December 2018. 

More information about the cancer registration data can be found in the data resource profile published by PHE:

Publication: Henson KE, Elliss-Brookes L, Coupland VH, Payne E, Vernon S, Rous B, Rashbass J. Data Resource Profile: National Cancer Registration Dataset in England. International Journal of Epidemiology, dyz076.

Further details can be found in three studies comparing recording of cancer across data sources.

Publication: Strongman H, Williams R, Bhaskaran K. What are the implications of using individual and combined sources of routinely collected data to identify and characterise incident site-specific cancers? a concordance and validation study using linked English electronic health records data. BMJ Open 2020; 10:e037719. doi: 10.1136/bmjopen-2020-037719

Publication: Arhi, CS, A Bottle, et al. (2018). Comparison of cancer diagnosis recording between the Clinical Practice Research Datalink, Cancer Registry and Hospital Episodes Statistics. Cancer Epidemiol 57, pp. 148–157. issn: 1877-7821. doi: 10.1016/j.canep.2018.08.009.

Publication: Margulis, AV, J Fortuny, et al. (2018a). Validation of Cancer Cases Using Primary Care, Cancer Registry, and Hospitalization Data in the United Kingdom. Epidemiology 29.2, pp. 308–313. issn: 1044-3983. doi: 10.1097/ede.0000000000000786.

Systemic Anti-Cancer Treatment (SACT) data

The SACT dataset covers chemotherapy treatment for all solid tumour and haematological malignancies, including those in clinical trials. Information is included about programme and regime of treatment, and the outcome for each treatment. In the latest linkage release (set 19) SACT data are available for patients with tumours recorded in the cancer registration data from January 2014 to December 2018. Data prior to January 2014 are also available but should be used with caution due to incomplete ascertainment during this period.

More information about the SACT data can be found in the data resource profile published by PHE:

Publication: Bright CJ, Lawton S, Benson S, Bomb M, Dodwell D, Henson KE, McPhail S, Miller L, Rashbass J, Turnbull A, Smittenaar R. Data Resource Profile: The Systemic Anti-Cancer Therapy (SACT) Dataset. International Journal of Epidemiology, dyz137.

National Radiotherapy Dataset (RTDS)

The RTDS dataset contains records of radiotherapy services provided since April 2009, including teletherapy and brachytherapy. All radiotherapy delivered in England to patients in NHS facilities, or in private facilities where delivery was funded by the NHS, is included. Brachytherapy delivered for the treatment of non-malignant disease, radiotherapy delivered using unsealed sources, and non-therapeutic exposures delivered using radiotherapy machines (e.g. imaging) are not included. In the latest linkage release (set 19) RTDS data are available for patients with tumours recorded in the cancer registration data from April 2012 to December 2018.

Source data 

The source data are provided to organisations that hold CPRD multi-study licences to enable researchers to ascertain which patients are eligible for linkage and to clarify the coverage periods for each data source. The linkage eligibility file (linkage_eligibility.txt) only includes patients from practices that have consented to take part in the linkage process. The file contains flags to indicate whether the patient is eligible for each individual linked data source. Some patients will not be eligible for any of the linked data sources, whereas others may be eligible for some/all of them. These data are provided so that multi-study licence users can determine the appropriate population to include in their study. The linkage coverage file (linkage_coverage.txt) indicates the start and end of coverage for each individual linked data source.
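As an illustration of how the eligibility flags might be used to define a study population, the sketch below filters patients who are flagged as eligible for every data source a study needs. It rests on several assumptions that are not confirmed by the text above: that the file is tab-delimited with a header row, and that the column names (patid, hes_e, death_e) and the 0/1 flag values look like the made-up sample shown. Check the documentation supplied with the source files for the real layout.

```python
import csv
import io
from typing import Iterable, List, TextIO

# Made-up sample standing in for linkage_eligibility.txt.
# Column names and 0/1 flag values are assumptions for this sketch.
SAMPLE_ELIGIBILITY = (
    "patid\thes_e\tdeath_e\n"
    "1001\t1\t1\n"
    "1002\t0\t1\n"
    "1003\t0\t0\n"
)


def eligible_patients(handle: TextIO, required_flags: Iterable[str]) -> List[str]:
    """Return patient IDs flagged as eligible for every required data source."""
    flags = list(required_flags)
    reader = csv.DictReader(handle, delimiter="\t")
    return [
        row["patid"]
        for row in reader
        if all(row[flag] == "1" for flag in flags)
    ]


# Patients eligible for both (hypothetical) hospital and death linkages.
both = eligible_patients(io.StringIO(SAMPLE_ELIGIBILITY), ["hes_e", "death_e"])

# Patients eligible for the (hypothetical) death linkage only needs one flag.
death_only = eligible_patients(io.StringIO(SAMPLE_ELIGIBILITY), ["death_e"])
```

With a real file, the `io.StringIO` sample would be replaced by an opened file handle. Requiring all flags at once reflects the point made above that some patients are eligible for only some of the linked sources.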

Access to source data for CPRD GOLD and/or CPRD Aurum is available to nominated users only; for access, please contact us at enquiries@cprd.com

If you are unsure about which linked dataset and/or source file should be used in your study, please contact us on enquiries@cprd.com

Mother-baby link

CPRD has developed a probabilistic mother-baby link algorithm, based on data recorded in the primary care medical record. This links likely mother-baby pairs within the CPRD GOLD database, based on family number plus maternity information from the mother’s primary care record, and the month of birth of newly registered babies.

Pregnancy Register

The Pregnancy Register is created by an algorithm which was developed jointly by CPRD and the London School of Hygiene and Tropical Medicine. The Pregnancy Register lists all pregnancies identified in the CPRD GOLD database and includes details of each one. A single record represents a unique pregnancy episode. There may be more than one episode per woman. For pregnancies resulting in live births, de-identified information on the linked babies in the CPRD Mother-baby link is also provided.

Publication: Minassian C, Williams R, Meeraus WH, Smeeth L, Campbell OMR, Thomas SL. Methods to generate and validate a Pregnancy Register in the UK Clinical Practice Research Datalink primary care database. Pharmacoepidemiol Drug Saf, Volume 28, Number 7, p.923-933 (2019)

[Page last reviewed 17 September 2021]