Defining your study population

Learning objectives

By the end of this module, the reader will have learnt:

What coding systems are used in CPRD data?

There are multiple coding systems used in healthcare data: 

CPRD primary care databases

Medical dictionary: contains information on all medical history observations recorded by the GP.
Product dictionary: contains information on drug and appliance prescriptions recorded by the GP.

CPRD GOLD (Vision® software)
Medical dictionary: Read version 2 (v2) codes, with descriptions of all medical codes referenced in the data files as ‘medcode’
Product dictionary: Gemscript product code system (brand and generic name), with descriptions of all product codes referenced in the data files as ‘prodcode’

CPRD Aurum (EMIS Web® software)
Medical dictionary: a combination of SNOMED, Read and local EMIS® codes
Product dictionary: Dictionary of Medicines and Devices (DM+D)



CPRD linked datasets

Coding systems used in each dataset:

Hospital Episode Statistics (HES) Admitted Patient Care (APC) data
Diagnostic data are coded using the International Classification of Diseases version 10 (ICD-10) coding frame; procedure information is coded using the UK Office of Population Censuses and Surveys classification (OPCS) 4.6

HES Outpatient data
Diagnostic data are recorded using ICD-10

HES Accident and Emergency data
Most diagnostic data are recorded using an A&E-specific coding system, which can be obtained from NHS England. Some diagnostic data are recorded using ICD-10 and Read codes

HES Diagnostic Imaging Dataset
The diagnostic imaging tests conducted are recorded using SNOMED-CT codes (no diagnoses are recorded in these data)

ONS Death Registration data
Diagnostic data are recorded using ICD-10

NCRAS Cancer Registration data
Cancers are coded using the International Classification of Diseases for Oncology. They are also back-mapped to ICD-10

NCRAS Systemic Anti-Cancer Therapy Dataset (SACT)
Diagnostic data are recorded using ICD-10

NCRAS National Radiotherapy Dataset (RTDS)
Diagnostic data are recorded using ICD-10



More information about the specific coding systems used in CPRD primary care data is given in the CPRD GOLD and CPRD Aurum data specifications. The coding systems used in each linked dataset are detailed in that linkage’s documentation.

When building code lists, it is worth considering the hierarchy of each coding system and how it is structured, so that the code lists define the patient cohort effectively.

How are patient populations defined?

CPRD provides the anonymised patient-level data for patients defined and justified in a study protocol. Researchers using CPRD data need to build code lists defining the patient population(s) required for their study. This definition may use a combination of primary care codes (i.e. CPRD medical and product codes, entity types, etc.) and linked data sources (e.g. ICD-10, OPCS). To run study population definitions in the primary care data, researchers should generate code lists of CPRD medcodes (for CPRD GOLD) and CPRD medcodeids (for CPRD Aurum), which can then be run using the online tools. Read and SNOMED code lists that have not been converted to the relevant medical codes will not return relevant results.
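As an illustration of the conversion step, the sketch below maps a list of Read codes to CPRD GOLD medcodes through a dictionary lookup and reports any codes that found no match. The lookup `READ_TO_MEDCODE` and the function `to_medcodes` are hypothetical names, the three pairs are taken from the example table on this page, and the unmatched code `XXXXX00` is invented purely for the demonstration. A real conversion would use the full medical dictionary supplied by CPRD.

```python
# Hypothetical extract of a medical dictionary: Read code -> CPRD GOLD medcode.
# The pairs are copied from the example table on this page.
READ_TO_MEDCODE = {
    "J68..00": 3097,   # Gastrointestinal haemorrhage
    "J680.11": 2712,   # Vomiting of blood
    "J110111": 11124,  # Bleeding acute gastric ulcer
}


def to_medcodes(read_codes, dictionary=READ_TO_MEDCODE):
    """Return (sorted medcodes, Read codes that had no match in the dictionary)."""
    medcodes = set()
    unmatched = []
    for code in read_codes:
        key = code.strip()
        if key in dictionary:
            medcodes.add(dictionary[key])
        else:
            unmatched.append(key)
    return sorted(medcodes), unmatched


# "XXXXX00" is a made-up code used only to show how unmatched codes are reported.
medcodes, unmatched = to_medcodes(["J68..00", "J680.11", "XXXXX00"])
print(medcodes, unmatched)
```

Reporting the unmatched codes separately makes it easier to spot Read codes that would otherwise silently drop out of the study definition.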

The code lists should include all codes that would define patients with the condition of interest, rather than only the most pertinent codes. As a first step, we recommend creating a list of synonyms for the condition of interest and searching for these synonyms, and truncated versions of them, using the wildcard.
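The synonym search can be sketched outside the code browser as follows. This is a minimal example under stated assumptions: `*` is assumed as the wildcard character, `TERMS` is a small hypothetical extract of term descriptions taken from the example table on this page, and `search_terms` is an illustrative helper, not a CPRD tool.

```python
from fnmatch import fnmatchcase

# Hypothetical extract of a medical dictionary: medcode -> term description.
TERMS = {
    11718: "Painful rectal bleeding",
    3097: "Gastrointestinal haemorrhage",
    2712: "Vomiting of blood",
    3869: "Psychogenic dyspepsia",
    11124: "Bleeding acute gastric ulcer",
    48951: "Chronic duodenal ulcer with haemorrhage",
}

# Synonyms and truncated versions for "bleeding"; "*" is the assumed wildcard.
PATTERNS = ["*bleed*", "*haemorrhag*", "*vomiting of blood*"]


def search_terms(terms, patterns):
    """Return every medcode whose description matches at least one pattern."""
    hits = {}
    for medcode, term in terms.items():
        lowered = term.lower()  # compare in lower case so the search ignores case
        if any(fnmatchcase(lowered, pattern.lower()) for pattern in patterns):
            hits[medcode] = term
    return hits


candidates = search_terms(TERMS, PATTERNS)
for medcode, term in candidates.items():
    print(medcode, term)
```

A search like this is deliberately broad. It returns ‘Painful rectal bleeding’ alongside the upper GI terms, so the results are a candidate list for manual review, not a finished code list.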

In CPRD GOLD, some patient test records are recorded as entity types. The full entity file is provided as part of the Lookup files upon request. For a thorough search of test records in CPRD GOLD, we recommend using both medical code lists and entity types. In CPRD Aurum, all tests are recorded using medical codes (there are no entity types).

Researchers should consider how GPs enter patient data when planning their cohort definition. 

Which codes should you use?

While CPRD can advise on the best approach for creating code lists, the study researchers will need to decide which codes are relevant to their study. CPRD recommends that researchers liaise with UK clinicians to understand how their patients of interest are treated and managed, and how their data are recorded within the UK healthcare system. This will help researchers to find the data they require and to define their study population.

Imagine doing a study to look at whether a particular drug causes upper gastrointestinal (GI) bleeding. Which of the CPRD GOLD medical codes in the table below would you include in your study?


Medcode  Read code  Read term
         1955.11    Heartburn symptom
11718    196B.00    Painful rectal bleeding
3097     J68..00    Gastrointestinal haemorrhage
2712     J680.11    Vomiting of blood
3869     E264400    Psychogenic dyspepsia
11124    J110111    Bleeding acute gastric ulcer
48951    J121100    Chronic duodenal ulcer with haemorrhage
37299    4737.11    Melaena – O/E of faeces
36583    J111111    Bleeding chronic gastric ulcer



Consider the range and variety of codes used by GPs when defining a specific diagnosis or treatment for your study. Deciding what individual codes represent is not always clear cut. Some codes, such as ‘Bleeding acute gastric ulcer’, clearly relate to upper GI bleeding. In contrast, ‘Painful rectal bleeding’ is not an upper GI issue and can be ruled out immediately. Several of the codes are imprecise. For example, ‘Gastrointestinal haemorrhage’ does not specify which part of the GI tract is affected, so researchers must make a judgement about whether to include such codes.
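One way to keep these judgements transparent is to record a decision and a reason against each candidate code. The sketch below is illustrative only: `CodeDecision` and `codes_with` are hypothetical names, and the three decisions simply restate the examples discussed above. The remaining codes in the table are left out because their inclusion is a matter for the study team and their clinical advisers.

```python
from dataclasses import dataclass


@dataclass
class CodeDecision:
    medcode: int
    term: str
    decision: str  # "include", "exclude" or "review"
    reason: str


# Decisions restating the three examples discussed in the text above.
decisions = [
    CodeDecision(11124, "Bleeding acute gastric ulcer", "include",
                 "clearly relates to upper GI bleeding"),
    CodeDecision(11718, "Painful rectal bleeding", "exclude",
                 "not an upper GI issue"),
    CodeDecision(3097, "Gastrointestinal haemorrhage", "review",
                 "part of the GI tract not specified"),
]


def codes_with(decisions, decision):
    """Return the medcodes that carry the given decision label."""
    return [d.medcode for d in decisions if d.decision == decision]


final_code_list = codes_with(decisions, "include")
needs_clinical_input = codes_with(decisions, "review")
print(final_code_list, needs_clinical_input)
```

Keeping the excluded and undecided codes in the same record, with reasons, gives a documented trail that can be discussed with clinicians and reported alongside the study.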

Developing code lists is a crucial part of most studies but is not always straightforward and needs careful consideration and discussion with experts. 

How to find the codes used in primary care data

CPRD has developed a code browser tool specifically for the medical codes and product codes used in CPRD GOLD and CPRD Aurum primary care data. Researchers should explore the code browser tool to see whether their conditions or treatments of interest have codes that GPs record in their patients’ records. The code dictionaries are updated with each CPRD primary care database release to include any new codes introduced by GP software systems. These can be provided upon request: please contact CPRD so that download credentials can be set up for you (credentials expire within 7 days).

There are separate code browser tools for CPRD GOLD and CPRD Aurum because the data are coded differently at source in the GP software systems underlying these databases, and each database uses its own dictionaries. Researchers using both CPRD GOLD and CPRD Aurum will need to create and use separate code lists for each database.
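A simple way to keep the two sets of code lists apart is to tag each list with its database and derive the identifier column name from that tag. The sketch below is a hypothetical structure, not a CPRD file format. The column names `medcode` and `medcodeid` follow this guidance, storing identifiers as strings is a design assumption made here so that long numeric identifiers are not altered by spreadsheet or numeric handling, and the Aurum identifier shown is a made-up placeholder.

```python
from dataclasses import dataclass, field

# Identifier column for each database, as named in this guidance.
ID_FIELD = {"CPRD GOLD": "medcode", "CPRD Aurum": "medcodeid"}


@dataclass
class CodeList:
    database: str
    codes: list = field(default_factory=list)

    def __post_init__(self):
        if self.database not in ID_FIELD:
            raise ValueError(f"Unknown database: {self.database}")
        # Assumption: keep identifiers as strings to avoid numeric conversion.
        self.codes = [str(code) for code in self.codes]

    @property
    def id_field(self):
        return ID_FIELD[self.database]

    def to_rows(self):
        """Return one row per code, keyed by the database's identifier column."""
        return [{self.id_field: code} for code in self.codes]


gold = CodeList("CPRD GOLD", [3097, 2712])
# Placeholder value for illustration only, not a real medcodeid.
aurum = CodeList("CPRD Aurum", ["100000000000001"])
print(gold.to_rows(), aurum.to_rows())
```

Because the column name comes from the database tag, a GOLD list cannot be written out under the Aurum identifier by mistake, and an unrecognised database name raises an error immediately.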

For tips on using the code browser tool, please see the Code Browser quick user guide.  

Download: CPRD Code Browser quick user guide (PDF, 960KB, 18 pages)

How to define patients in linked data

CPRD does not routinely provide users with access to the coding frames used in linked data, as we currently have access under licence from NHS England Business Support Authority. Users are therefore advised to seek access to these codes via licence with NHS England or to search the internet for free resources. A free version of the ICD-10 dictionary can be found on the World Health Organization website. The ICD-10 and OPCS code dictionaries can also be downloaded from NHS England Technology Reference Update Distribution (TRUD).
