Diagnoses in general practice electronic health records are recorded using standardised codes from a terminology. However, there are often many possible codes for a single diagnosis. When using electronic health records for clinical research, researchers have to search for many individual diagnosis codes in order to classify patients correctly by their diagnosis.
A newer coding system, SNOMED CT, may make this process easier. SNOMED CT is increasingly becoming the standard system for recording diagnoses in the NHS. It includes detailed information on how each term relates to other terms, such as whether one diagnosis is a subtype of another. This means that it is possible to find all the terms for a condition such as diabetes with a simple expression ('all subtypes of diabetes') rather than listing each term individually. However, the exact set of SNOMED CT terms identified by this method may not be the same as the set that researchers and clinicians would choose if they selected terms manually.
In this study we will investigate whether the SNOMED CT method yields a set of terms similar to that produced by a manual search and, when the terms are used to classify patient records in the database, whether the numbers and characteristics of patients with a particular diagnosis differ between the methods. We will initially test this technique on a random sample of patients for diagnoses of diabetes, asthma and heart failure and, if it is successful, expand the technique to other diseases. This will facilitate future studies using these databases.
Research studies using electronic health record databases need to identify patients with particular diagnoses using definitions based on coded entries (such as Read codes or ICD-10 codes). The lists of codes that define a diagnosis of interest have typically been created by keyword searching (Read) or by traversing a hierarchy (ICD-10). However, diagnoses are increasingly encoded using SNOMED CT, which incorporates an ontology that encodes relationships between terms. In principle, the ontology could be used to create a set of terms for a concept based on a single term or SNOMED CT expression, but this has not previously been tested against other methods of creating sets of SNOMED CT terms of interest.
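As an illustration of how the ontology could be used in this way, the following minimal Python sketch collects a concept together with all of its subtypes by following 'is a' relationships. It is not the method to be developed in this study. The relationship pairs, the concept identifiers (readable placeholders rather than real SNOMED CT concept IDs) and the function name descendants_or_self are assumptions made for the example.

```python
from collections import defaultdict, deque

# Toy "is a" relationships as (child, parent) pairs. The identifiers are
# made-up placeholders for illustration, not real SNOMED CT concept IDs.
IS_A = [
    ("type_1_diabetes", "diabetes"),
    ("type_2_diabetes", "diabetes"),
    ("type_2_diabetes_with_nephropathy", "type_2_diabetes"),
    # A concept may have more than one parent.
    ("type_2_diabetes_with_nephropathy", "diabetic_nephropathy"),
    ("diabetic_nephropathy", "kidney_disease"),
    ("asthma", "respiratory_disease"),
]


def descendants_or_self(concept, is_a_pairs):
    """Return the concept plus every concept that is a direct or indirect subtype of it."""
    # Index the relationships from parent to children for downward traversal.
    children = defaultdict(set)
    for child, parent in is_a_pairs:
        children[parent].add(child)

    # Breadth-first walk down the hierarchy; the 'found' set prevents a
    # concept with several parents from being visited more than once.
    found = {concept}
    queue = deque([concept])
    while queue:
        current = queue.popleft()
        for child in children[current]:
            if child not in found:
                found.add(child)
                queue.append(child)
    return found


diabetes_terms = descendants_or_self("diabetes", IS_A)
print(sorted(diabetes_terms))
```

In this toy hierarchy the placeholder 'diabetic_nephropathy' is not returned for 'diabetes', because it is only a parent of one diabetes subtype and not itself a subtype. Differences of this kind between an ontology-derived set and a manually chosen set are what the comparison is intended to reveal.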
In this study we will develop a method to interpret SNOMED CT expressions and compare the resulting set of SNOMED CT concepts with those derived from keyword searching on the SNOMED CT descriptions, or from previously published phenotype definitions on the CALIBER portal (https://caliberresearch.org/portal/codelists), mapped from Read V2 to SNOMED CT using the NHS mappings. We will compare the SNOMED CT terms selected by each method and use them to generate cohorts of patients from a random sample of CPRD Aurum. We will then compare the number of patients, date of diagnosis, age distribution and sex distribution of the cohorts generated by each method. We will initially study a number of common diseases with existing published phenotype definitions, such as diabetes, asthma and heart failure, and then expand to other phenotypes used for CALIBER studies (e.g. https://github.com/spiros/chronological-map-phenotypes).
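The comparison described above can be sketched in Python as follows. This is a hedged illustration under stated assumptions and not the study's analysis code: the term sets, the patient records (tuples of patient identifier, placeholder term and ISO-format date) and the function names compare_codelists and cohort are all invented for the example, and the Jaccard index is one possible summary of overlap that is not specified in this protocol.

```python
def compare_codelists(expression_terms, manual_terms):
    """Summarise the overlap between two sets of terms."""
    a, b = set(expression_terms), set(manual_terms)
    union = a | b
    return {
        "both": a & b,
        "expression_only": a - b,
        "manual_only": b - a,
        # Jaccard index: shared terms divided by all terms (1.0 if both sets are empty).
        "jaccard": len(a & b) / len(union) if union else 1.0,
    }


def cohort(records, terms):
    """Map each patient with a matching record to their earliest matching date."""
    first = {}
    for patient_id, term, date in records:
        # ISO-format date strings sort chronologically, so '<' is safe here.
        if term in terms and (patient_id not in first or date < first[patient_id]):
            first[patient_id] = date
    return first


# Hypothetical term sets using placeholder identifiers, not real SNOMED CT IDs.
expression_terms = {"diabetes", "type_1_diabetes", "type_2_diabetes"}
manual_terms = {"diabetes", "type_2_diabetes", "diabetic_retinopathy"}

# Hypothetical patient records: (patient_id, term, date of entry).
records = [
    (1, "type_2_diabetes", "2015-03-01"),
    (1, "diabetes", "2012-06-10"),
    (2, "type_1_diabetes", "2009-01-20"),
    (3, "diabetic_retinopathy", "2018-11-05"),
    (4, "asthma", "2001-02-02"),
]

overlap = compare_codelists(expression_terms, manual_terms)
expression_cohort = cohort(records, expression_terms)
manual_cohort = cohort(records, manual_terms)

print(overlap["jaccard"], len(expression_cohort), len(manual_cohort))
```

Keeping the earliest matching date for each patient allows the date of diagnosis to be compared between cohorts as well as the patient counts. Age and sex distributions would need additional patient-level fields that are not modelled in this sketch.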
We will publish the methods and code used to interpret SNOMED CT expressions. If our approach is successful, it could be used in future research projects to assist in phenotype definition. This could make future research studies quicker and more reproducible, leading to patient benefit from higher-quality research.
Health Outcomes to be Measured:
Comparison of phenotype definitions (sets of SNOMED CT terms) derived using different methods. When the definitions are used to identify patient cohorts, we will compare the cohorts in terms of the number of patients, age at diagnosis, sex distribution, medications, laboratory results and comorbidities.