2026-27 Project (Matthewman & Langan)

Facilitating complex phenotyping for electronic health records research using large language models

SUPERVISORY TEAM

Supervisor

Dr Julian Matthewman at LSHTM
Faculty of Epidemiology & Population Health, Department of Non-communicable Disease Epidemiology
Email: julian.matthewman@lshtm.ac.uk

Co-Supervisor

Professor Sinéad Langan at LSHTM
Faculty of Epidemiology & Population Health, Department of Non-communicable Disease Epidemiology
Email: sinead.langan@lshtm.ac.uk

PROJECT SUMMARY

Project Summary

This project will explore the use of Large Language Models (LLMs) to improve how we identify complex diseases in large Electronic Health Record (EHR) databases. Identifying patient groups accurately, a process called phenotyping, is crucial for research but is often challenging for complex conditions using standard methods. You will develop and validate a new framework using LLMs to create more precise and nuanced disease definitions. The project involves a systematic literature review, framework development, and a case study creating a “phenotype atlas” using the UK’s CPRD Aurum database. This work will produce a valuable new methodology and resource for the health data science community, demonstrating the impact of advanced AI on epidemiological research.

Project Key Words

Phenotyping, Large Language Models, Electronic Health Records

MRC LID Themes

Health Data Science
Translational and Implementation Research
Global Health

Skills

MRC Core Skills

Quantitative skills
Interdisciplinary skills

Skills we expect a student to develop/acquire whilst pursuing this project:

Large Language Model (LLM) Evaluation
Electronic Health Record (EHR) Analysis
Epidemiological Methods
Quantitative Data Analysis in R or Python
Natural Language Processing (NLP)

Routes

Which route/s are available with this project?

1+4 = No
+4 = Yes

Possible Master’s programme options identified by supervisory team for 1+4 applicants:

Not applicable

Full-time/Part-time Study

Is this project available for full-time study? Yes
Is this project available for part-time study? Yes

Location & Travel

Students funded through MRC LID are expected to work on site at their primary institution. At a minimum, all students must meet the institutional research degree regulations and expectations about onsite working, and under this scheme they may be expected to work onsite (in person) more frequently. Students may also be required to travel for conferences (up to 3 over the duration of the studentship) and for any training required for research degree study. Other travel expectations and opportunities highlighted by the supervisory team are noted below.

Day-to-day work (primary location) for the duration of this research degree project will be at: LSHTM – Bloomsbury, London

Travel requirements for this project: None

Eligibility/Requirements

Particular prior educational requirements for a student undertaking this project

Minimum standard institutional eligibility criteria for doctoral study at LSHTM
Both a strong quantitative background and a health-related background are required. This may be demonstrated by a Master’s degree in a field such as epidemiology, health data science, medical statistics, or a related discipline
Proficiency in a programming language (e.g., Python or R)

Other useful information

Potential Industrial CASE (iCASE) conversion? = No

PROJECT IN MORE DETAIL

Scientific description of this research project

Background

The use of routinely collected Electronic Health Records (EHR) enables large-scale medical research. A key step in using EHR is “phenotyping”, the process of identifying patients with a specific disease or characteristic. While standardised medical terminologies like SNOMED CT and ICD-10 provide a structure for phenotyping, they are often insufficient for complex research questions regarding, for example, disease subtypes or levels of diagnostic certainty. Consequently, creating specific phenotypes requires manual input from subject-matter experts to interpret combinations of records. Large Language Models (LLMs) can process large amounts of complex information, including clinical knowledge. This suggests they could augment the phenotyping process, particularly for the knowledge-intensive aspects that are currently time-consuming. By using their embedded clinical knowledge, LLMs could assist researchers in defining and validating phenotype algorithms more efficiently. However, the application of LLMs for this purpose is not well-established, and a systematic framework for their use in EHR phenotyping is needed.
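To make the idea of "interpreting combinations of records" concrete, the following is a minimal illustrative sketch in Python, not the project's method. It contrasts a simple code-list phenotype with a rule that also requires repeated treatment records. All codes, record types, and the threshold of two prescriptions are hypothetical placeholders chosen for illustration; they are not real SNOMED CT or product codes, and the toy records do not reflect the CPRD Aurum schema.

```python
from collections import defaultdict

# Hypothetical placeholder codes (assumptions for illustration only).
DIAGNOSIS_CODES = {"DX_A", "DX_B"}
TREATMENT_CODES = {"RX_A"}

# Toy records as (patient_id, record_type, code); invented data.
records = [
    (1, "diagnosis", "DX_A"),
    (1, "prescription", "RX_A"),
    (1, "prescription", "RX_A"),
    (2, "diagnosis", "DX_B"),
    (3, "prescription", "RX_A"),
    (3, "prescription", "RX_A"),
]


def simple_phenotype(records):
    """Patients with at least one diagnosis code from the code list."""
    return {
        pid
        for pid, rtype, code in records
        if rtype == "diagnosis" and code in DIAGNOSIS_CODES
    }


def complex_phenotype(records, min_prescriptions=2):
    """Patients with a diagnosis code AND enough treatment records."""
    rx_counts = defaultdict(int)
    for pid, rtype, code in records:
        if rtype == "prescription" and code in TREATMENT_CODES:
            rx_counts[pid] += 1
    return {
        pid
        for pid in simple_phenotype(records)
        if rx_counts[pid] >= min_prescriptions
    }


simple_cases = simple_phenotype(records)
complex_cases = complex_phenotype(records)
print("Simple phenotype:", sorted(simple_cases))
print("Complex phenotype:", sorted(complex_cases))
```

In this toy example the stricter rule keeps only the patient who has both a diagnosis and repeated treatment, which is the kind of expert-specified combination that the project proposes LLMs might help to define.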

Objectives

This PhD project aims to develop and validate a framework for using LLMs to facilitate complex disease phenotyping in EHR research. The project has four objectives:
To review and keep up to date with the current literature on the use of LLMs for health data classification and phenotyping, e.g., through a living scoping review.
To develop a transparent and reproducible framework for LLM-assisted phenotyping that integrates clinical expertise with model outputs.
To apply and validate this framework by creating an “Atlas of Complex Disease Phenotypes” for a range of conditions within CPRD Aurum as a case study, likely in a disease area within the candidate’s or supervisors’ expertise (the supervisors have particular expertise in skin and inflammatory disease).
To compare key epidemiological metrics (e.g., incidence, prevalence) derived using the new phenotypes against those from standard, readily available disease definitions.
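
As a rough illustration of the comparison in the final objective, the sketch below computes point prevalence and an incidence rate per person-year for two phenotype definitions applied to the same toy cohort. It is a simplified example under stated assumptions, not the project's analysis plan: the `Patient` structure, the function names, the dates, and the 365.25-day year are all hypothetical choices made for illustration, and real analyses would need to handle registration periods, censoring, and data quality in far more detail.

```python
from dataclasses import dataclass
from datetime import date
from typing import List, Optional


@dataclass
class Patient:
    start: date                # start of follow-up
    end: date                  # end of follow-up
    first_dx: Optional[date]   # first date the phenotype is met, None if never


def point_prevalence(patients: List[Patient], index_date: date) -> float:
    """Proportion of patients under follow-up on index_date who meet the phenotype."""
    in_follow_up = [p for p in patients if p.start <= index_date <= p.end]
    if not in_follow_up:
        return float("nan")
    cases = [
        p for p in in_follow_up
        if p.first_dx is not None and p.first_dx <= index_date
    ]
    return len(cases) / len(in_follow_up)


def incidence_rate(patients: List[Patient]) -> float:
    """New cases per person-year among patients not already cases at start."""
    events = 0
    person_days = 0
    for p in patients:
        if p.first_dx is not None and p.first_dx <= p.start:
            continue  # prevalent at start of follow-up, so not at risk
        if p.first_dx is not None and p.first_dx <= p.end:
            events += 1
            person_days += (p.first_dx - p.start).days
        else:
            person_days += (p.end - p.start).days
    if person_days == 0:
        return float("nan")
    return events / (person_days / 365.25)


START, END = date(2020, 1, 1), date(2022, 1, 1)

# Same four toy patients under a standard and a stricter (refined) definition.
standard = [
    Patient(START, END, date(2021, 1, 1)),
    Patient(START, END, None),
    Patient(START, END, date(2019, 6, 1)),
    Patient(START, END, None),
]
refined = [
    Patient(START, END, None),  # no longer meets the stricter definition
    Patient(START, END, None),
    Patient(START, END, date(2019, 6, 1)),
    Patient(START, END, None),
]

index_date = date(2021, 6, 30)
for name, cohort in [("standard", standard), ("refined", refined)]:
    print(name, point_prevalence(cohort, index_date), incidence_rate(cohort))
```

Even in this toy cohort, tightening the definition changes both measures, which is why comparing estimates across definitions is a useful check on a new phenotype.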

Techniques

Large Language Model (LLM) Evaluation: Prompt engineering, fine-tuning, and assessing LLM performance for clinical tasks.
Electronic Health Record (EHR) Analysis: Managing, processing, and analysing longitudinal patient data.
Epidemiological Methods: Calculating and interpreting measures of disease frequency (incidence, prevalence) and association.
Quantitative Data Analysis: Statistical programming in R or Python, data visualisation, and reproducible research practices.
Natural Language Processing (NLP): Techniques for processing clinical text.

Data

CPRD Aurum, which contains UK primary care data that can be linked to several other data sources (e.g., hospital data), captures diagnoses, symptoms, prescriptions, referrals and tests.

Risks

The risk of data access issues with CPRD Aurum is low, as an institutional licence is in place.

Additional information from the supervisory team

The supervisory team has provided a recording for prospective applicants who are interested in their project. This recording should be watched before any discussions begin with the supervisory team.

Matthewman & Langan Recording

MRC LID LINKS

To apply for a studentship: MRC LID How to Apply
Full list of available projects: MRC LID Projects
For more information about the DTP: MRC LID About Us