2026-27 Project (Matthewman & Langan)
Facilitating complex phenotyping for electronic health records research using large language models
SUPERVISORY TEAM
Supervisor
Dr Julian Matthewman at LSHTM
Faculty of Epidemiology & Population Health, Department of Non-communicable Disease Epidemiology
Email: julian.matthewman@lshtm.ac.uk
Co-Supervisor
Professor Sinéad Langan at LSHTM
Faculty of Epidemiology & Population Health, Department of Non-communicable Disease Epidemiology
Email: sinead.langan@lshtm.ac.uk
PROJECT SUMMARY
Project Summary
This project will explore the use of Large Language Models (LLMs) to improve how we identify complex diseases in large Electronic Health Record (EHR) databases. Identifying patient groups accurately, a process called phenotyping, is crucial for research but is often challenging for complex conditions using standard methods. You will develop and validate a new framework using LLMs to create more precise and nuanced disease definitions. The project involves a systematic literature review, framework development, and a case study creating a “phenotype atlas” using the UK’s CPRD Aurum database. This work will produce a valuable new methodology and resource for the health data science community, demonstrating the impact of advanced AI on epidemiological research.
Project Key Words
Phenotyping, Large Language Models, Electronic Health Records
MRC LID Themes
- Health Data Science
- Translational and Implementation Research
- Global Health
Skills
MRC Core Skills
- Quantitative skills
- Interdisciplinary skills
Skills we expect a student to develop/acquire whilst pursuing this project:
- Large Language Model (LLM) Evaluation
- Electronic Health Record (EHR) Analysis
- Epidemiological Methods
- Quantitative Data Analysis in R or Python
- Natural Language Processing (NLP)
Routes
Which route/s are available with this project?
- 1+4 = No
- +4 = Yes
Possible Master’s programme options identified by supervisory team for 1+4 applicants:
- Not applicable
Full-time/Part-time Study
Is this project available for full-time study? Yes
Is this project available for part-time study? Yes
Location & Travel
Students funded through MRC LID are expected to work on site at their primary institution. At a minimum, all students must meet the institutional research degree regulations and expectations about onsite working, and under this scheme they may be expected to work onsite (in person) more frequently. Students may also be required to travel for conferences (up to 3 over the duration of the studentship) and for any training required for research degree study. Other travel expectations and opportunities highlighted by the supervisory team are noted below.
Day-to-day work (primary location) for the duration of this research degree project will be at: LSHTM – Bloomsbury, London
Travel requirements for this project: None
Eligibility/Requirements
Particular prior educational requirements for a student undertaking this project
- Minimum standard institutional eligibility criteria for doctoral study at LSHTM
- Both a strong quantitative background and a health-related background are required. This may be demonstrated by a Master’s degree in a field such as epidemiology, health data science, medical statistics, or a related discipline
- Proficiency in a programming language (e.g., Python or R)
Other useful information
- Potential Industrial CASE (iCASE) conversion? = No
PROJECT IN MORE DETAIL
Scientific description of this research project
Background
The use of routinely collected Electronic Health Records (EHRs) enables large-scale medical research. A key step in using EHRs is “phenotyping”, the process of identifying patients with a specific disease or characteristic. While standardised medical terminologies like SNOMED CT and ICD-10 provide a structure for phenotyping, they are often insufficient for complex research questions regarding, for example, disease subtypes or levels of diagnostic certainty. Consequently, creating specific phenotypes requires manual input from subject-matter experts to interpret combinations of records.
Large Language Models (LLMs) can process large amounts of complex information, including clinical knowledge. This suggests they could augment the phenotyping process, particularly for the knowledge-intensive aspects that are currently time-consuming. By using their embedded clinical knowledge, LLMs could assist researchers in defining and validating phenotype algorithms more efficiently. However, the application of LLMs for this purpose is not well established, and a systematic framework for their use in EHR phenotyping is needed.
Objectives
This PhD project aims to develop and validate a framework for using LLMs to facilitate complex disease phenotyping in EHR research. The project has four objectives:
- To review and keep up to date with the current literature on the use of LLMs for health data classification and phenotyping, e.g., through a living scoping review.
- To develop a transparent and reproducible framework for LLM-assisted phenotyping that integrates clinical expertise with model outputs.
- To apply and validate this framework by creating an “Atlas of Complex Disease Phenotypes” for a range of conditions within CPRD Aurum as a case study, likely in a disease area matching the candidate’s or supervisors’ expertise (the supervisors have particular expertise in skin and inflammatory disease).
- To compare key epidemiological metrics (e.g., incidence, prevalence) derived using the new phenotypes against those from standard, readily available disease definitions.
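As an illustration of the fourth objective, the comparison could be sketched as below. This is a minimal sketch under stated assumptions, not the project's method: the `PhenotypeCounts` class, the function names, and all counts are hypothetical and made up for illustration. It uses the usual definitions of an incidence rate (new cases divided by person-time at risk) and point prevalence (existing cases divided by the population on a given date).

```python
from dataclasses import dataclass


@dataclass
class PhenotypeCounts:
    """Hypothetical summary counts for one phenotype definition."""
    name: str
    new_cases: int        # incident cases identified during follow-up
    person_years: float   # total person-time at risk
    existing_cases: int   # people with the condition on the index date
    population: int       # people registered on the index date


def incidence_rate(counts: PhenotypeCounts, per: float = 1000.0) -> float:
    """Incident cases per `per` person-years at risk."""
    return counts.new_cases / counts.person_years * per


def point_prevalence(counts: PhenotypeCounts) -> float:
    """Proportion of the registered population with the condition."""
    return counts.existing_cases / counts.population


def compare(standard: PhenotypeCounts, refined: PhenotypeCounts) -> dict:
    """Ratios of the refined definition's metrics to the standard one's."""
    return {
        "incidence_rate_ratio": incidence_rate(refined) / incidence_rate(standard),
        "prevalence_ratio": point_prevalence(refined) / point_prevalence(standard),
    }


# Made-up counts for illustration only; they are not CPRD Aurum results.
standard = PhenotypeCounts("standard code list", 500, 100_000.0, 2_000, 50_000)
refined = PhenotypeCounts("LLM-assisted phenotype", 400, 100_000.0, 1_500, 50_000)

result = compare(standard, refined)
for metric, value in result.items():
    print(f"{metric}: {value:.2f}")
```

A ratio below 1 would indicate that the refined definition identifies fewer cases than the standard one, as might be expected if it applies stricter criteria for diagnostic certainty. A real analysis would also report confidence intervals and stratify by factors such as age, sex and calendar period.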
Techniques
- Large Language Model (LLM) Evaluation: Prompt engineering, fine-tuning, and assessing LLM performance for clinical tasks.
- Electronic Health Record (EHR) Analysis: Managing, processing, and analysing longitudinal patient data.
- Epidemiological Methods: Calculating and interpreting measures of disease frequency (incidence, prevalence) and association.
- Quantitative Data Analysis: Statistical programming in R or Python, data visualisation, and reproducible research practices.
- Natural Language Processing (NLP): Techniques for processing clinical text.
Data
CPRD Aurum, which contains UK primary care data that can be linked to several other data sources (e.g., hospital data), captures diagnoses, symptoms, prescriptions, referrals and tests.
Risks
The risk of data access issues with CPRD Aurum is low as there is an institutional license in place.
Further reading
Relevant preprints and/or open access articles:
(DOI = Digital Object Identifier)
Other pre-application materials: None
Additional information from the supervisory team
The supervisory team has provided a recording for prospective applicants who are interested in their project. This recording should be watched before any discussions begin with the supervisory team.
MRC LID LINKS
To apply for a studentship: MRC LID How to Apply
Full list of available projects: MRC LID Projects
For more information about the DTP: MRC LID About Us

