Semi-supervised Approaches to Denoising Electronic Health Records Data for Risk Prediction

Project Details

Description

Project Summary

While clinical trials remain a critical source for oncology research, their findings may not generalize to the real world because of the restricted patient populations they enroll. In recent years, the increasing adoption of electronic health records (EHR) and the linkage of EHR with specimen bio-repositories and other research registries have produced large integrated datasets that serve as a new source for translational research. These integrated datasets open opportunities for developing accurate EHR-based prediction models for disease progression and treatment response, which can be easily incorporated into clinical practice. These models can also be contrasted with models derived from clinical trials, bridging the gap between clinical trials and the real world. However, efficiently deriving and evaluating personalized prediction models using such real-world data (RWD) remains challenging due to practical and methodological obstacles. For example, validated outcome information from EHR, such as development of colon cancer or 1-year treatment response, requires laborious medical record review and hence is often not readily available for research. Naively treating error-prone surrogates of the outcome, such as billing codes or procedure codes, as the true outcome may greatly reduce the power of EHR studies and produce biased results. Semi-supervised risk prediction methods, which leverage noisy surrogates together with a small number of human annotations of the outcome, may greatly improve the utility of EHR for precision medicine research. Deriving a precise estimate of the risk model becomes even more challenging when the number of candidate features is not small relative to the number of annotated outcomes. Another major challenge in EHR risk modeling is transportability. Complex machine learning models trained in one EHR system often attain low accuracy in another EHR system because of heterogeneity in patient populations and healthcare systems.
Transfer learning methods that can automatically adjust a model developed for one EHR cohort to better fit another EHR cohort are of great value. Synthesizing information from multiple data sources can improve the quality of evidence. However, meta-analyzing data from multiple EHR cohorts faces an additional challenge due to patient privacy. We address these challenges by developing semi-supervised risk prediction methods with high-dimensional predictors in Aim 1; semi-supervised transfer learning methods to enable risk prediction modeling in target populations with no gold-standard labels in Aim 2; and distributed learning methods for high-dimensional predictive modeling in Aim 3.
Status: Active
Effective start/end date: 8/1/21 to 4/30/25

Funding

  • U.S. National Library of Medicine: $337,425.00
  • U.S. National Library of Medicine: $334,713.00
  • U.S. National Library of Medicine: $334,713.00
