Project Details
Description
Project Summary
While clinical trials remain a critical source for oncology research, their study findings may not be generalizable
to the real world due to the restricted patient population. In recent years, due to the increasing adoption
of electronic health records (EHR) and the linkage of EHR with specimen bio-repositories and other research
registries, integrated large datasets now exist as a new source for translational research. These integrated
datasets open opportunities for developing accurate EHR-based prediction models for disease progression
and treatment response, which can be easily incorporated into clinical practice. These models can also be
contrasted with models derived from clinical trials, bridging the gap between clinical trials and the real world.
However, efficiently deriving and evaluating personalized prediction models using such real-world data (RWD)
remains challenging due to practical and methodological obstacles. For example, validated outcome
information from EHR, such as development of colon cancer and 1-year treatment response, requires
laborious medical record review and hence is often not readily available for research. Naive use of error-prone
surrogates of the outcome, such as billing codes or procedure codes, as the true outcome may greatly hamper
the power of EHR studies and produce biased results. Semi-supervised risk prediction methods, leveraging
noisy surrogates and a small amount of human annotations on the outcome, may greatly improve the utility of
EHR for precision medicine research. Deriving a precise estimate of the risk model becomes even more
challenging when the number of candidate features is not small relative to the number of annotated outcomes.
Another major challenge with EHR risk modeling lies in transportability. Complex machine learning models
trained in one EHR system often attain low accuracy in another EHR system, due to heterogeneity in the
patient population and healthcare system. Transfer learning methods that can automatically adjust a model
developed for one EHR cohort to better fit another EHR cohort are of great value. Synthesizing information
from multiple data sources can improve the quality of evidence. However, meta-analyzing data from multiple
EHR cohorts faces an additional challenge due to patient privacy. We address these challenges by developing
semi-supervised risk prediction methods with high-dimensional predictors in Aim 1; semi-supervised transfer
learning methods to enable risk prediction modeling in target populations with no gold standard labels in
Aim 2; and distributed learning methods for high-dimensional predictive modeling in Aim 3.
Status | Active |
---|---|
Effective start/end date | 8/1/21 → 4/30/25 |
Funding
- U.S. National Library of Medicine: $337,425.00
- U.S. National Library of Medicine: $334,713.00
- U.S. National Library of Medicine: $334,713.00