EAGER: Integration and analysis of high-dimensional datasets

Project Details

Description

In recent years, massive and complex datasets, such as data from facial recognition systems, autonomous cars, medical imaging, and single-cell biology, have been increasing dramatically. Machine learning, as part of artificial intelligence, has been used to combine and understand these massive and complex datasets. Although current mainstream machine learning algorithms have performed well, they are primarily mathematics-based and abstracted from their sources. As a result, these algorithms neither consider nor incorporate the rich knowledge of the domains from which these datasets were produced. This project therefore aims to examine whether and how domain knowledge influences the outcomes of machine learning algorithms in combining and analyzing massive and complex datasets. If successful, this project will develop and substantially validate a domain-knowledge-driven computing framework. The framework will enable scientists and engineers in various fields to apply their domain knowledge to better combine and analyze massive and complex datasets. Additional insights will also be generated to understand and improve the machine learning algorithms themselves. The findings of this project will thus promote the progress of science and can directly advance biomedical fields and human health.

Technically, this project aims to address the knowledge gap in mathematics-driven integration and analysis of high-dimensional datasets. This gap has limited the full and robust integration of large, high-dimensional datasets. Moreover, external validation is required for rigorous examination of tuned machine learning algorithms; however, a majority of studies on high-dimensional biomedical datasets did not use such validation, largely due to missing data. Therefore, this project will improve the integration and analysis of high-dimensional datasets using domain-knowledge-based data normalization, missing-data imputation, and dimensionality reduction. As a proof of principle, the project also aims to develop and validate an adaptive multimetric pipeline that integrates various types of multiomic data using novel feature-selection and dimensionality-reduction algorithms. The resulting pipeline and software package will enable researchers to better understand and classify high-dimensional datasets in biomedical and other fields. The project will represent a paradigm shift because domain-knowledge-driven data normalization, data imputation, and dimensionality reduction are radically different from the mainstream mathematics-driven approaches. Finally, this project also aims to expose undergraduate and high-school students who are interested in Computer Science to experiences in machine learning and data science.
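The pipeline stages named above (missing-data imputation, normalization, dimensionality reduction) can be illustrated with a minimal, generic mathematics-driven baseline of the kind the project contrasts with its domain-knowledge-driven approach. All function names and the toy data below are illustrative assumptions, not the project's actual methods.

```python
import numpy as np

def impute_mean(X):
    """Fill missing values (NaN) with per-feature means. This is the generic
    baseline; the project proposes domain-knowledge-driven imputation instead."""
    X = X.copy()
    col_means = np.nanmean(X, axis=0)
    rows, cols = np.where(np.isnan(X))
    X[rows, cols] = col_means[cols]
    return X

def zscore(X):
    """Per-feature standardization, a common mathematics-driven normalization."""
    return (X - X.mean(axis=0)) / X.std(axis=0)

def reduce_dim(X, k):
    """Dimensionality reduction via SVD (PCA), keeping the top-k components."""
    Xc = X - X.mean(axis=0)
    _, _, Vt = np.linalg.svd(Xc, full_matrices=False)
    return Xc @ Vt[:k].T

# Toy high-dimensional matrix with a missing entry: 6 samples x 4 features.
rng = np.random.default_rng(0)
X = rng.normal(size=(6, 4))
X[0, 1] = np.nan

Z = reduce_dim(zscore(impute_mean(X)), k=2)
print(Z.shape)  # (6, 2): samples projected onto two components
```

A domain-knowledge-driven variant would replace each stage (for example, imputing a gene's expression from co-regulated genes rather than a column mean), while keeping the same impute-normalize-reduce structure.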

This award reflects NSF's statutory mission and has been deemed worthy of support through evaluation using the Foundation's intellectual merit and broader impacts review criteria.

Status: Finished
Effective start/end date: 10/1/21 to 9/30/23

Funding

  • National Science Foundation: $207,999.00
  • National Science Foundation: $199,999.00
