Many methods for high-dimensional data analysis begin with the assumption that the parameter of interest is, in some sense, sparse. Furthermore, the performance of many of these methods depends on the sparsity of the underlying parameters. However, statistical methods for checking sparsity assumptions, and for determining the implications when sparsity is absent or nearly absent, are lacking. The driving goal of this project is to develop practical statistical tools for identifying situations where the relevant parameters are in fact sparse, or where sparse methods for high-dimensional data analysis may be applied effectively. Problems considered in this project will primarily be studied within the context of the linear model and the Gaussian location model. Methods will be assessed by decision-theoretic criteria (e.g., asymptotic minimaxity). A null model based on dense (non-sparse) signals and dense estimation and prediction methods will be developed and thoroughly studied. This will provide a rich framework for sparsity testing, where the aim is to identify settings in which sparse methods are likely to be successful. Specific sparsity testing procedures will be proposed and analyzed.

High-dimensional data analysis is one of the most active areas of current statistical research. Much of this research has been driven by technological advances that have enabled researchers to collect vast datasets with relative ease in a variety of scientific disciplines, including astrophysics, geological sciences, molecular biology, and genomics. In high-dimensional datasets, many features are measured for each unit of observation (e.g., thousands of gene expression levels may be measured for each individual in a genomic study). Sparsity plays a major role in much of the research on high-dimensional data analysis. Broadly speaking, sparsity measures the degree to which a specified outcome may be described by relatively few features. Sparse methods for high-dimensional data analysis attempt to leverage sparsity in the underlying dataset and have proven to be very effective in many applications, especially in engineering and signal processing. On the other hand, the performance of sparse methods has been more mixed in other important applications where high-dimensional data are abundant, such as genomics. In this project, the investigator will develop statistical methods for characterizing and identifying situations where sparse methods can be successfully applied. This will be achieved by developing tools for determining the level of sparsity in high-dimensional datasets. These methods, when applied to a given dataset, will help researchers determine the validity of subsequent statistical analyses and the potential benefits of using sparse methods for these analyses. This research is likely to have significant implications for understanding reproducibility in high-dimensional data analysis and broad applications in the analysis of genomic data. The methods developed during the course of this project will be used in collaborative work with highly experienced researchers in genomics.
Effective start/end date: 8/1/12 → 7/31/15
- National Science Foundation (NSF)