Information-theoretic approaches to SVM feature selection for metagenome read classification

Elaine Garbarine, Joseph Depasquale, Vinay Gadia, Robi Polikar, Gail Rosen

Research output: Contribution to journal, Article

11 Citations (Scopus)

Abstract

Analysis of DNA sequences isolated directly from the environment, known as metagenomics, produces a large quantity of genome fragments that need to be classified into specific taxa. Most composition-based classification methods use all features rather than a subset of features that may maximize classifier accuracy. We show that feature selection methods can boost the performance of taxonomic classifiers. This work proposes three filter-based feature selection methods that stem from information theory: (1) a technique that combines Kullback-Leibler divergence, mutual information, and distance information, (2) a text mining technique, TF-IDF, and (3) minimum-redundancy maximum-relevance (mRMR). The methods are compared by how much they improve support vector machine classification of genomic reads. Overall, the 6-mer mRMR method performs well, especially at the phylum level. If the total number of features is very large, feature selection becomes difficult because a small subset of features that captures a majority of the data variance is less likely to exist. We therefore conclude that there is a trade-off between feature set size and feature selection method when optimizing classification performance. For larger feature set sizes, TF-IDF works better at finer taxonomic resolutions, while mRMR performs best of all methods for N = 6 at all taxonomic levels.
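
As a rough illustration of the pipeline the abstract describes, the sketch below builds N-mer composition features from reads and ranks them with a greedy mRMR filter. It is not the authors' implementation. The function names, the toy reads and labels, the presence/absence discretization of counts, and the "relevance minus mean redundancy" form of mRMR are all assumptions made for this example, as is the reading of N as the N-mer length.

```python
from collections import Counter
from itertools import product
from math import log2

ALPHABET = "ACGT"


def kmer_index(k):
    """Map each of the 4**k possible k-mers to a feature column."""
    return {"".join(p): i for i, p in enumerate(product(ALPHABET, repeat=k))}


def kmer_counts(read, k, index):
    """Count overlapping k-mers in a read; windows with non-ACGT symbols are skipped."""
    vec = [0] * len(index)
    for i in range(len(read) - k + 1):
        col = index.get(read[i:i + k])
        if col is not None:
            vec[col] += 1
    return vec


def mutual_information(xs, ys):
    """Empirical mutual information, in bits, between two discrete sequences."""
    n = len(xs)
    px, py, pxy = Counter(xs), Counter(ys), Counter(zip(xs, ys))
    return sum((c / n) * log2(c * n / (px[x] * py[y])) for (x, y), c in pxy.items())


def mrmr(X, y, n_select):
    """Greedy mRMR, difference form: relevance minus mean redundancy."""
    n_features = len(X[0])
    # Assumption: counts are reduced to presence/absence before estimating MI.
    cols = [[min(row[j], 1) for row in X] for j in range(n_features)]
    relevance = [mutual_information(col, y) for col in cols]
    selected = []
    while len(selected) < min(n_select, n_features):
        best, best_score = None, float("-inf")
        for j in range(n_features):
            if j in selected:
                continue
            redundancy = 0.0
            if selected:
                redundancy = sum(
                    mutual_information(cols[j], cols[s]) for s in selected
                ) / len(selected)
            score = relevance[j] - redundancy
            if score > best_score:
                best, best_score = j, score
        selected.append(best)
    return selected


# Toy data (made up for illustration): two AT-rich and two GC-rich reads.
reads = ["ATATATAT", "TATATATA", "GCGCGCGC", "CGCGCGCG"]
labels = ["A", "A", "B", "B"]
index = kmer_index(2)
X = [kmer_counts(r, 2, index) for r in reads]
selected = mrmr(X, labels, 2)
names = sorted(index, key=index.get)  # k-mers in column order
print([names[j] for j in selected])
```

In a fuller pipeline, the selected column indices would be used to subset the feature matrix before training the support vector machine. With 6-mers, `kmer_index(6)` yields 4^6 = 4096 columns, which gives a sense of how quickly the feature set grows with N.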

Original language: English (US)
Pages (from-to): 199-209
Number of pages: 11
Journal: Computational Biology and Chemistry
Volume: 35
Issue number: 3
DOI: https://doi.org/10.1016/j.compbiolchem.2011.04.007
State: Published - Jun 1 2011

Fingerprint

Metagenome
Feature Selection
Feature extraction
Redundancy
TF-IDF
Classifiers
DNA sequences
Information theory
Classifier
Information Theory
Metagenomics
Support vector machines
Subset
Data Mining
Text Mining
Genes
Mutual Information
DNA Sequence Analysis
DNA Sequence
Genomics

All Science Journal Classification (ASJC) codes

  • Computational Mathematics
  • Structural Biology
  • Biochemistry
  • Organic Chemistry

Cite this

Garbarine, Elaine ; Depasquale, Joseph ; Gadia, Vinay ; Polikar, Robi ; Rosen, Gail. / Information-theoretic approaches to SVM feature selection for metagenome read classification. In: Computational Biology and Chemistry. 2011 ; Vol. 35, No. 3. pp. 199-209.
@article{16d1554183624da5ada5d05832631b2b,
title = "Information-theoretic approaches to SVM feature selection for metagenome read classification",
author = "Elaine Garbarine and Joseph Depasquale and Vinay Gadia and Robi Polikar and Gail Rosen",
year = "2011",
month = "6",
day = "1",
doi = "https://doi.org/10.1016/j.compbiolchem.2011.04.007",
language = "English (US)",
volume = "35",
pages = "199--209",
journal = "Computational Biology and Chemistry",
issn = "1476-9271",
publisher = "Elsevier Limited",
number = "3",

}

TY - JOUR

T1 - Information-theoretic approaches to SVM feature selection for metagenome read classification

AU - Garbarine, Elaine

AU - Depasquale, Joseph

AU - Gadia, Vinay

AU - Polikar, Robi

AU - Rosen, Gail

PY - 2011/6/1

Y1 - 2011/6/1

UR - http://www.scopus.com/inward/record.url?scp=79959749487&partnerID=8YFLogxK

UR - http://www.scopus.com/inward/citedby.url?scp=79959749487&partnerID=8YFLogxK

U2 - https://doi.org/10.1016/j.compbiolchem.2011.04.007

DO - https://doi.org/10.1016/j.compbiolchem.2011.04.007

M3 - Article

VL - 35

SP - 199

EP - 209

JO - Computational Biology and Chemistry

JF - Computational Biology and Chemistry

SN - 1476-9271

IS - 3

ER -