Generalized similarity kernels for efficient sequence classification

Pavel P. Kuksa, Imdadullah Khan, Vladimir Pavlovic

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution

6 Scopus citations

Abstract

String kernel-based machine learning methods have yielded great success in practical tasks of structured/sequential data analysis. They often exhibit state-of-the-art performance on tasks such as document topic elucidation, music genre classification, and protein superfamily and fold prediction. However, typical string kernel methods rely on symbolic Hamming-distance based matching, which may not necessarily reflect the underlying (e.g., physical) similarity between sequence fragments. In this work we propose a novel computational framework that uses general similarity metrics S(·; ·) and distance-preserving embeddings with string kernels to improve sequence classification. In particular, we consider two approaches that allow one either to incorporate non-Hamming similarity S(·; ·) into similarity evaluation by matching only the features that are similar according to S(·; ·), or to retain actual (approximate) similarity/distance scores in similarity evaluation. An embedding step, a distance-preserving bit-string mapping, is used to effectively capture similarity between otherwise symbolically different sequence elements. We show that it is possible to retain the computational efficiency of string kernels while using this more "precise" measure of similarity. We then demonstrate that on a number of sequence classification tasks, such as music and biological sequence classification, the new method can substantially improve upon state-of-the-art string kernel baselines.
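As a rough illustration of the embedding idea described in the abstract, the sketch below maps each sequence element to a bit string whose Hamming distance tracks a numeric distance between elements, then counts pairs of k-mers that are close under that distance. It is not the authors' algorithm: the thermometer (unary) code, the function names, and the brute-force pairwise comparison are all assumptions chosen for brevity, and the quadratic loop makes no attempt at the computational efficiency the abstract refers to.

```python
def thermometer(value, levels):
    """Unary (thermometer) code, a hypothetical choice of embedding.

    For integers a, b in [0, levels], the Hamming distance between
    their codes equals |a - b|.
    """
    return tuple(1 if i < value else 0 for i in range(levels))


def embed(kmer, levels):
    """Concatenate the bit codes of every element in a k-mer."""
    bits = []
    for v in kmer:
        bits.extend(thermometer(v, levels))
    return tuple(bits)


def hamming(a, b):
    """Number of positions at which two equal-length tuples differ."""
    return sum(x != y for x, y in zip(a, b))


def kmers(seq, k):
    """All contiguous length-k fragments of a sequence."""
    return [tuple(seq[i:i + k]) for i in range(len(seq) - k + 1)]


def similarity_kernel(x, y, k, levels, max_dist):
    """Count k-mer pairs whose embedded Hamming distance is <= max_dist.

    Brute force over all pairs: for illustration only.
    """
    ex = [embed(f, levels) for f in kmers(x, k)]
    ey = [embed(f, levels) for f in kmers(y, k)]
    return sum(1 for a in ex for b in ey if hamming(a, b) <= max_dist)


if __name__ == "__main__":
    x = [0, 1, 2, 3]  # toy sequences over integer "symbols" 0..3
    y = [0, 1, 3, 3]
    print(similarity_kernel(x, y, k=2, levels=4, max_dist=0))
    print(similarity_kernel(x, y, k=2, levels=4, max_dist=1))
```

With `max_dist=0` only identical k-mers match, which corresponds to exact symbolic matching. Raising `max_dist` lets fragments such as `(1, 2)` and `(1, 3)` match because their elements are numerically close, even though they differ as symbols.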

Original language: American English
Title of host publication: Proceedings of the 12th SIAM International Conference on Data Mining, SDM 2012
Pages: 873-882
Number of pages: 10
State: Published - 2012
Event: 12th SIAM International Conference on Data Mining, SDM 2012 - Anaheim, CA, United States
Duration: Apr 26 2012 → Apr 28 2012

Publication series

Name: Proceedings of the 12th SIAM International Conference on Data Mining, SDM 2012

Conference

Conference: 12th SIAM International Conference on Data Mining, SDM 2012
Country/Territory: United States
City: Anaheim, CA
Period: 4/26/12 → 4/28/12

ASJC Scopus subject areas

  • Computer Science Applications

Keywords

  • Classification
  • Sequence analysis
  • String kernels
