TY - GEN
T1 - Generalized similarity kernels for efficient sequence classification
AU - Kuksa, Pavel P.
AU - Khan, Imdadullah
AU - Pavlovic, Vladimir
PY - 2012
Y1 - 2012
N2 - String kernel-based machine learning methods have yielded great success in practical tasks of struc- Tured/sequential data analysis. They often exhibit state-of-the-art performance on tasks such as docu- ment topic elucidation, music genre classification, pro- Tein superfamily and fold prediction. However, typi- cal string kernel methods rely on symbolic Hamming- distance based matching which may not necessarily reect the underlying (e.g., physical) similarity between sequence fragments. In this work we propose a novel computational framework that uses general similarity metrics S(·; ·) and distance-preserving embeddings with string kernels to improve sequence classification. In par- Ticular, we consider two approaches that allow one ei- Ther to incorporate non-Hamming similarity S(·;·) into similarity evaluation by matching only the features that are similar according to S(·; ·) or to retain actual (ap- proximate) similarity/distance scores in similarity eval- uation. An embedding step, a distance-preserving bit- string mapping, is used to effectively capture similarity between otherwise symbolically different sequence ele- ments. We show that it is possible to retain computa- Tional efficiency of string kernels while using this more "precise" measure of similarity. We then demonstrate that on a number of sequence classification tasks such as music, and biological sequence classification, the new method can substantially improve upon state-of-the-art string kernel baselines.
AB - String kernel-based machine learning methods have yielded great success in practical tasks of struc- Tured/sequential data analysis. They often exhibit state-of-the-art performance on tasks such as docu- ment topic elucidation, music genre classification, pro- Tein superfamily and fold prediction. However, typi- cal string kernel methods rely on symbolic Hamming- distance based matching which may not necessarily reect the underlying (e.g., physical) similarity between sequence fragments. In this work we propose a novel computational framework that uses general similarity metrics S(·; ·) and distance-preserving embeddings with string kernels to improve sequence classification. In par- Ticular, we consider two approaches that allow one ei- Ther to incorporate non-Hamming similarity S(·;·) into similarity evaluation by matching only the features that are similar according to S(·; ·) or to retain actual (ap- proximate) similarity/distance scores in similarity eval- uation. An embedding step, a distance-preserving bit- string mapping, is used to effectively capture similarity between otherwise symbolically different sequence ele- ments. We show that it is possible to retain computa- Tional efficiency of string kernels while using this more "precise" measure of similarity. We then demonstrate that on a number of sequence classification tasks such as music, and biological sequence classification, the new method can substantially improve upon state-of-the-art string kernel baselines.
KW - Classification
KW - Sequence analysis
KW - String kernels
UR - http://www.scopus.com/inward/record.url?scp=84880201928&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=84880201928&partnerID=8YFLogxK
M3 - Conference contribution
SN - 9781611972320
T3 - Proceedings of the 12th SIAM International Conference on Data Mining, SDM 2012
SP - 873
EP - 882
BT - Proceedings of the 12th SIAM International Conference on Data Mining, SDM 2012
T2 - 12th SIAM International Conference on Data Mining, SDM 2012
Y2 - 26 April 2012 through 28 April 2012
ER -