Baselines and bigrams: Simple, good sentiment and topic classification

Sida Wang, Christopher D. Manning

Research output: Chapter in Book/Report/Conference proceeding, Conference contribution

671 Scopus citations

Abstract

Variants of Naive Bayes (NB) and Support Vector Machines (SVM) are often used as baseline methods for text classification, but their performance varies greatly depending on the model variant, features used, and task/dataset. We show that: (i) the inclusion of word bigram features gives consistent gains on sentiment analysis tasks; (ii) for short snippet sentiment tasks, NB actually does better than SVMs (while for longer documents the opposite result holds); (iii) a simple but novel SVM variant using NB log-count ratios as feature values consistently performs well across tasks and datasets. Based on these observations, we identify simple NB and SVM variants which outperform most published results on sentiment analysis datasets, sometimes providing a new state-of-the-art performance level.
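
The abstract does not spell out how the NB log-count ratios in point (iii) are computed, so the following is only a minimal sketch under stated assumptions: whitespace tokenization, unigram plus bigram presence features, and add-alpha smoothing of the per-class counts. The function names (`ngrams`, `log_count_ratios`, `nb_features`) and the toy documents are hypothetical and are not taken from the paper. The resulting feature values would then be passed to a linear SVM, which is not shown here.

```python
import math
from collections import Counter


def ngrams(text, max_n=2):
    """Lowercased whitespace tokens plus word n-grams up to max_n (assumed tokenization)."""
    tokens = text.lower().split()
    feats = list(tokens)
    for n in range(2, max_n + 1):
        feats += [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]
    return feats


def log_count_ratios(docs, labels, alpha=1.0):
    """Return {feature: log((p / |p|_1) / (q / |q|_1))}.

    p and q are the smoothed feature counts over positive (label 1) and
    negative (label 0) documents; alpha is an assumed smoothing constant.
    """
    pos, neg = Counter(), Counter()
    for text, label in zip(docs, labels):
        present = set(ngrams(text))  # presence per document, not raw frequency
        (pos if label == 1 else neg).update(present)
    vocab = set(pos) | set(neg)
    p_total = sum(pos[f] + alpha for f in vocab)
    q_total = sum(neg[f] + alpha for f in vocab)
    return {
        f: math.log(((pos[f] + alpha) / p_total) / ((neg[f] + alpha) / q_total))
        for f in vocab
    }


def nb_features(text, ratios):
    """Sparse feature vector: the log-count ratio of each known feature present in text."""
    return {f: ratios[f] for f in set(ngrams(text)) if f in ratios}


# Toy data, invented for illustration only.
docs = ["a great movie", "great acting", "a dull movie", "dull and boring"]
labels = [1, 1, 0, 0]
ratios = log_count_ratios(docs, labels)
features = nb_features("a great film", ratios)
```

In this sketch a feature seen mostly in positive documents receives a positive value and one seen mostly in negative documents receives a negative value, so the sign and magnitude of each input to the SVM already carry the NB evidence for that word or bigram.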

Original language: American English
Title of host publication: 50th Annual Meeting of the Association for Computational Linguistics, ACL 2012 - Proceedings of the Conference
Pages: 90-94
Number of pages: 5
State: Published - Dec 1 2012
Event: 50th Annual Meeting of the Association for Computational Linguistics, ACL 2012 - Jeju Island, Korea, Republic of
Duration: Jul 8 2012 to Jul 14 2012

Publication series

Name: 50th Annual Meeting of the Association for Computational Linguistics, ACL 2012 - Proceedings of the Conference
Volume: 2

Conference

Conference: 50th Annual Meeting of the Association for Computational Linguistics, ACL 2012
Country/Territory: Korea, Republic of
City: Jeju Island
Period: 7/8/12 to 7/14/12

ASJC Scopus subject areas

  • Computational Theory and Mathematics
  • Software
