keep using emoticons, they help me train my classifier

Classifying sentiment is a popular topic in natural language processing research, and it is also valuable in industry for understanding what groups of people think on a broader scale.  One could read each text and decide whether it is positive or negative, but for larger corpora this becomes impractical.  This is where NLP and machine learning come in handy.

For a good overview of the value of this type of research, check out this O’Reilly Strata talk on the future of NLP.

I tried an approach using Sentiword and the movie reviews in the NLTK corpus to classify Twitter sentiment. This did not work. Perhaps it would work for longer pieces of text, but a tweet is too short, and the way language is used in tweets is very different from movie reviews.

Soon I found recent work that used emoticons to label tweets.  A tweet is labeled positive if it has a happy emoticon like : ) or : D, and negative if it has a sad emoticon like : (.  Here is one paper using this labeling method with several classifiers (Twitter Sentiment Classification using Distant Supervision; Go, Bhayani, Huang), and another that extends that research to account for neutral tweets (Twitter as a Corpus for Sentiment Analysis and Opinion Mining; Pak, Paroubek).
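To make the labeling rule concrete, here is a minimal sketch in Python. The emoticon lists and the function name `label_tweet` are my own assumptions for illustration, and the papers use their own sets. I also chose to skip tweets that contain both a happy and a sad emoticon, since the label would be ambiguous.

```python
# Hypothetical emoticon lists, for illustration only.
POSITIVE = (":)", ":-)", ":D", "=)")
NEGATIVE = (":(", ":-(")


def label_tweet(text):
    """Return "positive", "negative", or None if there is no clear label."""
    has_pos = any(e in text for e in POSITIVE)
    has_neg = any(e in text for e in NEGATIVE)
    if has_pos == has_neg:
        # Either no emoticon at all, or mixed signals: leave unlabeled.
        return None
    return "positive" if has_pos else "negative"


if __name__ == "__main__":
    for tweet in ["great day :)", "missed my bus :(", "just a tweet"]:
        print(tweet, "->", label_tweet(tweet))
```

Plain substring matching like this will mislabel some tweets, which is part of why the labels are considered noisy.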

What’s kind of cool about this method is that it can work on any language where the emoticons are the same, generally : ) for happy and : ( for sad.  The method is called distant supervision with noisy labels, since emoticons are not a completely accurate labeler. But with a ton of training data, which is easy to get from Twitter, accuracy can exceed 80% (see the papers for details).
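Here is a rough sketch of what training on noisy labels can look like. It is a tiny naive Bayes classifier with add-one smoothing, written from scratch. This is one simple choice and not necessarily the setup from either paper. The function names and the four toy tweets are made up for illustration. I assume the emoticons were already used to produce the labels and then removed from the text, so the classifier has to learn from the words.

```python
import math
from collections import Counter, defaultdict


def tokenize(text):
    # Naive whitespace tokenizer; real tweets need more care.
    return text.lower().split()


def train(examples):
    """examples: iterable of (text, label) pairs with emoticon-derived labels."""
    word_counts = defaultdict(Counter)
    label_counts = Counter()
    for text, label in examples:
        label_counts[label] += 1
        word_counts[label].update(tokenize(text))
    vocab = {w for counts in word_counts.values() for w in counts}
    return word_counts, label_counts, vocab


def predict(model, text):
    word_counts, label_counts, vocab = model
    total = sum(label_counts.values())
    best_label, best_score = None, -math.inf
    for label in label_counts:
        # Log prior plus smoothed log likelihood of each known word.
        score = math.log(label_counts[label] / total)
        denom = sum(word_counts[label].values()) + len(vocab)
        for word in tokenize(text):
            if word in vocab:  # ignore words never seen in training
                score += math.log((word_counts[label][word] + 1) / denom)
        if score > best_score:
            best_label, best_score = label, score
    return best_label


# Made-up toy data; a real run would use many thousands of tweets.
toy = [
    ("love this sunny day", "positive"),
    ("great coffee this morning", "positive"),
    ("missed my bus again", "negative"),
    ("terrible traffic this morning", "negative"),
]
model = train(toy)
print(predict(model, "love this coffee"))
```

With only four examples this proves nothing about accuracy. The point is that no one had to label the training data by hand.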

It’s kind of funny: at some point using emoticons may have felt juvenile, and perhaps it signified the degradation of language. Now, in this context, emoticons provide a way to machine-learn sentiment, allowing for an understanding of what larger populations of people are thinking.  So keep using emoticons. They are valuable: they help me train my classifier.