I swung by a talk today at the Boston Predictive Analytics Meetup on using Bayesian classification to recommend articles. The talk was given by Rani Nelken from Outbrain, the company responsible for the recommended articles in the footer of CNN or boston.com news articles.
The process starts by constructing a topic ontology: a set of classes, or categories, of news articles. The ontology can change over time, but any change requires retraining the classifiers (more on that later). The method for finding the most likely category is remarkably simple and built on Bayes' rule. For each category C they calculate the probability P(W|C), which is the number of occurrences of a word W in C divided by the total number of words in C. Then, for each article, they calculate the probability that it belongs to each category given the words in the article. I was surprised at the simplicity of the feature selection (using all of the words rather than a subset), and when asked about it Nelken responded that feature selection is more useful when there are fewer classes. I imagine it also helps that the texts are several paragraphs long, so a feature vector built from term frequencies carries enough signal to be informative.
A few details: to correct for rare words, add a pseudo count of one to each word count, and to prevent underflow (a problem when multiplying many small values), take the logarithm of each probability and add the logs instead of multiplying.
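To make the counting concrete, here is a minimal Python sketch of the steps described above. It is not Outbrain's code: the function names, the category prior based on document counts, and the choice to add the vocabulary size to the denominator (the usual companion to a pseudo count of one) are my own assumptions.

```python
import math
from collections import Counter, defaultdict

def train(labeled_docs):
    """labeled_docs: iterable of (category, list_of_words) pairs."""
    word_counts = defaultdict(Counter)  # category -> word -> occurrences
    doc_counts = Counter()              # category -> number of training documents
    vocab = set()
    for category, words in labeled_docs:
        word_counts[category].update(words)
        doc_counts[category] += 1
        vocab.update(words)
    return word_counts, doc_counts, vocab

def classify(words, word_counts, doc_counts, vocab):
    """Return (best_category, log_score) for a list of words."""
    total_docs = sum(doc_counts.values())
    best_category, best_score = None, -math.inf
    for category, counts in word_counts.items():
        total_words = sum(counts.values())
        # Assumption: prior P(C) estimated from training document counts.
        score = math.log(doc_counts[category] / total_docs)
        for w in words:
            # Pseudo count of one; adding len(vocab) keeps P(W|C) summing to 1.
            p = (counts[w] + 1) / (total_words + len(vocab))
            # Sum logs instead of multiplying probabilities to avoid underflow.
            score += math.log(p)
        if score > best_score:
            best_category, best_score = category, score
    return best_category, best_score
```

With two tiny training documents, one per category, calling classify on a couple of words from the first document should pick that document's category, and the returned score is a negative log value rather than a vanishingly small product.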
Initially he used Weka to perform the classification. It is fine for smaller corpora, but it did not work so well in production. Mahout was much faster and scaled better, and it can also be easily parallelized.
There was a brief mention of Complementary Naive Bayes, which was more successful for this type of classification. Digrams and trigrams can also be used as features, though they increase computation time. Complementary Naive Bayes with digrams was the most accurate in the examples he presented.
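The talk did not go into the details, so the following is only a sketch of the general idea as I understand it: each category is scored using word counts from every other category, and the category whose complement explains the document worst wins. The function names are hypothetical, and the extra weight normalization steps from the published method are left out.

```python
import math
from collections import Counter

def digrams(words):
    """Turn a word list into adjacent word pairs, e.g. 'red sox'."""
    return [" ".join(pair) for pair in zip(words, words[1:])]

def complement_classify(features, feature_counts):
    """feature_counts: dict mapping category -> Counter of feature counts."""
    vocab = set().union(*feature_counts.values())
    best_category, best_score = None, math.inf
    for category in feature_counts:
        # Pool the counts from every category except this one.
        complement = Counter()
        for other, counts in feature_counts.items():
            if other != category:
                complement.update(counts)
        total = sum(complement.values())
        # Log-likelihood of the document under the complement (pseudo count of one).
        score = sum(
            math.log((complement[f] + 1) / (total + len(vocab)))
            for f in features
        )
        # The best category is the one whose complement fits the document worst.
        if score < best_score:
            best_category, best_score = category, score
    return best_category
```

To use digrams as features, pass digrams(words) both when building the counts and when classifying. The feature space grows quickly, which is where the extra computation time comes from.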
Some of the difficulties in implementation come from training the classifier. Tagging articles can be expensive; one option is to crowdsource the tags and then verify them with experts. An additional technique is to run several classifiers on a document and raise a flag when their outputs diverge. This strategy came up several times during the talk. On the question of how to improve article tagging, the advice was that a diversity of classifiers is better than improving any single classifier. One of the classifiers Outbrain uses illustrates this: it looks solely at the URL of a news article. It performs poorly on its own, but it is successful as part of an ensemble of classifiers.
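The flagging idea can be sketched in a few lines of Python. This is my own illustration, not something shown in the talk: the combine function, the majority vote, and the min_agreement threshold are all assumptions.

```python
from collections import Counter

def combine(predictions, min_agreement=0.5):
    """predictions: one category label per classifier, for a single document.

    Returns (label, agreement, flagged). The label is the majority vote;
    flagged is True when too few classifiers agree on it, which marks the
    document for human review.
    """
    votes = Counter(predictions)
    label, count = votes.most_common(1)[0]
    agreement = count / len(predictions)
    # min_agreement is a hypothetical threshold; tune it for your classifiers.
    return label, agreement, agreement < min_agreement
```

For example, if a word-based classifier and a URL-based classifier both say "sports" while a third says "politics", the document is tagged "sports" with no flag. If all three disagree, the flag is raised.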
Check out the Mahout demo on GitHub. It uses several components, including a feature extractor (unigrams), a TF-IDF driver (for normalizing common words), and a Bayes sum driver to add up all of the probabilities.
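For anyone unfamiliar with TF-IDF, here is a small Python sketch of the textbook weighting. It is not Mahout's driver, and the exact formula Mahout uses may differ; the tfidf function name and the plain log(N / document frequency) form are assumptions on my part.

```python
import math
from collections import Counter

def tfidf(docs):
    """docs: list of word lists. Returns one {word: weight} dict per document."""
    n = len(docs)
    # Document frequency: in how many documents does each word appear?
    doc_freq = Counter()
    for words in docs:
        doc_freq.update(set(words))
    vectors = []
    for words in docs:
        tf = Counter(words)
        vectors.append({
            # Term frequency scaled by how rare the word is across documents.
            w: (c / len(words)) * math.log(n / doc_freq[w])
            for w, c in tf.items()
        })
    return vectors
```

A word that appears in every document gets a weight of zero, which is the sense in which common words are normalized away.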
The talk for the most part focused on supervised learning (which uses a training step). When asked about unsupervised learning, Nelken mentioned using it to separate a larger class into several smaller classes. Seeing the documents self-organize based on TF-IDF feature vectors helps highlight emergent categories within a select corpus. When confronting an unfamiliar corpus, I imagine this could be a useful technique in the first steps of developing the ontology.
On the whole it was good to see the simplicity of using Mahout, as well as hear of its effectiveness in production.
Edit 5/2/12: Here are a couple of references Rani sent over: "Rank Aggregation Methods for the Web" and "Tackling the Poor Assumptions of Naive Bayes Text Classifiers".