Predictions Based on Temporality of Web Content

I came across the work of Christopher Ahlberg, in particular his recent venture Recorded Future, based here in Cambridge.  It looks like a ton of analysis and parsing of both structured data (e.g. finance) and unstructured data (e.g. articles and blog posts).  In this brief sample of his O’Reilly Strata talk, Ahlberg reveals a bit of the process under the hood.

For unstructured text they parse for temporal signals (phrases like “next week” or “tomorrow”), and then index information based on the time the content refers to: a temporal index.  This is pretty intriguing.  So it is not prediction as one might traditionally think.  Instead of creating projections of the future from past data, you are looking at what people are collectively saying about the future.  The thought here is that at the macro level, there is some valuable insight there.
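As a rough sketch of what that parsing might look like, here is a toy temporal indexer in Python.  The phrase list and day offsets are my own guesses for illustration, not Recorded Future's actual extraction, which would need a far richer temporal grammar:

```python
import re
from collections import defaultdict
from datetime import date, timedelta

# Hypothetical mapping from a few temporal phrases to day offsets.
SIGNALS = {
    "tomorrow": 1,
    "next week": 7,
    "next month": 30,
}

def temporal_index(documents, today):
    """Index documents by the future dates their text refers to."""
    index = defaultdict(list)
    for doc_id, text in documents.items():
        lowered = text.lower()
        for phrase, offset in SIGNALS.items():
            if re.search(r"\b" + re.escape(phrase) + r"\b", lowered):
                index[today + timedelta(days=offset)].append(doc_id)
    return index

docs = {"a1": "The summit is scheduled for next week.",
        "a2": "Earnings will be reported tomorrow."}
idx = temporal_index(docs, date(2012, 5, 1))
# idx[date(2012, 5, 8)] -> ["a1"], idx[date(2012, 5, 2)] -> ["a2"]
```

Querying the index for a future date then returns everything the web is collectively saying about that date.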

I find this a pretty compelling way of coarse-graining all the data coming from news sources.  Time to start harvesting news RSS feeds…

News Article Recommendations Using Bayesian Classification with Apache Mahout

I swung by a talk today at the Boston Predictive Analytics Meetup on using Bayesian classification to recommend articles.  The talk was given by Rani Nelken from Outbrain, the company responsible for the recommended articles in the footer of CNN and other news sites.

Basically the process is to first construct a topic ontology: a set of classes, or categories, of news articles.  This can certainly change over time, but changing it requires retraining the classifiers (more on that later).  To calculate the most likely category the method is remarkably simple, built off of Bayes’ rule.  For each category C they calculate the probability P(W|C), which is the number of occurrences of a word W in C divided by the total number of words in C.  Then for each article they can calculate the probability it belongs to a category given the words in the article.  I was surprised at the simplicity of this feature selection (using all of the words rather than a subset), and when asked about it Nelken responded that feature selection is more useful when there are fewer classes.  I imagine it is also because the texts are several paragraphs long, so a feature vector built from term frequencies proves informative enough on its own.

A few details: to correct for rare words, add a pseudo-count of one, and to prevent underflow (a problem when multiplying many small values), take the log of the probabilities and sum them instead.
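Putting those pieces together, here is a minimal sketch of the classifier in Python.  The toy categories and training data are mine, and I am assuming uniform class priors; the real system runs on Mahout over a large tagged corpus:

```python
import math
from collections import Counter

def train(labeled_docs):
    """labeled_docs: list of (category, word list) pairs. Returns per-category
    word counts and totals, plus the overall vocabulary."""
    counts, totals, vocab = {}, Counter(), set()
    for cat, words in labeled_docs:
        counts.setdefault(cat, Counter()).update(words)
        totals[cat] += len(words)
        vocab.update(words)
    return counts, totals, vocab

def classify(words, counts, totals, vocab):
    """Pick the category maximizing the sum of log P(w|C), with a pseudo-count
    of one for rare/unseen words and logs to avoid underflow.
    Assumes uniform class priors."""
    best, best_score = None, float("-inf")
    for cat in counts:
        score = 0.0
        for w in words:
            # add-one smoothing: (count + 1) / (total + |vocab|)
            p = (counts[cat][w] + 1) / (totals[cat] + len(vocab))
            score += math.log(p)
        if score > best_score:
            best, best_score = cat, score
    return best

train_set = [
    ("sports", "goal match team goal".split()),
    ("finance", "stocks market rally market".split()),
]
counts, totals, vocab = train(train_set)
print(classify("the market rally continued".split(), counts, totals, vocab))
# prints "finance"
```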

Initially he used Weka for performing the classification; it is fine for smaller corpora, but in production it did not work so well.  Mahout was much faster and scaled better, and it can also be easily parallelized.

There was a brief mention of Complementary Naive Bayes, which was more successful for this type of classification.  Bigrams and trigrams can also be used as features, though they increase computational time.  Complementary Naive Bayes with bigrams was the most accurate in the examples he presented.
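For reference, extracting bigrams (or any n-grams) from a token list is a one-liner; the cost comes from the much larger feature space they induce, not from the extraction itself:

```python
def ngrams(words, n):
    """Return the n-grams (as tuples) in a token sequence."""
    return [tuple(words[i:i + n]) for i in range(len(words) - n + 1)]

tokens = "the market rally continued".split()
print(ngrams(tokens, 2))
# [('the', 'market'), ('market', 'rally'), ('rally', 'continued')]
```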

Some difficulties in implementation include training the classifier.  Tagging articles can be expensive; one option is to crowdsource the tags, then verify them with experts.  An additional technique is to run several classifiers on a document and raise a flag when they diverge.  This strategy came up several times during the talk: when discussing how to improve article tagging, rather than improving single classifiers, it is better to have a diversity of classifiers.  This is exemplified by one of the classifiers Outbrain uses, which looks solely at the URL of a news article.  While it performs poorly on its own, it is successful within an ensemble of classifiers.
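The divergence flag might look something like this sketch, where the stand-in classifiers and the majority threshold are purely illustrative, not Outbrain's actual setup:

```python
from collections import Counter

def flag_divergence(article, classifiers, threshold=0.5):
    """Run several classifiers on one article; flag it for human review
    when no single label wins a clear majority of the votes."""
    labels = [clf(article) for clf in classifiers]
    label, votes = Counter(labels).most_common(1)[0]
    agreement = votes / len(labels)
    return label, agreement < threshold  # True means "send to a human"

# Three toy classifiers that fully disagree, so the article gets flagged.
classifiers = [lambda a: "sports", lambda a: "finance", lambda a: "politics"]
label, flagged = flag_divergence("some article text", classifiers)
# flagged is True
```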

Check out the Mahout demo on GitHub.  It uses several pieces, including a feature extractor (unigrams), a TF-IDF driver (for normalizing common words) and a Bayes sum driver to sum the probabilities.

The talk for the most part focused on supervised learning (using a training step), and when asked about unsupervised learning Nelken mentioned using it when separating a larger class into several smaller classes.  Seeing the documents self-organize based on TF-IDF feature vectors helps highlight emergent categories within a corpus.  When confronting a new corpus one is unfamiliar with, I imagine this could be a useful technique in the first steps of developing the ontology.
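As a small illustration of those TF-IDF vectors (pure Python on toy documents of my own; in practice Mahout's TF-IDF driver does this at scale), documents sharing distinctive terms end up close together, which is what lets clusters emerge:

```python
import math
from collections import Counter

def tfidf_vectors(docs):
    """docs: dict of id -> token list. Returns id -> {term: tf-idf weight}."""
    n = len(docs)
    df = Counter()  # document frequency of each term
    for words in docs.values():
        df.update(set(words))
    vectors = {}
    for doc_id, words in docs.items():
        tf = Counter(words)
        vectors[doc_id] = {w: (c / len(words)) * math.log(n / df[w])
                           for w, c in tf.items()}
    return vectors

def cosine(u, v):
    """Cosine similarity between two sparse term-weight vectors."""
    dot = sum(u[w] * v.get(w, 0.0) for w in u)
    nu = math.sqrt(sum(t * t for t in u.values()))
    nv = math.sqrt(sum(t * t for t in v.values()))
    return dot / (nu * nv)

docs = {"a": "market rally market".split(),
        "b": "market stocks rally".split(),
        "c": "goal match team".split()}
vecs = tfidf_vectors(docs)
# "a" and "b" share finance terms, so they sit closer than "a" and "c"
print(cosine(vecs["a"], vecs["b"]) > cosine(vecs["a"], vecs["c"]))
# prints True
```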

On the whole it was good to see the simplicity of using Mahout, as well as hear of its effectiveness in production.

edit 5/2/12: Here are a couple of references Rani sent over: Rank Aggregation Methods for the Web and Tackling the Poor Assumptions of Naive Bayes Text Classifiers.

The Adjacent Possible and Unpredictable Future

Exploring the trajectories of complex systems.

Stuart Kauffman spoke remotely at the NECSI 2011 Winter School.  He has made significant contributions to the field of complexity, in particular with his work on self-organization and Boolean networks.  I found the talk most interesting because it took commonly discussed issues, determinism and evolution, and approached them from a different angle.  How did determinism affect our culture?  How did it affect the way in which we view nature (hinting at the crisis we are in at present)?  The talk itself presents nature as a powerful ‘problem’ solver: from something we tried to harness, to something that will survive whether or not we do.

Continue reading “The Adjacent Possible and Unpredictable Future”