analytics

priceonomics – data mining and analytics with public data on the web

Saturday, May 5th, 2012

I came across the startup priceonomics and I love how it combines a clear expression of value with simplicity.  Search for a product you like and it shows you the price range and distribution, the price change over time, and an option to receive an e-mail if the product appears at a good price.

What I love about this startup immediately:

*Data mining: these prices are pulled from all over the web

*Analysis: the distribution of prices is presented, and used to determine when there is a good deal for a consumer

*Design: all of this sits under a simple-to-use interface, making the tool very accessible and easy to explore.

*No sign up: I don’t have to sign up anywhere to use this. There is no social element. I don’t have to share. This is so refreshing.
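I don't know how priceonomics actually decides what counts as a good deal, but the idea of using the price distribution can be sketched in a few lines. This is a minimal illustration under my own assumptions: the function name is_good_deal and the rule "cheaper than the lowest quarter of observed prices" are hypothetical, not the site's method.

```python
from statistics import quantiles

def is_good_deal(price, observed_prices):
    """Hypothetical rule: a price is a good deal if it falls below
    the first quartile of the prices observed for the same product."""
    # quantiles(..., n=4) returns the three quartile cut points
    first_quartile = quantiles(observed_prices, n=4)[0]
    return price < first_quartile
```

With observed prices running from 100 to 170, a listing at 95 would be flagged and one at 150 would not. A real product would also need enough observations per item for the quartile to mean anything.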

I use some of these skills day to day: (1) getting public data from the web (scraping, APIs, or database downloads), and (2) running some sort of data analysis.  Yes, I realize that’s very vague, but what I find most compelling is that this startup combines those skills with a simple interface and creates a consumer product out of it. What other products would be useful, following a similar approach: combining public data available on the web, analytics, and clear design?  I would love to see more consumer analytics products that are not open ended, that target a specific sector, and that make use of publicly available data.

In some ways this reminds me of the sophisticated https://www.decide.com/, which applies machine learning to tons of consumer electronics data to help you know the best time to buy a product you’re interested in.  See this NYTimes Bits article for more details on Decide.  The company was founded by Oren Etzioni, who is also behind Farecast, a company that helped you determine the best time to buy plane tickets (it was acquired by Microsoft and became part of Bing Travel).

What other sectors or problems could benefit from this type of product? I imagine it begins with thinking about where we would need it day to day: the sort of data we collect, wade through, and make decisions on.  Where is it frustrating and where can it be made simpler?  priceonomics does this for buying most products, Decide for consumer electronics, and Farecast for plane tickets.  I immediately start brainstorming potential areas of opportunity as jumping-off points: nutrition, health, education, and jobs.

News Article Recommendations Using Bayesian Classification with Apache Mahout

Tuesday, May 1st, 2012

I swung by a talk today at the Boston Predictive Analytics Meetup on using Bayesian classification to recommend articles.  The talk was given by Rani Nelken from Outbrain, the company responsible for the recommended articles in the footer of CNN or boston.com news articles.

Basically the process is to first construct a topic ontology: a set of classes, or categories, of news articles.  This can certainly change over time, but will require retraining the classifiers (more on that later).  To calculate the most likely category the method is remarkably simple, built on Bayes' rule. For each category C they calculate the probability P(W|C), which is the number of occurrences of a word W in C divided by the total number of words in C.  Then for each article they can calculate the probability that it belongs to a category given the words in the article, by combining the P(W|C) values of those words.  I was surprised at the simplicity of this feature selection (using all of the words rather than a subset), and when asked about it Nelken responded that feature selection is more useful when there are fewer classes.  I imagine it also helps that the texts are several paragraphs long, so a feature vector built from term frequencies is informative enough.

A few details: to correct for rare words, add a pseudo count of one to each word count, and to prevent underflow (a problem when multiplying many small values), take the logs of the probabilities and add them.
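To make the steps above concrete, here is a minimal sketch of that kind of classifier in plain Python. It is my own illustration, not Outbrain's code or Mahout's implementation: the function names train and classify are made up, and weighting each category by its share of the training articles (the prior) is my assumption, since the talk notes above don't mention it.

```python
import math
from collections import Counter, defaultdict

def train(labeled_docs):
    """Count words per category. labeled_docs holds (category, words) pairs."""
    word_counts = defaultdict(Counter)  # category -> word -> occurrences
    doc_counts = Counter()              # category -> number of articles
    vocab = set()
    for category, words in labeled_docs:
        word_counts[category].update(words)
        doc_counts[category] += 1
        vocab.update(words)
    return word_counts, doc_counts, vocab

def classify(words, word_counts, doc_counts, vocab):
    """Return the category with the highest log-probability score."""
    total_docs = sum(doc_counts.values())
    best, best_score = None, -math.inf
    for category, counts in word_counts.items():
        total_words = sum(counts.values())
        # Assumed prior: the category's share of the training articles.
        score = math.log(doc_counts[category] / total_docs)
        for w in words:
            # P(W|C) with a pseudo count of one; summing logs avoids underflow.
            score += math.log((counts[w] + 1) / (total_words + len(vocab)))
        if score > best_score:
            best, best_score = category, score
    return best
```

Trained on a handful of tagged articles, classify returns the category whose word counts best explain the new article's words. Adding the vocabulary size to the denominator keeps the smoothed P(W|C) values summing to one.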

Initially he used Weka for performing the classification. It is fine for smaller corpora, but did not work so well in production.  Mahout was much faster and scaled better.  It can also be easily parallelized.

There was a brief mention of Complementary Naive Bayes, which was more successful for this type of classification.  Digrams and trigrams can be used as features too, though they increase computation time. Complementary Naive Bayes with digrams was the most accurate in the examples he presented.
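As I understand it, the complement variant scores each category by how poorly the article fits the words of all the other categories, and picks the category with the worst fit to its complement. Here is a simplified sketch of that idea with digram features. The names ngrams, train_complement and classify_complement are my own, and this leaves out the extra weight normalization steps of the full method, so treat it as an illustration rather than what Mahout does.

```python
import math
from collections import Counter

def ngrams(words, n):
    """Return the n-word sequences in `words`, joined with spaces."""
    return [" ".join(words[i:i + n]) for i in range(len(words) - n + 1)]

def train_complement(labeled_docs):
    """Count features per category and overall.
    labeled_docs holds (category, features) pairs."""
    per_category = {}
    overall = Counter()
    for category, features in labeled_docs:
        per_category.setdefault(category, Counter()).update(features)
        overall.update(features)
    return per_category, overall

def classify_complement(features, per_category, overall):
    """Pick the category whose complement explains the features worst."""
    vocab_size = len(overall)
    total = sum(overall.values())
    best, best_score = None, math.inf
    for category, counts in per_category.items():
        complement_total = total - sum(counts.values())
        score = 0.0
        for f in features:
            # Occurrences of f in every category except this one, plus one.
            complement_count = overall[f] - counts[f] + 1
            score += math.log(complement_count / (complement_total + vocab_size))
        if score < best_score:  # lowest fit to the complement wins
            best, best_score = category, score
    return best
```

To use digrams, pass words + ngrams(words, 2) as the feature list for both training and classification. The vocabulary grows quickly once word pairs are included, which is where the extra computation time comes from.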

One difficulty in implementation is training the classifier.  Tagging articles can be expensive; one option is to crowdsource the tags, then verify them with experts.  An additional technique is to run several classifiers on a document, and then raise a flag when their answers diverge.  This strategy came up several times during the talk.  On the question of how to improve article tagging, the advice was that a diversity of classifiers is better than improving any single classifier.  This is exemplified by one of the classifiers Outbrain uses, which looks only at the URL of a news article.  While it performs poorly on its own, it is successful as part of an ensemble of classifiers.
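The flagging idea can be sketched as a simple vote. This is my own illustration, not Outbrain's logic: the function name vote and the min_agreement threshold are hypothetical, and I am assuming "diverge" means that no single category gets more than half of the votes.

```python
from collections import Counter

def vote(predictions, min_agreement=0.5):
    """Combine one predicted category per classifier.
    Returns (winning category, needs_review flag)."""
    winner, votes = Counter(predictions).most_common(1)[0]
    # Flag the document when the top category lacks a clear majority.
    needs_review = votes / len(predictions) <= min_agreement
    return winner, needs_review
```

If two of three classifiers say sports, the document is tagged sports with no flag. If all three disagree, the flag is raised and the document can be sent to a person for tagging.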

Check out the Mahout demo on GitHub.  It uses several methods, including a feature extractor (unigrams), a TFIDF driver (for normalizing common words), and a Bayes sum driver to add all of the probabilities.
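For a sense of what TFIDF weighting does, here is a minimal sketch in plain Python. It uses one common textbook form (term frequency times the log of the inverse document frequency); the function name tfidf is mine, and Mahout's driver may weight things differently.

```python
import math
from collections import Counter

def tfidf(docs):
    """docs is a list of word lists. Returns one {word: weight} dict per doc."""
    n_docs = len(docs)
    doc_freq = Counter()
    for words in docs:
        doc_freq.update(set(words))  # count each word once per document
    vectors = []
    for words in docs:
        term_freq = Counter(words)
        vectors.append({
            w: (count / len(words)) * math.log(n_docs / doc_freq[w])
            for w, count in term_freq.items()
        })
    return vectors
```

A word that appears in every document gets a weight of zero, so common words stop dominating the feature vector, while words concentrated in a few documents get larger weights.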

The talk for the most part focused on supervised learning (using a training step). When asked about unsupervised learning, Nelken mentioned using it to separate a larger class into several smaller classes.  Seeing the documents self-organize based on TFIDF feature vectors helps highlight emergent categories within a corpus. When confronting an unfamiliar corpus, I imagine this could be a useful technique in the first steps of developing the ontology.

On the whole it was good to see the simplicity of using Mahout, as well as hear of its effectiveness in production.

edit 5/2/12: Here are a couple of references Rani sent over: Rank Aggregation Methods for the Web and Tackling the Poor Assumptions of Naive Bayes Text Classifiers.

Internet analytics for yourself

Wednesday, July 6th, 2011

I started playing with sqlite3 recently. It's a super easy database to learn and is used in both mobile and web apps. I just found that your Google Chrome history is also saved as a sqlite3 database.

For me it's under

/Users/<user name>/Library/Application Support/Google/Chrome/Default

You can get there using Terminal.  Open it up and type

cd "/Users/<user name>/Library/Application Support/Google/Chrome/Default"

substituting <user name> with your user name (the quotes are needed because the path contains a space). Type in

sqlite3 History

and then once you’re in the database, type .tables, and you can see all of the tables that are saved.

These include: downloads, presentation, urls, keyword_search_terms, segment_usage, visit_source, meta, segments, and visits.  To look into any of those, type ‘select * from <table>;’. For example, ‘select * from urls;’ shows the URL, the id, the time, the name, and the visit count.  You’ll have to quit Chrome to be able to access the database, otherwise it will say ‘this database is locked’ or something like that.
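The same queries can be run from a script. Here is a minimal sketch using Python's built-in sqlite3 module. The function names are mine, and the column names url, title and visit_count are assumptions about the urls table, so check them with .schema urls before relying on this.

```python
import sqlite3
from contextlib import closing

def list_tables(db_path):
    """Return the names of the tables in the database, like .tables does."""
    with closing(sqlite3.connect(db_path)) as conn:
        rows = conn.execute(
            "SELECT name FROM sqlite_master WHERE type = 'table' ORDER BY name"
        ).fetchall()
    return [name for (name,) in rows]

def most_visited(db_path, limit=10):
    """Return (url, title, visit_count) rows, most visited first.
    Column names are assumed; verify them against your own History file."""
    with closing(sqlite3.connect(db_path)) as conn:
        return conn.execute(
            "SELECT url, title, visit_count FROM urls "
            "ORDER BY visit_count DESC LIMIT ?",
            (limit,),
        ).fetchall()
```

Calling most_visited with the path to the History file prints nothing by itself; loop over the returned rows to display them. Working on a copy of the History file should let you leave Chrome open while you explore.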

It's way cool. I love Chrome’s UI for looking at your history in the browser (History>>Show Full History).  It's pretty easy to navigate through, and you see your internet path throughout the day.  Now having the data allows you to make your own data tools to visualize and understand where you go on the web, when, and how often.  And you can easily export it.  I can see this being helpful in improving productivity.

For one, you can reflect on your day’s work and see where you were able to focus, and what distracted you.
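As a first step toward that kind of reflection, here is a small sketch that totals visits per site. It is only an illustration: the function name visits_by_domain is mine, and it assumes the urls table has url and visit_count columns.

```python
import sqlite3
from collections import Counter
from contextlib import closing
from urllib.parse import urlparse

def visits_by_domain(db_path):
    """Sum visit counts per domain, most visited first."""
    totals = Counter()
    with closing(sqlite3.connect(db_path)) as conn:
        # Assumed columns: url and visit_count in the urls table.
        for url, visit_count in conn.execute("SELECT url, visit_count FROM urls"):
            totals[urlparse(url).netloc] += visit_count
    return totals.most_common()
```

The result is a list of (domain, total visits) pairs, which makes it easy to see at a glance which sites took up most of your browsing.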