Blog

News Article Recommendations Using Bayesian Classification with Apache Mahout

Tuesday, May 1st, 2012

I swung by a talk today at the Boston Predictive Analytics Meetup on using Bayesian classification to recommend articles. The talk was given by Rani Nelken from Outbrain, the company responsible for the recommended articles in the footer of CNN or boston.com news articles.

Basically the process is to first construct a topic ontology: a set of classes, or categories, of news articles. This can certainly change over time, but changes require retraining the classifiers (more on that later). The method for calculating the most likely category is remarkably simple, built off of Bayes’ rule. For each category C they calculate the probability P(W|C), which is the number of occurrences of a word W in C divided by the total number of words in C. Then for each article they combine those per-word probabilities to calculate the probability that the article is in a category given the words in it. I was surprised at the simplicity of this feature selection (using all of the words rather than a subset), and when asked about it Nelken responded that feature selection is more useful when there are fewer classes. I imagine it also helps that the texts are several paragraphs long, so a feature vector constructed from term frequencies carries enough information to be useful.

A few details: to correct for rare words, add a pseudo count of one, and to prevent underflow (a problem when multiplying small values), take the logs of the probabilities and add them.
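To make the arithmetic concrete for myself, here is a minimal sketch in Python. It is not Mahout and not Outbrain's code. The toy training sentences, the function names, the whitespace tokenizer, and the category prior P(C) are all my own assumptions for illustration; the talk only covered P(W|C), the pseudo count of one, and adding logs.

```python
import math
from collections import Counter, defaultdict


def train(labeled_docs):
    """Count word occurrences per category from (category, text) pairs."""
    word_counts = defaultdict(Counter)  # category -> word -> count
    doc_counts = Counter()              # category -> number of training articles
    vocab = set()
    for category, text in labeled_docs:
        words = text.lower().split()
        word_counts[category].update(words)
        doc_counts[category] += 1
        vocab.update(words)
    return word_counts, doc_counts, vocab


def log_scores(text, word_counts, doc_counts, vocab):
    """Return log P(C) + sum of log P(W|C) for every category C."""
    total_docs = sum(doc_counts.values())
    scores = {}
    for category, counts in word_counts.items():
        total_words = sum(counts.values())
        # Assumed prior: fraction of training articles in this category.
        score = math.log(doc_counts[category] / total_docs)
        for word in text.lower().split():
            if word not in vocab:
                continue  # skip words never seen in training
            # Pseudo count of one, so a word unseen in this category is never zero.
            p = (counts[word] + 1) / (total_words + len(vocab))
            score += math.log(p)  # add logs instead of multiplying probabilities
        scores[category] = score
    return scores


def classify(text, model):
    """Pick the category with the highest log score."""
    scores = log_scores(text, *model)
    return max(scores, key=scores.get)


# Toy training data, invented for illustration.
model = train([
    ("sports", "the team won the game"),
    ("sports", "the coach praised the team"),
    ("politics", "the senate passed the bill"),
    ("politics", "the president signed the bill"),
])
print(classify("the team lost the game", model))
```

Dividing by the total words plus the vocabulary size keeps the smoothed P(W|C) values summing to one within a category, which is the usual way to apply a pseudo count of one.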

Initially he used Weka for performing the classification. It is fine for smaller corpora, but did not work so well in production. Mahout was much faster and scaled better. It can also be easily parallelized.

There was a brief mention of Complementary Naive Bayes, which was more successful for this type of classification. Digrams and trigrams could be used, though those increase computation time. Complementary Naive Bayes with digrams was the most accurate in the examples he presented.

Some difficulties in implementation include training the classifier. Tagging articles can be expensive; one option is to crowdsource, then verify with experts. An additional technique is to run several classifiers on a document and raise a flag when all of them diverge. This strategy came up several times during the talk: when discussing how to improve article tagging, the point was that it is better to have a diversity of classifiers than to keep improving a single one. This is exemplified by one of the classifiers Outbrain uses, which looks solely at the URL of a news article. While it performs poorly on its own, it is successful as part of an ensemble of classifiers.
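The "raise a flag when they diverge" idea can be sketched in a few lines of Python. This is only my reading of it: the function name `vote` and the rule "flag when no two classifiers agree" are assumptions, and the talk did not say how Outbrain actually combines its classifiers.

```python
from collections import Counter


def vote(predictions):
    """Combine one document's predicted labels from several classifiers.

    Returns (label, flagged). flagged is True when no two classifiers
    agree, in which case label is None and a human should take a look.
    """
    label, count = Counter(predictions).most_common(1)[0]
    if len(predictions) > 1 and count == 1:
        return None, True
    return label, False


print(vote(["sports", "sports", "politics"]))
print(vote(["sports", "politics", "tech"]))
```

A weak classifier, like one that only sees the URL, still earns its place here because it can break a tie or confirm a shaky guess from the others.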

Check out the Mahout demo on GitHub. It uses several components, including a feature extractor (unigrams), a TFIDF driver (for down-weighting common words) and a Bayes sum driver to add up all of the probabilities.
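For a feel of what a TFIDF step does, here is a small Python sketch using the textbook weighting tf * log(N / df). I am assuming that formula for illustration; Mahout's driver may weight and normalize differently, and the function name and sample sentences are made up.

```python
import math
from collections import Counter


def tfidf(docs):
    """Return one {word: weight} dict per document, using tf * log(N / df)."""
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)
    df = Counter()  # word -> number of documents containing it
    for words in tokenized:
        df.update(set(words))
    vectors = []
    for words in tokenized:
        tf = Counter(words)  # word -> occurrences in this document
        vectors.append({w: tf[w] * math.log(n_docs / df[w]) for w in tf})
    return vectors


vectors = tfidf(["the team won", "the bill passed"])
print(vectors[0])
```

A word like "the" that appears in every document gets a weight of zero, while words that are distinctive to one document keep a positive weight.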

The talk for the most part focused on supervised learning (using a training step), and when asked about unsupervised learning Nelken mentioned using that technique when separating a larger class into several smaller classes. Seeing the documents self-organize based on TFIDF feature vectors helps highlight emergent categories within a corpus. When confronting a new, unfamiliar corpus, I imagine this could be a useful technique in the first steps of developing the ontology.

On the whole it was good to see the simplicity of using Mahout, as well as hear of its effectiveness in production.

edit 5/2/12: Here are a couple of references Rani sent over: “Rank Aggregation Methods for the Web” and “Tackling the Poor Assumptions of Naive Bayes Text Classifiers”.

the growth of creative coworking spaces around cambridge and somerville

Monday, April 30th, 2012

Something incredible is brewing in Somerville and Cambridge, and it has been emerging over the last couple of years: the growth of spaces that encourage building, making, hanging out, performance, art, coworking and tons more. I spend a lot of time between these spaces, initially because I wanted places where I could learn socially and collaborate on projects. This past weekend I found a few more, and hopping from space to space, I really began to feel the magnitude, the spread, of this movement.

When I first arrived in Boston, I was so excited to find Sprout by Davis Square: a place where people could go and make things and learn from each other, a place to cultivate a community. That was two years ago, and a lot has happened. That spirit has grown and spread to several other places, to the extent that it almost feels like the most effective use of space in a city is just that: something between a coworking space, performance venue, and community center.

I was at the Artisan’s Asylum earlier today and folks from Popular Mechanics were there snapping photos and interviewing. What’s happening there is pretty game-changing. Check out a couple of interviews from the space. There are not only artists and makers there, there are businesses that are growing. The city of Somerville is pretty excited about the impact the Asylum has had; it has brought a lot of jobs to the city. It’s probably the biggest maker/hackerspace in the world, with tons of affordable spaces and tools, setting an example many can get inspired by. I find myself there a ton making art as New American Public Art, most recently preparing an interactive magnetic sculpture for Somerville Open Studios.

Last Saturday I swung by East Meets West Bookstore in Central for Make-It market. I had never been in the space and was amazed to find it was so much more than a bookstore. It is a performance space, and it is now expanding to house several businesses upstairs, like Modk.it, and also a fab lab: a place with tools like laser cutters or CNC machines that people can come in and use. It occurred to me that a place like this is so well utilized; it is active both day and night.

Later in the day I found myself in [] x [] , a co-working space housing Zeega and goodgood, currently featuring work by a longtime collaborator Joey Asal.  This is one of the newest coworking creative spaces.

Friday I made it to Industry Lab for the first time. I had been looking forward to it for a while, a new coworking space in Inman housing companies including Leaf Labs.  There was an art reception and screening of a couple short films.  The path to the reception in the artist studios led through several rooms of smaller companies- impressive to see how many were there.  From the turnout and energy at the event, there will definitely be more to come.

And beyond that, there is the p.irateship in Somerville, blocks away from the Artisan’s Asylum, where the Public Lab in Cambridge is based. I’ve gone there for synth nights, exploring electronic music and synth building, and every few months there is a synth-in yielding tons of awesome projects.

There is Fringe Union, a coworking space for creative professionals. I don’t find myself there too often (they throw an awesome winter party), but it is inspiring to hear of products like the Cuppow, which was created by a collaboration of in-house talent.

An energy exists in these places: people who are really creating new things, from artistic to business pursuits. There is so much I learn from just being in these places and being around these conversations, from knowing who to ask about electronics, or synth building, or understanding interaction design, to talking about building products, manufacturing and the business side of things. The learning is all intertwined. It’s not like reading articles or a book; it’s something between conversation and play, and it can come in bits or extended sessions.

The growth of these coworking creative spaces is one of the things I love most about being here.

keep using emoticons, they help me train my classifier

Sunday, April 29th, 2012

Classifying sentiment is a popular topic in natural language processing research, and is also a valuable tool in industry for understanding what groups think on a broader scale. One could read text and determine if it is positive or negative, but for larger corpora this becomes impractical. This is where NLP and machine learning come in handy.

For a good overview of the value of this type of research, check out this O’Reilly Strata talk on the future of NLP.

I tried an approach using Sentiword and the movie reviews corpus in NLTK to classify Twitter sentiment. This did not work. Perhaps it would work for larger bits of text, but a tweet is too short, and the way language is used is very different between tweets and movie reviews.

Soon I found recent work that used emoticons to classify tweets.  A tweet is labeled positive if it has a happy emoticon like : ) or : D, and negative if it has a sad emoticon.  Here is one paper using this label method with several classifiers (Twitter Sentiment Classification using Distant Supervision; Go, Bhayani, Huang), and another that furthers that research to account for neutral tweets (Twitter as a Corpus for Sentiment Analysis and Opinion Mining; Pak, Paroubek).

What’s kind of cool about this method is that it can work on any language where the emoticons are the same, generally : ) for happy and : ( for sad. The method is called distant supervision with noisy labels, since the emoticons are not a completely accurate labeler, but with a ton of training data, which is easy to get with Twitter, accuracy can exceed 80% (see papers for details).
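The labeling step itself is tiny. Here is a rough Python sketch of how I think about it; the emoticon lists, the function names, and the choice to drop tweets with mixed emoticons are my own assumptions, not the exact rules from either paper.

```python
# Assumed emoticon lists; extend them for your own data.
HAPPY = (":)", ":-)", ":D", ":-D")
SAD = (":(", ":-(")


def label_tweet(text):
    """Return 'positive', 'negative', or None, based on emoticons only."""
    has_happy = any(e in text for e in HAPPY)
    has_sad = any(e in text for e in SAD)
    if has_happy == has_sad:
        return None  # no emoticon, or mixed signals: keep it out of the training set
    return "positive" if has_happy else "negative"


def strip_emoticons(text):
    """Remove the emoticons so a classifier cannot just memorize the label."""
    for e in HAPPY + SAD:
        text = text.replace(e, "")
    return " ".join(text.split())


for tweet in ["great show tonight :)", "missed my bus :(", "just landed"]:
    print(label_tweet(tweet), "|", strip_emoticons(tweet))
```

The labeled, stripped tweets then become training data for whatever classifier you like; the noise in the labels is what makes this "distant" rather than hand-annotated supervision.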

It’s kind of funny- at some point using emoticons may have felt juvenile, and perhaps it signified the degradation of language. Now, in this context, emoticons are providing a way to machine learn sentiment, allowing for an understanding of what larger populations of people are thinking.  So keep using emoticons, they are valuable- they help me train my classifier.

Unlocking Media on the Internet: Playing with WGET

Thursday, April 26th, 2012

When Barcamp Boston rolled around earlier this month, I took the opportunity to give a talk on something I’ve recently been getting into: wget, and specifically getting media I want with it.

As I planned the talk, its structure evolved into a mini-workshop/talk. For a while I thought about all the awesome tech talks I had been to, and thought, well, some of these are smaller; it wouldn’t hurt to get laptops out and take the first steps to make something, say start a Node server for the first time, or throw some data into a CouchDB. Usually it is the first steps that are the most difficult, so why not do that together?

I chose one of the smaller rooms, with the idea that we could face each other and it would be a discussion. There was a range of attendees, from systems programmers who use wget on a daily basis to folks who hadn’t touched the command line.

For the most part I wanted to show that a lot of the media that shows up in our browser is something we can have, something we can download.

I presented my talk as a txt file, in Vim zoomed in a whole lot. I thought this was pretty good because you could see what line I was on, and what percentage of the way through I was.

My interest in wget grew when I started to use command line tools a lot more in my research work as a scientist, using tools like cat, sort, uniq, wc, grep, quick ways to get a handle on data.

I really love media and music. Then I was shown how to download videos from the Activity monitor in Safari. The same URL can be used with wget. Awesome.

So first I downloaded a video off Facebook, using the method described in this post. Facebook is a place where all of your precious media, photos and movies, can feel locked down.

Then I described how I used wget -O - url to pull down a web page, which you can pipe into grep for scraping. I pulled all of the links to images on the front page of this photographer I recently found, Nicholas Alan Cope, and downloaded each with wget. It would take a while to do it individually, but with wget, grep and this awesome online regex tool, the process was faster and I learned something awesome. Basically pipe stdout to

grep -o 'http://[a-z0-9./A-Z]*_VEDAS[0-9a-b_]*\.jpg' > image_links.txt

to get the image links. That huge regex was found by pasting the page source into the online regex tool and iterating on a regular expression until one matched all the links. This process can be used to pull links or images from most pages. With a txt file that has a link on each line, you can use wget -i image_links.txt to download from each URL.
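If grep regexes feel cramped, the same link-scraping step can be sketched in Python. This is a hedged example: the pattern is an assumed, looser stand-in for the site-specific one above, the function name is made up, and the sample page source is invented so the snippet runs without touching the network.

```python
import re

# Assumed pattern: any absolute http(s) URL ending in .jpg.
JPG_LINK = re.compile(r"https?://[A-Za-z0-9./_-]+\.jpg")


def extract_image_links(html):
    """Return the .jpg URLs found in html, in order, without duplicates."""
    links = []
    for url in JPG_LINK.findall(html):
        if url not in links:
            links.append(url)
    return links


# Made-up page source standing in for the output of wget -O - url.
sample = """
<a href="http://example.com/photos/one.jpg"><img src="http://example.com/thumbs/one.jpg"></a>
<a href="http://example.com/photos/two.jpg">two</a>
<a href="http://example.com/photos/one.jpg">again</a>
"""

# One link per line, the format wget -i expects.
print("\n".join(extract_image_links(sample)))
```

Redirecting that output to a file gives you the same kind of list that wget -i reads.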

Finally I showed how to download any track off of SoundCloud, using a process similar to the Facebook video download. Find the streaming URL by looking it up in the Safari Activity Monitor, or under Resources in the Chrome developer tools.

The response was wonderful. People were intrigued; even those who used wget regularly didn’t know you could supply it with a streaming URL. A few asked if this was legal, and another participant responded that it was fine to do this, that there is nothing wrong since you’re only downloading what was already downloaded by your browser, and that it is ridiculous to think this is illegal. A lively political discussion almost began, but the Barcamp closing sessions were starting.

I got through the material, did several live demonstrations, shared my excitement on something seemingly niche, and inspired a lively conversation.  The talk felt pretty great.

 

 

kibou by Kami Kapnobatai

Sunday, April 15th, 2012

this track is beautiful

The layered samples relax, and the heavy ducking gives it an edge. I heard it a few months ago and just came back to it.

researching leadership tips and idea cultivation

Wednesday, April 11th, 2012

I decided to get back into reading up on high-level leadership topics, as a complement to the low-level technical stuff like learning languages and techniques, to assist in idea formation, inspiration and development. I did this a lot a few years ago, then things felt too repetitive (design thinking, iterative design, etc.) so I stopped.

Here are some sources I am excited to get into, or return to.

http://startmaking.com/ Start Making is a series of courses, one on one style, about different facets of entrepreneurship.

http://ecorner.stanford.edu/  E Corner at Stanford is a series of talks by entrepreneurs who come speak to the Stanford Tech Ventures class.  I listened to these years ago and they gave me a good feel of trends in companies and leadership practices.

http://prote.in/profiles/ Protein Profiles looks to be a series of talks by creatives or technologists about creativity.

Today I checked out this article on leadership and Apple (there have been tons). I’ll apply its own advice to it, simplifying and coarsening it to a few main points, which are deeply tied together. Focus. Write down everything you need to do and pick the top four. Focus your efforts, and focus what you are making on what is most important. If you apply this to products, you get the next one: Simplify. Find the essence of what you are creating and dive into that. This sounds very Minimum Viable Product-ish. Take that essence and cultivate it, and don’t be distracted by profits, i.e. Product over Profits. Jobs described how companies lose their growth and their connection to users when business managers take over and shift the focus to profits. I think the final tip of the article is best for the very seeds of ideas: Stay Hungry, Stay Foolish. Look at the world with an incredible idea of what is possible, knowing you can do it. Dream. Make. IDEO describes that idea in their innovation process in part through the observer, or anthropologist, traits: observing, watching, being attentive to what could be made better. When you have this mindset, opportunities expose themselves.

video: Urban Bloom on the Marquee

Wednesday, March 28th, 2012

A beautiful video clip by Laura Bradford of my piece Urban Bloom. It is pretty high quality so you can watch it full screen.

video: thought follows action, from TEDxSomerville

Sunday, March 25th, 2012

I finished editing the video of thought follows action, the interactive public art we made for TEDx Somerville. Here it is:

 

open call for WHIRL at the together festival

Sunday, March 25th, 2012


**OPEN CALL for INTERACTIVE (SOUND) ART to be showcased at WHIRL**


Hey all!


Joey and I will be hosting a WHIRL event at Machine on Saturday, April 7th. The event will be free, and part of the Together Festival of electronic music. It will be a mix of awesome interactive art and meetings of cool people. We want to showcase some of the sweet interactive art around Cambridge, Boston, and Somerville, with a focus on interactive sound art, like musical instruments you’ve made or wearable sound stuff, but we’re also pumped about cool projection visualizations that respond to you, because those are fun to dance with.


We’ll be in the pool room of Machine, which is perfect since it has tons of tables to set things up on.  If you’ve been working on a project you want to prototype, or you have something finished you’d like to see in action again, definitely contact us, we’d love to present your work.


sweet examples from the past:


*umbrella with microphones that sample and headphones that play the samples all around you
*kinect dubstep, and crazy funhouse kinect
*Awesome interactive computer vision stuff
*LED grids that respond to your breath
*Arduino synths


Rough schedule:


6 to 8 : the art/instruments will be showcased around Machine.
8 to 9 : local electronic act performs (merging with the interactive art experience)


More info: WHIRL is a multimedia event to quench our desire for beautiful works of art in any medium and ecstatic electronic experiences produced by local artists. It’s something we can do together. Contact whichlight at g mail dot com if you have any questions or would like to help out. thanks <3!


spread the word! to your buddies who play with sound stuff, dig heavy dance beats, or like high-fiving local electro musicians and artists. Again, shoot us a message if you would like to present your work.


KAWAN!


Check out the Together blog post about WHIRL

bringing public art to TEDx somerville

Saturday, March 3rd, 2012

[image: the spheres]

Recently, I have been working with Dan and Bevan to build an interactive public art piece, Thought Follows Action, to be displayed in front of the Armory at TEDx Somerville.

We’re bringing public art to TEDx, and are excited to share it with you. TEDx has been an inspiring initiative: taking the energizing ideas and format of TED and decentralizing them to be locally focused and charge up communities.

At New American Public Art, we love the idea that Public Space can be used as a medium of conversation and discourse through interactive art.

This piece is PUBLIC. So even if you don’t have a pass to the event, you are still invited to come explore the work, tell us your thoughts, and even better, tell us what you’re excited to see grow around Somerville.

The piece will be right outside the Somerville Armory, 191 Highland Avenue, Somerville, MA, on March 4th, 2012, from 10am to 6:30 pm.

What excites me in particular about this piece is that it provides an interface between conference attendees and anyone in the public who wishes to participate, regardless of whether or not they have a ticket, opening up the discourse of the conference by making it more accessible.  It is in the public space, so anyone can go to it.

I love thinking about the intersection of art and community, in particular how art can be used to grow a community. In this case, we’re creating a piece at the entrance of the event, inviting attendees, people passing by, and those who have heard of the piece to come play with it. This is something people can interact with together, something that can generate conversations and open up a platform to share other ideas and projects.

For more information: Facebook event page, new american public art, tumblr page.