scraping

A quick exercise in web scraping Bandcamp with JavaScript and Bash

Sunday, January 8th, 2012

Here is a quick exercise to explore web scraping on Bandcamp. We’ll get our hands dirty with some simple work in the browser console and a bit of Bash scripting. This was written in January 2012, so things may have changed by the time you read this.

example album page

If you take a look at the page source on Bandcamp, you’ll notice that all of the music data is located in a JavaScript variable. Your browser should have a ‘view-source’ option. In Chrome on the Mac it is command-option-u.

 

Notice that in the comment Bandcamp includes information on their terms of service and their stance on applications that pull music off the site. You should probably read those, and remember this is only meant as an exercise. Looking at the FAQ, I imagine they would be okay with this tutorial. (If not, message me and I’ll take this down.)

So open up your web console.  I’m using Chrome, so it is command-option-i.  You can then explore the object TralbumData. You can retrieve the track information with TralbumData['trackinfo'].

We’re going to pull the titles and the files.

First get the titles. You can use `console.log(str)` to print things to the console. We’ll write a loop over all of the track objects and print out the ‘title’ value for each.

for (var i in TralbumData['trackinfo']) { console.log(TralbumData['trackinfo'][i]['title']); }

Now copy the output and paste it into a text file called titles.

Then we need to get the URLs for the files. We can use the same loop, replacing ‘title’ with ‘file’:

for (var i in TralbumData['trackinfo']) { console.log(TralbumData['trackinfo'][i]['file']); }
That will give you a list of URLs. Put that in a file called playlist.

Now in Bash we’ll download the file at each URL to the matching title.mp3 file. It took some googling to figure this out, since I don’t normally do any Bash scripting beyond simple command line tools.

Open up Terminal. First we’ll create an array called names and put each line of titles into an entry in names. To do this we also have to change the Internal Field Separator (IFS), so that when we loop over the output of cat titles we get a new item for each line rather than for each word. You can do that with this command:
IFS='
'

The second quote is on a new line, which defines the IFS as a newline (IFS='\n' doesn’t work [source]; in Bash, IFS=$'\n' should be an equivalent one-line form). Then run the loop to fill the names array.

j=0; for i in $(cat titles); do names[$j]="$i"; j=$((j + 1)); done

names now contains the track titles. To get the value at index 0, you’d use ${names[0]}. We’ll use that for the output file name. This time we’ll loop over the entries in playlist and use wget to download the file at each URL.

j=0; for i in $(cat playlist); do wget -O "${names[$j]}.mp3" "$i"; j=$((j + 1)); done

The -O command line switch for wget specifies that the following argument will be the name of the output file.
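
For comparison, here is a minimal Python sketch of the same pairing step: line N of titles becomes the output file name for line N of playlist. The function names pair_tracks and download_all and the example.com URLs are made up for illustration; this is not what I ran, just the same idea in another language.

```python
from urllib.request import urlretrieve


def pair_tracks(titles_text, playlist_text):
    """Pair each title line with the URL on the same line number."""
    titles = [line.strip() for line in titles_text.splitlines() if line.strip()]
    urls = [line.strip() for line in playlist_text.splitlines() if line.strip()]
    if len(titles) != len(urls):
        raise ValueError("titles and playlist must have the same number of lines")
    # Same naming rule as the Bash loop: <title>.mp3
    return [(title + ".mp3", url) for title, url in zip(titles, urls)]


def download_all(pairs):
    """Fetch each URL into its file name (the equivalent of wget -O)."""
    for filename, url in pairs:
        urlretrieve(url, filename)


# Made-up sample data standing in for the contents of titles and playlist.
pairs = pair_tracks(
    "First Song\nSecond Song\n",
    "http://example.com/a\nhttp://example.com/b\n",
)
print(pairs)
```

To actually download, you would read the two files and pass the result of pair_tracks to download_all. Titles containing a slash would need cleaning up first, which this sketch does not attempt.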

 

wow this wget command is pretty ill

Friday, December 9th, 2011

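I found this command today when I was looking for a way to download a bunch of files from a web page quickly. Basically it visits every link on the page and downloads what it finds, following links one level deep: -r turns on recursion, -l1 limits the depth to one level, -H lets it follow links to other hosts, and -t1 tries each download only once. More details on it here.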

wget -r -l1 -H -t1 <url>

So for instance you can run that on a page of your favorite Tumblr and get all the images and text posted. I used it twice right after learning it (once to download files, again to download images), which makes it feel quite powerful.

Using Scraperwiki to pull data from the Access to Medicine Index

Friday, June 24th, 2011

I made my first scraper on scraperwiki! I met Francis, the founder, a few days ago. He has an incredible wealth of information regarding open data, the open web, and all sorts of awesome tech organizations and projects. He began doing this stuff before there were words to describe it, so he is something of a pioneer.

Scraperwiki takes code with one of the shortest lifespans (‘scrapers’ that pull data off web pages) and puts it in a wiki format to allow updates, collaboration, and discoverability. Basically it is a great architecture to help structure data on the web.

And so when I got home, I totally had to write a scraper. I chose the Access to Medicine Index. It’s not too big, but it fits well with the mission of The Open Finance Initiative.

We want to collect as much information and data as we can regarding corporate sustainability, with the intention of supporting an open ecosystem of tools that individual investors can use to invest in good companies. (If you know any good datasets, send me a message.)

So here’s a start: the Access to Medicine Index on scraperwiki.

Here is the code that I used to build it:

Notice some of the labels are Dutch! (e.g. ‘naam’) I didn’t notice when I originally went on the site, but it’s based in the Netherlands. Kinda cool how this shows up more clearly in the website source code.

I’ll point out a few things that were super helpful once I figured them out. First, the scraperwiki tutorial is a great way to get started. They use the library ‘lxml.html’.

This reference for the library helped with the details and with figuring out which methods to use.

So the odd thing about this dataset is that the numbers were tucked into the title attribute of div tags:

<div style="width: 13px" title="Management: 0.44"></div>
<div style="width: 10px" title="Influence: 0.33"></div>
<div style="width: 28px" title="R&D: 0.94"></div>
<div style="width: 22px" title="Pricing: 0.74"></div>
<div style="width: 14px" title="Patenting: 0.48"></div>
<div style="width: 12px" title="Capability: 0.39"></div>
<div style="width: 13px" title="Philanthropy: 0.43"></div>
The names of the companies were easy to get to, but to get at these numbers I traversed to the right <div> and used a regular expression on the <div> as a string.
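
To illustrate the traversal idea without needing lxml installed, here is a small sketch that uses Python’s standard-library html.parser instead (a swapped-in technique, not the scraper’s actual code). It collects the title attribute of every div it passes. The class name TitleCollector and the sample markup are made up for illustration.

```python
from html.parser import HTMLParser


class TitleCollector(HTMLParser):
    """Collect the title attribute of every <div> that has one."""

    def __init__(self):
        super().__init__()
        self.titles = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs for the opening tag.
        if tag == "div":
            title = dict(attrs).get("title")
            if title:
                self.titles.append(title)


# Sample markup modeled on the divs shown above.
sample_html = (
    '<td><div style="width: 13px" title="Management: 0.44"></div>'
    '<div style="width: 10px" title="Influence: 0.33"></div></td>'
)
collector = TitleCollector()
collector.feed(sample_html)
print(collector.titles)
```

Each collected string still has the label and the number glued together, so it needs one more parsing step to separate them.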

The method ‘cssselect’ returns a list of the elements underneath a node that match the CSS selector you pass in:

parent.cssselect('childtag')

So for

table = root.cssselect("tbody")[0]

you get the first element under root that has the tag “tbody”. ‘find_class’ is pretty useful too: it returns every element within a node that has the given class name.

This next part is pretty cool: if you have a list of some DOM elements (so ‘tr’ or ‘div’), you can set up a loop to go through the child elements. That’s basically what I do for each td element: I go through the divs in it with a loop, then I convert each div to a string with ‘tostring’ to do some precise scraping, as follows:

string = lxml.html.tostring(d)

At that point, since it’s a string, I can apply some regular expression methods to pull out what I want.
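
As a rough sketch of that last step, here is one way to pull the label and the number out of a div string with Python’s re module. The pattern and the helper name parse_score are my own assumptions based on the sample markup above, not the exact regular expression from the scraper. Note that the serialized string may contain R&amp;D rather than R&D, and that in Python 3 lxml.html.tostring returns bytes unless you pass encoding='unicode', so the input here is assumed to already be a text string.

```python
import re
from html import unescape

# Matches title="Label: 0.44" inside a serialized <div>.
# Assumed pattern, derived from the sample markup only.
TITLE_RE = re.compile(r'title="([^":]+):\s*([0-9.]+)"')


def parse_score(div_html):
    """Return (label, score) from a div string, or None if nothing matches."""
    match = TITLE_RE.search(div_html)
    if match is None:
        return None
    label = unescape(match.group(1))  # turns R&amp;D back into R&D
    return label, float(match.group(2))


sample = '<div style="width: 28px" title="R&amp;D: 0.94"></div>'
print(parse_score(sample))
```

Returning None for a div without a matching title makes it easy to skip spacer or layout divs inside the loop.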

The scraperwiki module is specific to coding in the scraperwiki cloud. You can use it to save your data into a table, among other things I have yet to explore. I’m excited about using it for OFI to collaboratively collect and maintain data for open use. If this sounds cool to you, hop on the GitHub or the Mozilla Drumbeat page for OFI.