Here is a quick exercise to explore web scraping in bash on Bandcamp. We’ll get our hands dirty with some simple console and bash scripting. This was written in January 2012, so things may have changed by the time you read this.
If you take a look at the page source on Bandcamp, you’ll notice that all of the music data is located in a JavaScript variable. Your browser should have a ‘view-source’ option. In Chrome on the Mac it is command-option-u.
Notice that in the comment Bandcamp includes information on their terms of service and their stance on applications that pull music off the site. You should probably read those, and remember this is only meant as an exercise. Looking at the FAQ, I imagine they would be okay with this tutorial. (If not, message me and I’ll take this down.)
So open up your web console. I’m using Chrome, so it is command-option-i. You can then explore the object TralbumData. You can retrieve the track information with TralbumData['trackinfo'].
We’re going to pull the titles and the files.
First get the titles. You can use `console.log(str)` to print things to the console. We’ll write a loop over all of the track objects and print the ‘title’ value for each.
for (i in TralbumData['trackinfo']){console.log(TralbumData['trackinfo'][i]['title']);}
Now copy the printed titles and paste them into a text file called titles, one title per line.
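Then we need to get the URLs for the files. We can use the same loop, replacing ‘title’ with ‘file’.
for (i in TralbumData['trackinfo']){console.log(TralbumData['trackinfo'][i]['file']);}
Copy that output into a second text file called playlist, one URL per line, in the same order as the titles. Then, in a bash shell in the directory holding both files, set IFS to a newline so that titles containing spaces are not split apart, read the titles into an array, and download each URL under the matching title: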
IFS='
'
j=0; for i in $(cat titles); do names[$j]=$i; j=$((j + 1)); done
j=0; for i in $(cat playlist); do wget -O "${names[$j]}.mp3" "$i"; j=$((j + 1)); done
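The IFS assignment in the commands above is what keeps a title like "First Song" together as one array entry. Here is a small sketch of the difference. The helper count_fields and the two sample titles are made up for illustration and are not part of the original commands.

```shell
# Two sample titles, one per line (illustrative data only).
titles='First Song
Second Song'

# Hypothetical helper: print how many fields word splitting yields for $1.
count_fields() {
  set -f        # turn off filename globbing while we split
  set -- $1     # deliberately unquoted so the shell splits on IFS
  set +f
  echo "$#"
}

# With the default IFS, spaces split fields too: 4 fields.
default_count=$(count_fields "$titles")

# With IFS set to a newline only, each line is one field: 2 fields.
newline_count=$(
  IFS='
'
  count_fields "$titles"
)

echo "default IFS: $default_count fields"
echo "newline IFS: $newline_count fields"
```

Because the IFS change here happens inside a command substitution, it does not leak into the rest of the shell session, unlike the plain IFS assignment used in the commands above.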
The -O command line switch for wget specifies that the following argument will be the name of the output file.
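As an alternative to the array and counter, the two files can be read side by side. This is only a sketch: the function name pair_tracks is hypothetical, it assumes titles and playlist have the same number of lines in the same order, and it only prints the name and URL pairs rather than downloading anything.

```shell
# Hypothetical helper: pair line N of the titles file with line N of the
# URL file and print "output-name<TAB>url" for each track.
# Assumes both files end with a newline; a final line without one is skipped.
pair_tracks() {
  titles_file=$1
  urls_file=$2
  # Read both files in step, one on file descriptor 3 and one on 4.
  while IFS= read -r title <&3 && IFS= read -r url <&4; do
    printf '%s.mp3\t%s\n' "$title" "$url"
  done 3<"$titles_file" 4<"$urls_file"
}
```

To actually download, the output could be piped into a loop such as `pair_tracks titles playlist | while IFS=$'\t' read -r name url; do wget -O "$name" "$url"; done` in bash. Printing first makes it easy to check that each title lines up with the right URL before anything is fetched.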



