wget

Unlocking Media on the Internet: Playing with WGET

Thursday, April 26th, 2012

When Barcamp Boston rolled around earlier this month, I took the opportunity to give a talk on something I’ve recently been getting into: wget, and specifically using it to get the media I want.

As I planned the talk, its structure evolved into a mini-workshop/talk. For a while I thought about all the awesome tech talks I had been to, and thought, well some of these are smaller, it wouldn’t hurt to get laptops out and take the first steps to make something, say start a node server for the first time, or throw some data into a couchDB. Usually it is the first steps that are the most difficult, so why not do that together?

I chose one of the smaller rooms, with the idea that we could face each other and it would be a discussion. There was a range of attendees, from systems programmers who use wget on a daily basis to folks who had never touched the command line.

For the most part I wanted to show that a lot of the media that shows up in our browsers is media we can have: we can download it.

I presented my talk as a txt file in Vim, zoomed in a whole lot. I thought this worked pretty well because you could see what line I was on and what percentage of the way through I was.

My interest in wget grew when I started to use command line tools a lot more in my research work as a scientist: tools like cat, sort, uniq, wc and grep, which are quick ways to get a handle on data.

I really love media and music. Then I was shown how to download videos from the Activity monitor in Safari. Similarly the same url can be used with wget. Awesome.

So first I downloaded a video off facebook, using the method described in this post.  Facebook, where all of your precious media- photos and movies- can feel locked down.

Then I described how I used wget -O - url to pull down a web page (the lone - tells wget to write the page to stdout), which you can pipe into grep for scraping. I pulled all of the links to images on the front page of this photographer I recently found, Nicholas Alan Cope, and downloaded each with wget. It would take a while to do it individually, but with wget, grep and this awesome online regex tool, the process was faster and I learned something awesome. Basically pipe stdout to

grep -o 'http://[a-z0-9./A-Z]*_VEDAS[0-9a-b_]*\.jpg' > image_links.txt

to get the image links. That huge regex was found by pasting the page source into the online regex tool and iterating on a regular expression until it matched all the links. This process can work to pull links or images from most pages. With a txt file that has one link per line, you can use wget -i image_links.txt to download from each url.
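The scraping step can be tried offline before pointing it at a real site. This is only a minimal sketch: the page below is a made-up stand-in with example.com URLs, and the temporary directory is an assumption for the demo. The regex is the one from the talk.

```shell
# Scratch directory and a made-up page source to scrape.
tmpdir=$(mktemp -d)
cat > "$tmpdir/page.html" <<'EOF'
<html><body>
<img src="http://example.com/images/01_VEDAS_a1.jpg">
<a href="http://example.com/about.html">About</a>
<img src="http://example.com/images/02_VEDAS_b2.jpg">
</body></html>
EOF

# In the real workflow the page would come from: wget -O - "$page_url"
# grep -o prints only the part of each line that matches the pattern.
cat "$tmpdir/page.html" \
  | grep -o 'http://[a-z0-9./A-Z]*_VEDAS[0-9a-b_]*\.jpg' \
  > "$tmpdir/image_links.txt"

# One image URL per line, ready for: wget -i "$tmpdir/image_links.txt"
cat "$tmpdir/image_links.txt"
```

Only the two image URLs end up in image_links.txt; the about.html link is skipped because it does not match the pattern.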

Finally I showed how to download any track off of soundcloud, using a process similar to the facebook video download.  Find the streaming URL by looking up the url in the Safari Activity Monitor, or Resources under the Chrome developer tools.

The response was wonderful. People were intrigued; even those who used wget regularly didn’t know you could supply it with a streaming url. A few asked if this was legal, and another participant responded that it was fine to do this: there is nothing wrong, you’re only downloading what was already downloaded by your browser, and it is ridiculous to think this is illegal. A lively political discussion almost began, but the Barcamp closing sessions were starting.

I got through the material, did several live demonstrations, shared my excitement on something seemingly niche, and inspired a lively conversation.  The talk felt pretty great.

 

 

how to download a facebook video using wget

Saturday, February 25th, 2012

In this post I will show you how to go about downloading a video from facebook.  It won’t involve putting the URL in some video downloader.  You will use wget and the browser to download the video.

Background: 

A few weeks ago I wrote a post on getting tracks from bandcamp as a learning exercise for exploring a web page with the javascript console.

This post will follow along similar lines, but in this case it also highlights the value of learning a little bit of programming. Currently there is no option to download a video posted on facebook. You can download a photo, but not a video you are tagged in. This became important to me recently because I wanted to pull some videos people took of some artwork I made. Knowing a little bit of code helped me figure out how the video is being displayed and eventually download it. I think it is kind of a political thing too. The video provides some value for facebook, and I don’t know why they would make it difficult to download, but I do know that it confines personal data my friends have shared with me that I want to liberate and use for myself.

And I also think this is a pretty awesome example of something cool you can do knowing a little code, or rather, a bit about the Terminal and the Browser.

Now get that video

I’m going to be doing this using the Chrome browser and Terminal for mac.  You should be able to do this on Windows or Linux machines, or with Firefox or Safari.

Open up Terminal.  You can find it under /Applications/Utilities/Terminal.app

Get wget.  I use Homebrew, so I just wrote

brew install wget

You can also download it online.

Now open up your browser and go to the video you want.  If you right click it to download, you’ll get something like this.

So you can’t get to it that way.  Pause the video.  We’re going to play it again, but now find the stream url for the video.  This is the url sent to your browser which contains the video.

Instead we’ll observe the network traffic. In Chrome open up the developer tools. It is at View -> Developer -> Developer Tools, or just command-option-i. Click the network tab. Push play on the video. You should see a lot of activity. Find the one that ends in “.mp4”. For me it is at the top. In Safari you can open up the Activity Monitor, under Window -> Activity. You are looking for the URL that begins with “http://video.ak.fbcdn.net/…”


This URL is the one that contains the video. Right click it, and copy the link address. It is a really long address. Move over to Terminal. Now we’ll download it. Type the following, replacing the placeholders with the appropriate content. movie_title is whatever you want to save the movie file as, and paste-url-here is the URL we found above, which should be wrapped in quotes.

wget -O movie_title "paste-url-here"

Hit enter and you’ll see the movie being downloaded.
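The quotes around the URL matter because stream URLs usually carry a query string, and an unquoted & would make the shell cut the command short and run it in the background. As a minimal sketch, assuming a made-up URL (the real one from the network tab is much longer), the output file name can also be derived from the URL itself instead of typed by hand:

```shell
# Made-up stream URL standing in for the one copied from the network tab.
url='http://video.example.com/clips/12345_n.mp4?oh=abc123&oe=def456'

# Strip everything up to the last slash, then drop the query string.
file_name=${url##*/}
file_name=${file_name%%\?*}
echo "$file_name"

# Print the command rather than running it, so the sketch works offline.
echo "wget -O \"$file_name\" \"$url\""
```

Running the printed command performs the download and saves the video under the name taken from the URL.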

And that’s it! The movie is yours.

I searched a bit online to find a nice tool, but some required the videos to be public, and others seemed a bit clunky. This way was a bit fun because I could dig into the facebook code a little bit, and after I finished I realized it felt really satisfying. While it wasn’t necessarily programming, it was getting something I wanted with some tools I regularly use while programming on the web, and I figured it was a great way to get people interested in command line tools or in looking under the hood of internet traffic in the browser.

 

A quick exercise web scraping with javascript and bash in bandcamp

Sunday, January 8th, 2012

Here is a quick exercise to explore web scraping in bash on Bandcamp.  We’ll get our hands dirty with some simple console and bash scripting. This is written in January 2012, so things may change between now and when you read this.

example album page

If you take a look at the source on Bandcamp, you notice that all of the music data is located in a Javascript variable.  On your browser there should be a ‘view-source’ option.  In Chrome on the mac it is command-option-u.

 

Notice that in the comment Bandcamp includes information on their terms of service and their stance on applications that pull music off the site. You should probably read those and remember this is only meant as an exercise.  Looking at the FAQ I imagine they would be okay with this tutorial.  (Otherwise, if not okay, message me and I’ll take this down.)

So open up your web console.  I’m using Chrome, so it is command-option-i.  You can then explore the object TralbumData. You can retrieve the track information with TralbumData['trackinfo'].

We’re going to pull the titles and the files.

First get the titles. You can use console.log(str) to get things printed onto the console. We’ll write a loop over all of the objects and print out the ‘title’ value for each.

for (i in TralbumData['trackinfo']){console.log(TralbumData['trackinfo'][i]['title']);}

Now copy that and paste it in a txt file called titles.

Then we need to get the URLs for the files. We can use the same loop, replacing ‘title’ with ‘file’

for (i in TralbumData['trackinfo']){console.log(TralbumData['trackinfo'][i]['file']);}
That will give you a list of URLs. Put that in a file called playlist.

Now in Bash we’ll download the file at each URL to the appropriate title.mp3 file. It took some googling to figure this out since I don’t normally do any bash scripting beyond simple command line tools. Open up Terminal. First we’ll create an array called names and put each line of titles into an entry in names. In doing this we also have to change the Internal Field Separator. This is so that when we loop over cat titles we’ll get a new item for each line, rather than for each word. You can do that with this command:

IFS='
'

The second quote is on a new line to define the IFS as a newline (IFS='\n' doesn’t work [source]). And then run the loop to fill the names array.

j=0;for i in `cat titles`; do names[$j]=$i; j=$(expr $j + 1); done
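To see what changing the separator actually does, here is a minimal sketch that counts how many items the shell produces from the same file before and after. The track names and the scratch directory are made up for the demo.

```shell
# Sample titles file (made-up track names) in a scratch directory.
tmpdir=$(mktemp -d)
printf 'First Song\nSecond Song\n' > "$tmpdir/titles"

# With the default IFS, the unquoted $(cat ...) splits on spaces too.
set -- $(cat "$tmpdir/titles")
words=$#

# With IFS set to a newline only, each line stays in one piece.
old_ifs=$IFS
IFS='
'
set -- $(cat "$tmpdir/titles")
lines=$#
IFS=$old_ifs   # put the separator back when done

echo "default IFS: $words items, newline IFS: $lines items"
```

This prints "default IFS: 4 items, newline IFS: 2 items". Saving the old value and restoring it afterwards keeps the change from surprising later commands in the same Terminal session.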

names contains the track titles. To get the value out for index 0, you’d use ${names[0]}. We’ll use that for the output file name. This time we’ll loop over the entries in playlist and use wget to download the file at each url.

j=0;for i in `cat playlist`; do wget -O "${names[$j]}.mp3" "$i"; j=$(expr $j + 1); done
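A possible alternative, not the method used above, is to skip the array and the IFS change and let paste join titles and playlist line by line, separated by a tab. This is only a sketch under assumptions: the track names, the example.com URLs and the dry_run switch are all made up, and it would break on a title that itself contains a tab or a slash.

```shell
# dry_run=1 records the wget commands instead of running them.
dry_run=1

# Made-up stand-ins for the titles and playlist files.
tmpdir=$(mktemp -d)
printf 'First Song\nSecond Song\n' > "$tmpdir/titles"
printf 'http://example.com/a.mp3\nhttp://example.com/b.mp3\n' > "$tmpdir/playlist"

# paste puts line N of each file side by side, separated by a tab.
paste "$tmpdir/titles" "$tmpdir/playlist" > "$tmpdir/pairs"

# Split each pair on the tab only, so spaces in titles survive.
tab=$(printf '\t')
while IFS=$tab read -r title url; do
  if [ "$dry_run" = 1 ]; then
    echo "wget -O \"$title.mp3\" \"$url\""
  else
    wget -O "$title.mp3" "$url"
  fi
done < "$tmpdir/pairs" > "$tmpdir/plan"

cat "$tmpdir/plan"
```

Setting dry_run=0 would run the downloads for real. Because the IFS assignment sits in front of read, it only applies to that one command and nothing has to be restored afterwards.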

The -O command line switch for wget specifies that the following argument will be the name of the output file.