Unlocking Media on the Internet: Playing with WGET

When Barcamp Boston rolled around earlier this month, I took the opportunity to give a talk on something I've recently been getting into: wget. And specifically, getting the media I want with it.

As I planned the talk, its structure evolved into a mini-workshop. For a while I thought about all the awesome tech talks I had been to, and realized that some topics are small enough that it wouldn't hurt to get laptops out and take the first steps toward making something, say starting a node server for the first time, or throwing some data into CouchDB. Usually the first steps are the most difficult, so why not take them together?

I chose one of the smaller rooms, with the idea that we could face each other and have a discussion. Attendees ranged from systems programmers who use wget on a daily basis to folks who had never touched the command line.

For the most part I wanted to show that a lot of the media that shows up in our browsers is media we can have: we can download it.

I presented my talk as a txt file in Vim, zoomed way in. I thought this worked well because everyone could see what line I was on and what percentage of the way through I was.

My interest in wget grew as I started using command-line tools much more in my research work as a scientist: tools like cat, sort, uniq, wc, and grep are quick ways to get a handle on data.
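For a taste of what I mean, here is a quick sketch of that kind of pipeline (data.txt is just a made-up example file), which counts how often each line appears and shows the most common ones:

# count unique lines and sort by frequency (data.txt is a placeholder)
cat data.txt | sort | uniq -c | sort -rn | head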

I really love media and music. Then someone showed me how to find a video's URL through the Activity window in Safari; the same URL can be handed straight to wget. Awesome.

So first I downloaded a video off Facebook, using the method described in this post. Facebook, where all of your precious media, photos and movies, can feel locked down.
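Once you have the direct media URL from the browser, the download itself is a one-liner. Roughly like this (the URL here is only a placeholder; the real one comes from the browser's network activity):

# save the video under a friendly name (placeholder URL)
wget -O video.mp4 'https://video.example.com/12345.mp4'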

Then I described how I used wget -O - url to pull down a web page, which you can pipe into grep for scraping. I pulled all of the links to images on the front page of a photographer I recently found, Nicholas Alan Cope, and downloaded each with wget. It would have taken a while to do individually, but with wget, grep, and this awesome online regex tool, the process was faster and I learned something awesome. Basically, pipe stdout to

grep -o 'http://[a-z0-9./A-Z]*_VEDAS[0-9a-b_]*\.jpg' > image_links.txt

to get the image links. I found that regex by pasting the page source into the online regex tool and iterating on a regular expression until it matched all the links. The same process works to pull links or images from most pages. With a txt file containing one link per line, you can use wget -i image_links.txt to download from each URL.
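Put together, the whole scrape looks roughly like this (a sketch; I'm assuming the site's front-page URL, and the regex is the one above):

# grab the front page, pull out the image links, then fetch them all
wget -O - 'http://www.nicholasalancope.com/' | grep -o 'http://[a-z0-9./A-Z]*_VEDAS[0-9a-b_]*\.jpg' > image_links.txt
wget -i image_links.txt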

Finally, I showed how to download any track off of SoundCloud, using a process similar to the Facebook video download. Find the streaming URL by looking in the Safari Activity window, or under Resources in the Chrome developer tools.
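The command ends up the same shape as the Facebook one. One thing worth noting, sketched here with a placeholder URL: streaming URLs often carry query strings, so quote them to keep the shell from mangling them.

# save the stream as an mp3 (placeholder URL; quotes protect the query string)
wget -O track.mp3 'https://media.example.com/stream/67890.mp3?token=abc123'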

The response was wonderful. People were intrigued; even those who used wget regularly didn't know you could supply it with a streaming URL. A few asked whether this was legal, and another participant responded that it was fine: you're only downloading what your browser already downloaded, and it's ridiculous to think that's illegal. A lively political discussion almost began, but the Barcamp closing sessions were starting.

I got through the material, did several live demonstrations, shared my excitement about something seemingly niche, and sparked a lively conversation. The talk felt pretty great.