Using ScraperWiki to pull data from the Access to Medicine Index

I made my first scraper on ScraperWiki! I met Francis, the founder, a few days ago. He has an incredible wealth of information about open data, the open web, and all sorts of awesome tech organizations and projects. He’d begun doing this stuff before there were words to describe it, which makes him something of a pioneer.

ScraperWiki takes code with some of the shortest lifespans around (‘scrapers’ that pull data off web pages) and puts it in a wiki format to allow updates, collaboration, and discoverability. Basically, it’s a great architecture to help structure data on the web.

And so when I got home, I totally had to write a scraper. I chose the Access to Medicine Index. It’s not too big, but it fits well with the mission of The Open Finance Initiative.

We want to collect as much information and data as we can about corporate sustainability, with the intention of supporting an open ecosystem of tools that individual investors can use to invest in good companies. (If you know any good datasets, send me a message.)

So here’s a start: the Access to Medicine Index on ScraperWiki.

Here is the code that I used to build it:

Notice some of the labels are Dutch (e.g. ‘naam’)! I didn’t notice when I originally went on the site, but it’s based in the Netherlands. Kinda cool how this shows up more clearly in the website source code.

I’ll point out a few things that were super helpful once I figured them out. First, the ScraperWiki tutorial is great for getting started. It uses the library ‘lxml.html’.

And this reference for the library helped with the details and with figuring out which methods to use.

So the odd thing about this dataset is that the numbers were tucked into the title attribute of a div tag:

<div style="width: 13px" title="Management: 0.44"></div>
<div style="width: 10px" title="Influence: 0.33"></div>
<div style="width: 28px" title="R&D: 0.94"></div>
<div style="width: 22px" title="Pricing: 0.74"></div>
<div style="width: 14px" title="Patenting: 0.48"></div>
<div style="width: 12px" title="Capability: 0.39"></div>
<div style="width: 13px" title="Philanthropy: 0.43"></div>

The names of the companies were easy to get to, but to get at these numbers I traversed to the right <div> and used a regular expression on the <div> as a string.
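
As a rough illustration of that last step, here is a minimal sketch of pulling the label and the number out of one of those divs once it is a string. The pattern and the helper name parse_title are my own assumptions about the title format, not the expression from the original scraper.

```python
import re

# Assumed title format: "Label: 0.44". This pattern is illustrative only,
# not the regular expression used in the original scraper.
TITLE_RE = re.compile(r'title="([^:"]+):\s*([0-9.]+)"')


def parse_title(div_html):
    """Return (label, score) from a serialized <div>, or None if no match."""
    match = TITLE_RE.search(div_html)
    if match is None:
        return None
    return match.group(1), float(match.group(2))


# Sample string copied from the markup shown above.
sample = '<div style="width: 13px" title="Management: 0.44"></div>'
print(parse_title(sample))
```

Capturing the label along with the score means each number can be stored under its own column name later on.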

The method ‘cssselect’ returns a list of all the elements underneath a parent that match the CSS selector you pass in:

parent.cssselect('childtag')

So for

table = root.cssselect("tbody")[0]

you get the first element under root that has the tag ‘tbody’. ‘find_class’ is pretty useful too: it returns a list of every element within an element that has the given class name.

This next part is pretty cool: if you have a list of some DOM elements (say ‘tr’ or ‘div’), you can set up a loop to go through their child elements. That’s basically what I do for each td element: I go through the divs in it with a loop, then convert each div to a string with ‘tostring’ so I can do some precise scraping, as follows:

string = lxml.html.tostring(d)

At that point, since it’s a string, I can apply some regular expression methods to pull out what I want.
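
Putting the loop, ‘tostring’, and the regular expression together looks roughly like the sketch below. It is written for Python 3, where ‘tostring’ returns bytes unless you pass encoding="unicode". The markup is a hypothetical cell modelled on the divs shown earlier, and the pattern is my assumption about the title format rather than the original expression.

```python
import re
import lxml.html

# Hypothetical markup modelled on the div snippets shown earlier.
html = (
    '<table><tr><td>'
    '<div style="width: 13px" title="Management: 0.44"></div>'
    '<div style="width: 10px" title="Influence: 0.33"></div>'
    '</td></tr></table>'
)
root = lxml.html.fromstring(html)

# Assumed title format: "Label: 0.44".
title_re = re.compile(r'title="([^:"]+):\s*([0-9.]+)"')

scores = {}
# iter("td") walks every <td> in the tree; cssselect("td") would give
# the same elements as a list.
for td in root.iter("td"):
    for d in td:  # iterating an element yields its child elements
        # encoding="unicode" gives a str instead of bytes on Python 3
        string = lxml.html.tostring(d, encoding="unicode")
        match = title_re.search(string)
        if match:
            scores[match.group(1)] = float(match.group(2))

print(scores)
```

A shorter route would be to read the attribute directly with d.get("title") and split on the colon, which skips the round trip through a string. The version above sticks to the tostring-plus-regex approach described here.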

The scraperwiki module is specific to code that runs in the ScraperWiki cloud. You can use it to save your data into a table, among other things I have yet to explore. I’m excited about using it for OFI to collaboratively collect and maintain data for open use. If this sounds cool to you, hop on the GitHub or the Mozilla Drumbeat page for OFI.
