open finance initiative

download xbrl data from the SEC, steps towards company transparency

Thursday, August 11th, 2011

XBRL stands for eXtensible Business Reporting Language. It’s a data format that companies use when reporting to the SEC.  I recently became interested in it because it could help identify sustainable investments and rate the environmental, social, and governance standing of a company, as part of ongoing R&D for the open finance initiative.  A lot of data about company practices is closed, but there is a ton of XBRL data, and it’s freely available online.

For a brief look at the implications of this data format, XBRL for Dummies describes XBRL as “A revolution for small investors, the most important shareholder initiative in a decade, and a leveler of the investment playing field”.  Previously, perhaps only research companies with enough resources could collect information on businesses.  Now, XBRL opens this information up to a much wider set of people, most importantly those who don’t have the capital for research, data collection, and data cleaning.

If you want to pull it all, just open up Terminal on your Mac (it’s under /Applications/Utilities/Terminal.app).  Create a folder to keep the data in:

mkdir xbrl_data

cd xbrl_data

and then type in the command to pull all of the data:

wget ftp://ftp.sec.gov/edgar/monthly/xbrl*

If you don’t have wget, you can download it here.  This will populate that directory with all of the XBRL data from April 2005 to the present, for each month.
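
If you’d rather stay in Python than install wget, here is a rough sketch of the same download using the standard library’s ftplib. It assumes the server still accepts anonymous FTP at the address above; the names matching_names and download_all are just ones I made up for illustration, not part of any SEC tooling.

```python
import fnmatch
import os
from ftplib import FTP

HOST = "ftp.sec.gov"            # host from the wget command above
REMOTE_DIR = "/edgar/monthly"   # directory from the wget command above
PATTERN = "xbrl*"               # same wildcard that wget was given


def matching_names(names, pattern=PATTERN):
    """Return the sorted file names whose base name matches the wildcard."""
    return sorted(
        n for n in names if fnmatch.fnmatch(os.path.basename(n), pattern)
    )


def download_all(dest_dir="xbrl_data"):
    """Download every matching file into dest_dir (hypothetical helper).

    Assumes anonymous FTP login is allowed on HOST.
    """
    os.makedirs(dest_dir, exist_ok=True)
    with FTP(HOST) as ftp:
        ftp.login()                 # anonymous login
        ftp.cwd(REMOTE_DIR)
        for name in matching_names(ftp.nlst()):
            local_path = os.path.join(dest_dir, os.path.basename(name))
            with open(local_path, "wb") as fh:
                ftp.retrbinary("RETR " + name, fh.write)
```

Calling download_all() from a Python prompt should then fill xbrl_data the same way the wget command does, one file per month.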

If you want to look around at other data the SEC has, type in your browser

ftp.sec.gov

and you can poke around.  You can navigate to the same folder as before and download individual sets by hand as well:

ftp.sec.gov/edgar/monthly/

Currently I’m looking for something that will make it easier to play around with the files and get an idea of what XBRL is and how I can use it.  I spent some time looking for libraries to parse the XBRL data, and found xbrlware, a Ruby library, and Arelle, a Python platform.  If you come across other useful libraries, share them.

update (8/14/11):

If you find yourself installing a lot of UNIX tools on your Mac, Homebrew is clean and easy to use.  To install Homebrew, enter this in Terminal

/usr/bin/ruby -e "$(curl -fsSL https://raw.github.com/gist/323731)"

And then to install wget

brew install wget

Done!

 

Using Scraperwiki to pull data from the Access to Medicine Index

Friday, June 24th, 2011

I made my first scraper on scraperwiki! I met Francis, the founder, a few days ago. He has an incredible wealth of information regarding open data, the open web, and all sorts of awesome tech organizations and projects. He’d begun doing this stuff before there were words to describe it, so he’s something of a pioneer.

Scraperwiki takes code with some of the shortest lifespans, ‘scrapers’ that pull data off web pages, and puts it in a wiki format to allow updates, collaboration, and discoverability. Basically, it’s a great architecture to help structure data on the web.

And so when I got home, I totally had to write a scraper. I chose the Access to Medicine Index. It’s not too big, but it fits well with the mission of The Open Finance Initiative.

We want to collect as much information and data as we can regarding corporate sustainability, with the intention of supporting an open ecosystem of tools that individual investors can use to invest in good companies. (If you know any good datasets, send me a message.)

So here’s a start. The access to medicine index on scraperwiki.

Here is the code that I used to build it:

Notice some of the labels are Dutch! (ex. ‘naam’)  I didn’t notice when I originally went on the site, but it’s based in the Netherlands. Kinda cool how this shows up more clearly in the website source code.

I’ll point out a few things that were super helpful after I figured them out. First, the scraperwiki tutorial is a great way to get started. It uses the library ‘lxml.html’.

And this reference for the library helped with the details and with figuring out which methods to use.

So the odd thing about this dataset is that the numbers were tucked into the title attribute of div tags:

<div style="width: 13px" title="Management: 0.44"></div>
<div style="width: 10px" title="Influence: 0.33"></div>
<div style="width: 28px" title="R&D: 0.94"></div>
<div style="width: 22px" title="Pricing: 0.74"></div>
<div style="width: 14px" title="Patenting: 0.48"></div>
<div style="width: 12px" title="Capability: 0.39"></div>
<div style="width: 13px" title="Philanthropy: 0.43"></div>

The names of the companies were easy to get to, but to get at these numbers I traversed to the right <div> and used a regular expression on the <div> as a string.
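
To make the regular expression step concrete, here is a minimal sketch that runs over the seven div strings shown above. The pattern and the names TITLE_RE and parse_score are my own illustration, not necessarily what the original scraper used.

```python
import re

# The seven <div> strings from the page source, as plain text.
DIV_LINES = [
    '<div style="width: 13px" title="Management: 0.44"></div>',
    '<div style="width: 10px" title="Influence: 0.33"></div>',
    '<div style="width: 28px" title="R&D: 0.94"></div>',
    '<div style="width: 22px" title="Pricing: 0.74"></div>',
    '<div style="width: 14px" title="Patenting: 0.48"></div>',
    '<div style="width: 12px" title="Capability: 0.39"></div>',
    '<div style="width: 13px" title="Philanthropy: 0.43"></div>',
]

# Capture the label before the colon and the number after it.
TITLE_RE = re.compile(
    r'title="(?P<label>[^":]+):\s*(?P<score>\d+(?:\.\d+)?)"'
)


def parse_score(div_string):
    """Return (label, score) from one div string, or None if no match."""
    match = TITLE_RE.search(div_string)
    if match is None:
        return None
    return match.group("label"), float(match.group("score"))


# Build a label -> score dictionary for one company.
scores = dict(parse_score(line) for line in DIV_LINES)
print(scores)
```

Each div becomes one label and one number, so a company ends up as a small dictionary that is easy to save as a table row.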

The method ‘cssselect’ returns a list of all the descendant elements of a parent that match the given CSS selector:

parent.cssselect('childtag')

So for

table = root.cssselect("tbody")[0]

you get the first element under root that has the tag “tbody”. ‘find_class’ is pretty useful too; it returns every element within an element that has the given class name.

This next part is pretty cool: if you have a list of some DOM elements (so ‘tr’ or ‘div’), you can set up a loop to go through the child elements. That’s basically what I do for each td element: I go through the divs in it with a loop, then I convert each div to a string with ‘tostring’ to do some precise scraping, as follows:

string = lxml.html.tostring(d)

At that point, since it’s a string, I can apply some regular expression methods to pull out what I want.
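
Put together, the walk looks roughly like the sketch below. The sample HTML, the class names ‘name’ and ‘scores’, the company name, and the function scrape_rows are all made up for illustration; the real page’s markup is different. I also pass encoding="unicode" so that tostring gives back a str rather than bytes on Python 3.

```python
import html
import re

import lxml.html

# Hypothetical stand-in for the page source; not the real site's markup.
SAMPLE_HTML = """
<table>
  <tbody>
    <tr>
      <td class="name">Example Pharma</td>
      <td class="scores">
        <div style="width: 13px" title="Management: 0.44"></div>
        <div style="width: 28px" title="R&amp;D: 0.94"></div>
      </td>
    </tr>
  </tbody>
</table>
"""

TITLE_RE = re.compile(r'title="([^":]+):\s*([0-9.]+)"')


def scrape_rows(page_source):
    """Return one dict per table row: the company name plus its scores."""
    root = lxml.html.fromstring(page_source)
    rows = []
    for tr in root.iter("tr"):
        name_cells = tr.find_class("name")
        if not name_cells:
            continue  # skip rows without a company name
        record = {"name": name_cells[0].text_content().strip()}
        for d in tr.iter("div"):
            # Serialize the element so a regular expression can be applied.
            string = lxml.html.tostring(d, encoding="unicode")
            match = TITLE_RE.search(string)
            if match:
                # Undo any entity escaping (e.g. &amp;) added on output.
                label = html.unescape(match.group(1))
                record[label] = float(match.group(2))
        rows.append(record)
    return rows


rows = scrape_rows(SAMPLE_HTML)
print(rows)
```

Reading d.get("title") directly would skip the round trip through a string, but the sketch follows the tostring-plus-regex route described above.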

The scraperwiki module is specific to code that runs in the scraperwiki cloud. You can use it to save your data into a table, among other things I have yet to explore.  I’m excited about using it for OFI to collaboratively collect and maintain the data for open use.  If this sounds cool to you, hop on the GitHub or the Mozilla Drumbeat page for OFI.