I need to scrape Form 10-K reports (i.e. annual reports of US companies) from the SEC website for a project.

The trouble is that companies do not use the same format when filing this data. For example, the real estate data for two different companies could be displayed as below:

1st company

Property name   State  City     Ownership   Year  Occupancy Total Area
-------------   -----  ------   ---------   ----  --------- ----------
ABC Mall         TX    Dallas   Fee         2007    97%       1,347,377
XYZ Plaza        CA    Ontario  Fee         2008    85%       2,252,117



2nd company

Property          % Ownership  % Occupancy  Rent   Square Feet
---------------   -----------  -----------  -----  -----------
New York City
  ABC Plaza       100.0%        89.0%     38.07    2,249,000 
  123 Stores      100.0%        50.0%     18.00    1,547,000 
Washington DC Office
  12th street     .......
  2001, J Drive   .......

etc.

Likewise, the data layout could be entirely different for other companies.

I would like to know if there are better ways to scrape this type of heterogeneous data than writing complex regex searches.

I am free to use Java, Perl, Python or Groovy for this work.

+2  A: 

I'd be inclined to keep a library of meta files, one describing the layout of each page you want to scrape, and use the matching file when extracting the data.

That way you don't need complex regex commands, and if a site changes its design you simply update a single one of your files.

How you create the meta files is up to you, but pertinent class names or tags might be a good start, along with a description of how to extract the data from each tag.
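To illustrate the idea in Python (one of the languages the asker listed): below is a minimal sketch where a hypothetical `LAYOUTS` dict stands in for the per-company meta files, mapping each field name to its column's character span. In practice each descriptor would live in its own external config file, so a change in a filer's format means editing one descriptor rather than any code. The column spans shown match the first company's sample table above and are illustrative only.

```python
# Hypothetical per-company layout descriptors: field name -> (start, end)
# character span of that column in the fixed-width table. In a real system
# these would be loaded from one meta file per filer.
LAYOUTS = {
    "company1": {
        "property":   (0, 16),
        "state":      (16, 23),
        "city":       (23, 32),
        "ownership":  (32, 44),
        "year":       (44, 50),
        "occupancy":  (50, 60),
        "total_area": (60, 72),
    },
}

def parse_table(text, layout):
    """Slice each data line into named fields using the layout's column spans."""
    rows = []
    for line in text.splitlines():
        stripped = line.strip()
        # Skip blank lines and the dashed separator row.
        if not stripped or set(stripped) <= {"-", " "}:
            continue
        # Skip the header row (crude check, good enough for a sketch).
        if stripped.startswith("Property"):
            continue
        rows.append({name: line[a:b].strip() for name, (a, b) in layout.items()})
    return rows

# The first company's table from the question, verbatim.
sample = (
    "Property name   State  City     Ownership   Year  Occupancy Total Area\n"
    "-------------   -----  ------   ---------   ----  --------- ----------\n"
    "ABC Mall         TX    Dallas   Fee         2007    97%       1,347,377\n"
    "XYZ Plaza        CA    Ontario  Fee         2008    85%       2,252,117\n"
)

rows = parse_table(sample, LAYOUTS["company1"])
```

Handling the second company's grouped layout (city headings with indented properties) would need an extra rule in its descriptor, e.g. "a line with no ownership value starts a new group", but the parser itself stays generic.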

I'm not sure if there is a tool out there that does all of that.

The other, nicer, way would be to contact the owners of these sites and see if they provide a feed, such as a web service, that you can use to get the data. That would save a lot of heartache, I should think.

griegs
The "cool, vintage 2007" way would be to ask for an RSS feed. Good luck finding someone at the SEC who understands either "Web Service" or "RSS feed".
Dan
Hehe, yeah, agreed. Maybe they could send a parchment via carrier pigeon once a month.
griegs