I need to scrape Form 10-K reports (i.e. annual reports of US companies) from the SEC website for a project.
The trouble is that companies do not file this data in the same format. For example, real estate data for two different companies could be laid out as shown below.
1st company:
Property name  State  City     Ownership  Year  Occupancy  Total Area
-------------  -----  -------  ---------  ----  ---------  ----------
ABC Mall       TX     Dallas   Fee        2007  97%        1,347,377
XYZ Plaza      CA     Ontario  Fee        2008  85%        2,252,117
2nd company:
Property              % Ownership  % Occupancy  Rent   Square Feet
--------------------  -----------  -----------  -----  -----------
New York City
ABC Plaza             100.0%       89.0%        38.07  2,249,000
123 Stores            100.0%       50.0%        18.00  1,547,000
Washington DC Office
12th street           .......
2001, J Drive         .......
etc.
Likewise, the data layout could be entirely different for other companies.
I would like to know whether there are better ways to scrape this kind of heterogeneous data than writing complex regex searches.
I am free to use Java, Perl, Python, or Groovy for this work.
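To make the problem concrete, here is a minimal Python sketch of one approach that avoids per-company regexes for tables like the first one: it reads the column boundaries off the dashed ruler line and slices every row at those positions. It assumes the filing text keeps its fixed-width alignment and that a dashed ruler follows the header. The function name `parse_ruled_table` and the `SAMPLE` data are hypothetical, made up for illustration.

```python
import re

# Hypothetical sample, modelled on the first company's table.
SAMPLE = """\
Property name  State  City     Ownership  Year  Occupancy  Total Area
-------------  -----  -------  ---------  ----  ---------  ----------
ABC Mall       TX     Dallas   Fee        2007  97%        1,347,377
XYZ Plaza      CA     Ontario  Fee        2008  85%        2,252,117
"""

def parse_ruled_table(text):
    """Parse a fixed-width table whose header row is followed by a dashed ruler."""
    lines = [ln for ln in text.splitlines() if ln.strip()]

    # The ruler is the first line after the header made only of dashes and spaces.
    ruler_idx = next(
        i for i, ln in enumerate(lines)
        if i > 0 and "-" in ln and re.fullmatch(r"[-\s]+", ln)
    )

    # Each run of dashes marks where a column starts.
    starts = [m.start() for m in re.finditer(r"-+", lines[ruler_idx])]

    # A column runs from its start to the start of the next one;
    # the last column runs to the end of the line.
    bounds = list(zip(starts, starts[1:] + [None]))

    def cut(line):
        return [line[a:b].strip() for a, b in bounds]

    header = cut(lines[ruler_idx - 1])
    return [dict(zip(header, cut(ln))) for ln in lines[ruler_idx + 1:]]

if __name__ == "__main__":
    for row in parse_ruled_table(SAMPLE):
        print(row)
```

This handles the first layout but not the second, where group rows such as "New York City" carry no values, and that is exactly the kind of variation I am asking about.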