I am doing a little data scraping. There are three types of files from which I am scraping data:

1- HTML
2- PDF
3- Excel (xls)

For HTML I am comfortable; I am using HTML Agility Pack for that.

For PDF and Excel I need suggestions.

Thanks in advance.

+1  A: 

Concerning Excel: if you are in a Microsoft environment, you can either use Office Automation or read the file through OLEDB. In a Java environment, look at Apache POI.

EDIT: Concerning PDF, in Java try Apache PDFBox. It can also work in .NET using IKVM.

renick
Absolutely recommend POI if you prefer a Java/Groovy solution. Perl also has some pretty good APIs for spreadsheets and PDFs.
James Anderson
A: 

Thanks for the reply. I think the Excel part is solved. But for PDF, how do I extract data from it? Any ideas?

Sakhawat Ali
A: 

I can recommend Cogniview's PDF2XL, a reasonably inexpensive commercial product, to extract data from tables in PDF files into Excel. We have used it with great success.

Govert