views:

1345

answers:

7

Hi,

I have an HTML page which has a flash chart(FusionCharts) and HTML table. I need to convert this whole thing into Excel. HTML table should be displayed in cells of excel sheet. Flash chart can be displayed as an image.

Is there any open source API that we could use for achieving this. Could you let me know what are the possible options.

Can this be done by using javascript alone.

Thanks & Regards `Shafi

+1  A: 

There's a very good Java POI api that would let you do that, but it's Java. http://poi.apache.org/

If you're on Win32 you can also use Excel's COM api, there are quite a few tutorials on the net.

alamar
+3  A: 

The HTML table is relatively easy. You can download the page, parse the HTML (there are various HTML parsing libraries available), extract the table and convert it into CSV (which Excel can load), or directly create an Excel file, e.g. using Java POI, as suggested above.

The Flash part is significantly harder. There are quite a few tools available to capture flash to an image, you'd need to use one of them. This can be tricky, as Flash might be interactive, so you'd possibly have to remote-control the Flash part so it shows the right image before capturing. Hard to tell without more info.

That said, screen-scraping (which is what you're doing) is always labour-intensive and fragile. You should really push for a better interface to get your data from, it will save loads of hassle in the long run.

sleske
+1  A: 

What you're trying to do is fragile and difficult to maintain. You should attempt to create a csv feed to fetch the data. All it takes is for someone to come along and modify the HTML and your scraper will throw up on it (probably years after anyone remembers how your program works).

Try to get CSV and image data from the original source (ie, database or whatever) and build the Excel file from that.

SpliFF
+1  A: 

I will add to SpliFF's answer that when you have your data as a CSV file you can set the mime type of the page to application/vnd.ms-excel which will open the page in Excel

Keith Bloom
wow. I didn't know that. But Excel often sucks for CSV in regions that use ";" as the list separator as opposed to ","...
Daren Thomas
I agree and the list of things that Excel sucks at could go on for a while :)I mainly used this to present reports coming from a server side script so I had the ability to ensure the data format wouldn't hurt Excel.
Keith Bloom
+1 for mentioning regions with ";" separator. In the Netherlands the decimal separator is a comma. When floats have a dot, excel interprets the number as a string, unless I change the regional settings in windows. Generating CSV on the server and expecting it to load in everyone's excel is guaranteed to fail.
Wouter van Nifterick
+1  A: 

Excel can convert HTML tables by default. The easiest way to force it to do this is to save the HTML file with an XLS extension. Excel will then open the XLS as if it were its native workbook.

Bojan Resnik
+2  A: 

Just set the content type of the page to "application/vnd.ms-excel". If the html page is just a table it will open with excel and look perfect. You can even add background colors and font styles.

Try some of these content types

application/excel
application/vnd.ms-excel
application/x-excel
application/x-msexcel
MikeNereson
+1  A: 

I cannot offer any advice on the Flash part, but I have done HTML table to Excel many times. Yes, Excel can open HTML tables but most HTML tables out there have extraneous crap in them that can make it fragile to consistently parse the tables.

CPAN module HTML::TableExtract isa wonderful module that allows you to focus on the non-presentation specific aspects of the table you are trying to extract. Just specify the column headings you are interested in and maybe specify the title or class of the table and you are mostly set. You might have to post process the rows returned a little, but that is considerably easier than dealing with the underlying tag soup in all its glory.

Further, for output to Excel format, stick with Spreadsheet::WriteExcel rather than the OLE interface. That way, you do not depend on having Excel installed for your program to work and things go a little faster.

Make sure you specify the data type of cells if you do not want content to be changed automatically by Excel upon opening the files (another reason I do not like sending around CSV files). Use a configuration file for formatting information so that you can change how the spreadsheet looks without having to change the program.

You can always use Excel's built-in charting facilities to replace the web site graphs.

This combination has enabled me to generate pretty good looking documents comprising several hundreds of megabytes of scraped data (with logos and image links etc) using just a few hundred lines of Perl and a couple of days' work. Good luck.

Sinan Ünür