tags:

views:

1781

answers:

7

How do I convert the contents of to CSV format? Is there a library or linux program that does this? This is similiar to copy tables in Internet Explorer, and pasting them into Excel.

+5  A: 

I'm not sure if there is pre-made library for this, but if you're willing to get your hands dirty with a little Perl, you could likely do something with Text::CSV and HTML::Parser.

Chris Simmons
This is exactly the answer I'd have given. +1 :-)
Chris Jester-Young
A: 

This method is not really a library OR a program, but for ad hoc conversions you can

  • put the HTML for a table in a text file called something.xls
  • open it with a spreadsheet
  • save it as CSV.

I know this works with Excel, and I believe I've done it with the OpenOffice spreadsheet.

But you probably would prefer a Perl or Ruby script...

pavium
+2  A: 

Here's a ruby script that uses nokogiri -- http://nokogiri.rubyforge.org/nokogiri/

require 'nokogiri'

doc = Nokogiri::HTML(table_string)

doc.xpath('//table//tr').each do |row|
  row.xpath('td').each do |cell|
    print '"', cell.text.gsub("\n", ' ').gsub('"', '\"').gsub(/(\s){2,}/m, '\1'), "\", "
  end
  print "\n"
end

Worked for my basic test case.

audiodude
+2  A: 

With Perl you can use the HTML::TableExtract module to extract the data from the table and then use Text::CSV_XS to create a CSV file or Spreadsheet::WriteExcel to create an Excel file.

jmcnamara
+1  A: 

Here is an example using pQuery and Spreadsheet::WriteExcel:

use strict;
use warnings;

use Spreadsheet::WriteExcel;
use pQuery;

my $workbook = Spreadsheet::WriteExcel->new( 'data.xls' );
my $sheet    = $workbook->add_worksheet;
my $row = 0;

pQuery( 'http://www.blahblah.site' )->find( 'tr' )->each( sub{
    my $col = 0;
    pQuery( $_ )->find( 'td' )->each( sub{
        $sheet->write( $row, $col++, $_->innerHTML );
    });
    $row++;
});

$workbook->close;

The example simply extracts all tr tags that it finds into an excel file. You can easily tailor it to pick up specific table or even trigger a new excel file per table tag.

Further things to consider:

  • You may want to pick up td tags to create excel header(s).
  • And you may have issues with rowspan & colspan.

To see if rowspan or colspan is being used you can:

pQuery( $data )->find( 'td' )->each( sub{ 
    my $number_of_cols_spanned = $_->getAttribute( 'colspan' );
});

/I3az/

draegtun
A: 

OpenOffice.org can view HTML tables. Simply use the open command on the HTML file, or select and copy the table in your browser and then Paste Special in OpenOffice.org. It will query you for the file type, one of which should be HTML. Select that and voila!

Happy Gilmore