How do I convert the contents of an HTML table to CSV format? Is there a library or Linux program that does this? This is similar to copying tables in Internet Explorer and pasting them into Excel.
I'm not sure if there is a pre-made library for this, but if you're willing to get your hands dirty with a little Perl, you could likely do something with Text::CSV and HTML::Parser.
This method is not really a library OR a program, but for ad hoc conversions you can
- put the HTML for a table in a text file called something.xls
- open it with a spreadsheet
- save it as CSV.
I know this works with Excel, and I believe I've done it with the OpenOffice spreadsheet.
But you probably would prefer a Perl or Ruby script...
Here's a Ruby script that uses Nokogiri (http://nokogiri.rubyforge.org/nokogiri/):
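    require 'nokogiri'

    # table_string holds the HTML that contains the table
    doc = Nokogiri::HTML(table_string)
    doc.xpath('//table//tr').each do |row|
      cells = row.xpath('th|td').map do |cell|
        # collapse whitespace and double embedded quotes, as CSV expects
        text = cell.text.gsub(/\s+/, ' ').strip.gsub('"', '""')
        %("#{text}")
      end
      # join with commas so there is no trailing separator
      puts cells.join(',')
    end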
Worked for my basic test case.
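If you would rather not build the quoting by hand, Ruby's standard csv library can do it once the cell text has been extracted. A minimal sketch, assuming the rows are already available as arrays of strings; the sample rows below are made up for illustration and stand in for whatever the table contains:

```ruby
require 'csv'

# Hypothetical rows, standing in for cell text already pulled out of the table.
rows = [
  ['Name', 'Quote'],
  ['Alice', 'She said "hi"'],
  ['Bob', 'one, two']
]

# CSV.generate quotes fields only where needed and doubles embedded quotes.
csv_text = CSV.generate do |csv|
  rows.each { |row| csv << row }
end

puts csv_text
```

Parsing the result back with CSV.parse should give the original rows, which is a quick way to check that nothing was mangled.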
With Perl you can use the HTML::TableExtract module to extract the data from the table and then use Text::CSV_XS to create a CSV file or Spreadsheet::WriteExcel to create an Excel file.
Here are a few options:
http://groups.google.com/group/ruby-talk-google/browse%5Fthread/thread/cfae0aa4b14e5560?hl=nn
http://ouseful.wordpress.com/2008/10/14/data-scraping-wikipedia-with-google-spreadsheets/
http://stackoverflow.com/questions/259091/how-can-i-scrape-an-html-table-to-csv
Here is an example using pQuery and Spreadsheet::WriteExcel:
    use strict;
    use warnings;
    use Spreadsheet::WriteExcel;
    use pQuery;

    my $workbook = Spreadsheet::WriteExcel->new( 'data.xls' );
    my $sheet    = $workbook->add_worksheet;
    my $row      = 0;

    # one spreadsheet row per tr, one cell per td
    pQuery( 'http://www.blahblah.site' )->find( 'tr' )->each( sub{
        my $col = 0;
        pQuery( $_ )->find( 'td' )->each( sub{
            $sheet->write( $row, $col++, $_->innerHTML );
        });
        $row++;
    });

    $workbook->close;
The example simply writes every tr tag it finds into an Excel file. You can easily tailor it to pick up a specific table, or even to start a new Excel file for each table tag.
Further things to consider:
- You may want to pick up th tags to create Excel header(s).
- You may have issues with rowspan and colspan.
To see if rowspan or colspan is being used you can:
    pQuery( $data )->find( 'td' )->each( sub{
        my $number_of_cols_spanned = $_->getAttribute( 'colspan' );
    });
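Once you know a cell's colspan, one simple way to keep the columns aligned is to pad the row with empty cells for the extra columns it covers. A rough sketch in Ruby, assuming each row has already been extracted as [text, colspan] pairs; expand_colspans is a hypothetical helper, not part of any library, and rowspan is not handled here:

```ruby
# Expand one table row into a flat list of column values.
# Each cell is a [text, colspan] pair; a missing colspan is treated as 1.
# expand_colspans is a hypothetical helper, not part of any library.
def expand_colspans(cells)
  cells.flat_map do |text, colspan|
    span = [colspan.to_i, 1].max
    # Keep the text in the first column and pad the rest with empty strings.
    [text] + Array.new(span - 1, '')
  end
end

row = [['Region', '2'], ['Total', nil]]
p expand_colspans(row)
```

Padding with empty strings keeps later cells in the right column; repeating the text in every spanned column would be an equally reasonable choice, depending on how the data will be used.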
/I3az/
OpenOffice.org can open HTML tables. Simply open the HTML file directly, or select and copy the table in your browser and then use Paste Special in OpenOffice.org. It will prompt you for the format, one of which should be HTML. Select that and voila!