ansaurus

Question

How can I extract data from HTML tables in Perl?

Answer 1

+10 A:

Do NOT use regexps to parse HTML. There are a very large number of CPAN modules which do this for you much more effectively.

Ether 2009-10-30 17:42:00

In this case the requested parsing is rather simple though.

Kinopiko 2009-10-30 18:10:33

@Ether It seems to me some people enjoy torturing themselves. I don't know why.

Sinan Ünür 2009-10-30 19:51:36

@Sinan: My theory is that there is a special kind of learning curve with regexes: at first they seem so mind-blowing that there's nothing they can't (or shouldn't) do. Anything that looks like a parsing problem therefore *must* be solvable with regexes.

Ether 2009-10-31 17:49:21

Answer 2

+1 A:

That's an easy one:

my $html = '<tr class="Highlight"><td>Time Played</a></td><td></td><td>Artist</td><td width="1%"></td><td>Title</td><td>Label</td></tr>';
my @stuff = $html =~ />([^<]+)</g;
print join (", ", @stuff), "\n";

See http://codepad.org/qz9d5Bro if you want to try running it.

Kinopiko 2009-10-30 18:06:46

Thank you sooo much!!! :)

nick 2009-10-30 18:11:17

Wait until you see the DOWNVOTES I get for telling you this.

Kinopiko 2009-10-30 18:14:14

Why? I tested it and it works... thanks :)

nick 2009-10-30 18:23:48

@nick Because this kind of approach will have you wasting a lot more time and effort, again and again, hunting for just the right regex every time you need to parse HTML.

Sinan Ünür 2009-10-30 20:00:50
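
To make the objection above concrete, here is a minimal sketch using only core Perl. It runs the regex from this answer against the original one-line sample and against a hypothetical reformatted row (the second input, the helper name cells_by_regex, and the variable names are made up for illustration). On the one-line sample it finds the four cell texts; on the reformatted row it also captures whitespace-only text between tags and leaves the entity undecoded.

```perl
use strict;
use warnings;

# The pattern from the answer above: capture any text between '>' and '<'.
sub cells_by_regex {
    my ($html) = @_;
    return $html =~ />([^<]+)</g;
}

# The original one-line sample row.
my $one_line = '<tr class="Highlight"><td>Time Played</a></td><td></td>'
             . '<td>Artist</td><td width="1%"></td><td>Title</td><td>Label</td></tr>';
my @ok = cells_by_regex($one_line);

# Hypothetical variation: pretty-printed markup containing an entity.
my $pretty = "<tr>\n  <td>Artist</td>\n  <td>Rock &amp; Roll</td>\n</tr>";
my @odd = cells_by_regex($pretty);

print scalar(@ok),  " matches on the one-line sample\n";
print scalar(@odd), " matches on the reformatted sample\n";

# Show each match from the second sample in brackets so the
# whitespace-only captures are visible.
print "[$_]\n" for @odd;
```

The same data laid out slightly differently needs a different regex or extra filtering, which is the maintenance cost the comment describes.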

Parsing JSON with regular expressions is just as hard as parsing HTML. Yet one of the people in a previous discussion, http://stackoverflow.com/questions/1598053/how-can-i-remove-external-links-from-html-using-perl/1598069#1598069, who was most dogmatic about not using regexes to parse HTML, then went on to approve of a solution that used regexes to parse JSON: http://stackoverflow.com/questions/1636352/using-regular-expressions-in-shell-script/1636508#1636508.

Kinopiko 2009-10-31 00:28:46

Sorry, the above link is slightly wrong: http://stackoverflow.com/questions/1636352/using-regular-expressions-in-shell-script/

Kinopiko 2009-10-31 00:58:52

Well, I cannot speak for others. I do think using regular expressions was a waste of time in that case as well, so I added a Perl one-liner using `JSON.pm` to that thread.

Sinan Ünür 2009-10-31 01:06:57

And upvoted your answer in that thread.

Sinan Ünür 2009-10-31 01:11:41

Thanks. 15 characters.

Kinopiko 2009-10-31 01:36:14

@Kinopiko, it appears that too few people on SO understand the Chomsky Hierarchy. Parsing JSON with regexes is foolish, even more so than HTML, since a real parser is available that is so much simpler to use than any half-assed regex solution could ever hope to be. This demonstrates the value of CS in educating programmers.

daotoad 2009-10-31 07:35:06
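
As a rough illustration of how little code a real JSON parser needs, here is a minimal sketch. It assumes JSON::PP (bundled with Perl since 5.14) rather than the `JSON.pm` mentioned above, and the JSON payload and field names are invented for the example. Note the escaped quotes inside the title, which are the sort of thing a hand-written regex tends to get wrong.

```perl
use strict;
use warnings;
use JSON::PP;    # core module since Perl 5.14; stands in for JSON.pm here

# Hypothetical payload, made up for this example. In a single-quoted Perl
# string the \" sequences reach the parser unchanged as JSON escapes.
my $json = '{"artist":"Some Artist","tracks":[{"title":"A \"quoted\" title","plays":3}]}';

# decode_json turns the JSON text into nested Perl hashes and arrays.
my $data = decode_json($json);

my $title = $data->{tracks}[0]{title};
my $plays = $data->{tracks}[0]{plays};

print "$title ($plays plays)\n";
```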

Answer 3

+4 A:

Use HTML::TableExtract. Really.

#!/usr/bin/perl

use strict;
use warnings;

use HTML::TableExtract;
use LWP::Simple;

my $file = 'Table3.htm';
unless ( -e $file ) {
    my $rc = getstore(
        'http://www.ntsb.gov/aviation/Table3.htm',
        $file);
    die "Failed to download document\n" unless $rc == 200;
}

my @headers = qw( Year Fatalities );

my $te = HTML::TableExtract->new(
    headers => \@headers,
    attribs => { id => 'myTable' },
);

$te->parse_file($file);

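# Take the first table that matched the headers and attributes above.
my ($table) = $te->tables;
die "No matching table found in $file\n" unless $table;

print join("\t", @headers), "\n";

for my $row ( $table->rows ) {
    # Empty cells may come back undefined; print them as empty strings.
    print join("\t", map { defined $_ ? $_ : '' } @$row), "\n";
}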

This is what I meant in another post by "task-specific" HTML parsers.

You could have saved a lot of time by directing your energy to reading some documentation rather than throwing regexes at the wall and seeing if any stuck.

Sinan Ünür 2009-10-30 19:43:13
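
For the table row quoted in Answer 2, the same module can be pointed at an in-memory string instead of a downloaded file. The following is a minimal sketch, assuming HTML::TableExtract is installed from CPAN; the enclosing table markup and the data row are hypothetical, added only so there is something to extract, and the stray closing anchor tag from the original sample is omitted.

```perl
use strict;
use warnings;
use HTML::TableExtract;    # CPAN module, not part of core Perl

# Hypothetical table: the header row from the question plus one made-up data row.
my $html = <<'HTML';
<table>
<tr class="Highlight"><td>Time Played</td><td></td><td>Artist</td><td width="1%"></td><td>Title</td><td>Label</td></tr>
<tr><td>10:05</td><td></td><td>Some Artist</td><td></td><td>Some Title</td><td>Some Label</td></tr>
</table>
HTML

# Select columns by header text; the unlabeled spacer columns are not requested.
my @headers = ( 'Time Played', 'Artist', 'Title', 'Label' );
my $te = HTML::TableExtract->new( headers => \@headers );
$te->parse($html);
$te->eof;    # signal end of input to the underlying parser

# Collect every row of every matching table, replacing undefined cells with ''.
my @rows;
for my $table ( $te->tables ) {
    for my $row ( $table->rows ) {
        push @rows, [ map { defined $_ ? $_ : '' } @$row ];
    }
}

print join( "\t", @headers ), "\n";
print join( "\t", @$_ ), "\n" for @rows;
```

Selecting by header text rather than column position means the empty spacer cells in the original markup do not need special handling.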
