Possible duplicate:
Can you provide an example of parsing HTML with your favorite parser?
How can I extract content from HTML files using Perl?


I'm trying to use regular expressions in Perl to parse a table with the following structure. The first line is as follows:

<tr class="Highlight"><td>Time Played</a></td><td></td><td>Artist</td><td width="1%"></td><td>Title</td><td>Label</td></tr>

Here I wish to take out "Time Played", "Artist", "Title", and "Label", and print them to an output file.

Any help would be greatly appreciated!

Ok sorry... I've tried many regular expressions such as:

$lines =~ / (<td>) /
       OR
$lines =~ / <td>(.*)< /
       OR
$lines =~ / >(.*)< /

My current program looks like so:

#!perl -w

open INPUT_FILE, "<", "FIRST_LINE_OF_OUTPUT.txt" or die $!;

open OUTPUT_FILE, ">>", "PLAYLIST_TABLE.txt" or die $!;

my $lines = join '', <INPUT_FILE>;

print "Hello 2\n";

if ($lines =~ / (\S.*\S) /) {
print "this is 1: \n";
print $1;
    if ($lines =~ / <td>(.*)< / ) {
    print "this is the 2nd 1: \n";
    print $1;
    print "the word was: $1.\n";
    $Time = $1;
    print $Time;
    print OUTPUT_FILE $Time;
    } else {
    print "2ND IF FAILED\n";
    }
} else { 
print "THIS FAILED\n";
}

close(INPUT_FILE);
close(OUTPUT_FILE);
+10  A: 

Do NOT use regexps to parse HTML. There are a very large number of CPAN modules which do this for you much more effectively.

Ether
In this case the requested parsing is rather simple though.
Kinopiko
@Ether It seems to me some people enjoy torturing themselves. I don't know why.
Sinan Ünür
@Sinan: My theory is that there is a special kind of learning curve with regexes: at first they seem so mind-blowing that there's nothing they can't (or shouldn't) do. Anything that looks like a parsing problem therefore *must* be solvable with regexes.
Ether
+1  A: 

That's an easy one:

my $html = '<tr class="Highlight"><td>Time Played</a></td><td></td><td>Artist</td><td width="1%"></td><td>Title</td><td>Label</td></tr>';
my @stuff = $html =~ />([^<]+)</g;
print join (", ", @stuff), "\n";

See http://codepad.org/qz9d5Bro if you want to try running it.
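
For what it's worth, here is a rough sketch of why the attempts in the question did not match and where the pattern above stops being reliable. It assumes the row is exactly the string shown in the question; the second, pretty-printed row and all the variable names are made up for illustration.

```perl
#!/usr/bin/perl
use strict;
use warnings;

my $html = '<tr class="Highlight"><td>Time Played</a></td><td></td><td>Artist</td><td width="1%"></td><td>Title</td><td>Label</td></tr>';

# Attempt from the question: without /x, the spaces inside / ... / are
# literal, so this needs a space directly before "<td>". The row has none.
my $spaced_matched = ( $html =~ / <td>(.*)< / ) ? 1 : 0;
print "spaced pattern matched: $spaced_matched\n";

# Dropping the spaces makes it match, but .* is greedy: it runs from the
# first <td> to the last "<" in the string and swallows most of the row.
my ($greedy) = $html =~ /<td>(.*)</;
print "greedy capture: $greedy\n";

# [^<]+ cannot cross a "<", so each capture stays inside one cell, and
# /g in list context returns every capture at once.
my @fields = $html =~ />([^<]+)</g;
print join( ", ", @fields ), "\n";

# Hypothetical variation: the same kind of row, pretty-printed.
# The whitespace between tags now gets captured as well.
my $pretty = "<tr>\n  <td>Artist</td>\n  <td>Title</td>\n</tr>";
my @pretty_fields = $pretty =~ />([^<]+)</g;
print scalar(@pretty_fields), " captures from the pretty-printed row\n";
```

So the one-liner works for this exact line, but it depends on there being no whitespace between the tags (and no nested markup inside the cells).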

Kinopiko
Thank you sooo much!!! :)
nick
Wait until you see the DOWNVOTES I get for telling you this.
Kinopiko
why? and i tested it works... thanks :)
nick
@nick because this is the kind of approach that will keep one wasting a lot more time and effort again and again always looking for just the right regex each time one needs to parse HTML.
Sinan Ünür
Parsing JSON with regular expressions is just as hard as parsing HTML, and yet one of the people on a previous discussion, http://stackoverflow.com/questions/1598053/how-can-i-remove-external-links-from-html-using-perl/1598069#1598069, who was most dogmatic about not using regexes for parsing HTML then went on to approve of a solution to a problem which involved using regexes to parse JSON: http://stackoverflow.com/questions/1636352/using-regular-expressions-in-shell-script/1636508#1636508.
Kinopiko
Sorry, the above link is slightly wrong: http://stackoverflow.com/questions/1636352/using-regular-expressions-in-shell-script/
Kinopiko
Well, I cannot speak for others. I do think using regular expressions was a waste of time in that case as well. So, I added a Perl one liner using `JSON.pm` to that thread.
Sinan Ünür
And upvoted your answer in that thread.
Sinan Ünür
Thanks. 15 characters.
Kinopiko
@Kinopiko, it appears that too few people on SO understand the Chomsky Hierarchy. Parsing JSON with regexes is foolish, even more so than HTML, since a real parser is available that is so much simpler to use than any half-assed regex solution could ever hope to be. This demonstrates the value of CS in educating programmers.
daotoad
+4  A: 

Use HTML::TableExtract. Really.

#!/usr/bin/perl

use strict;
use warnings;

use HTML::TableExtract;
use LWP::Simple;

my $file = 'Table3.htm';
unless ( -e $file ) {
    my $rc = getstore(
        'http://www.ntsb.gov/aviation/Table3.htm',
        $file);
    die "Failed to download document\n" unless $rc == 200;
}

my @headers = qw( Year Fatalities );

my $te = HTML::TableExtract->new(
    headers => \@headers,
    attribs => { id => 'myTable' },
);

$te->parse_file($file);

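# Take the first matching table and stop early if nothing matched.
my ($table) = $te->tables;
die "No matching table found in $file\n" unless $table;

print join("\t", @headers), "\n";

for my $row ( $table->rows ) {
    print join("\t", @$row), "\n";
}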

This is what I meant in another post by "task-specific" HTML parsers.

You could have saved a lot of time by directing your energy to reading some documentation rather than throwing regexes at the wall and seeing if any stuck.
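
As a rough sketch of how the same module might be applied to the row from the question: this assumes the row sits inside a <table> element, which the snippet in the question does not show. The wrapping <table> tags and the variable names are my assumptions, not something taken from the original page.

```perl
#!/usr/bin/perl
use strict;
use warnings;

use HTML::TableExtract;

# Row copied from the question; the enclosing <table> tags are an assumption.
my $html = '<table><tr class="Highlight"><td>Time Played</a></td><td></td>'
         . '<td>Artist</td><td width="1%"></td><td>Title</td><td>Label</td></tr></table>';

# With no headers, depth, or count constraints, every table is matched.
my $te = HTML::TableExtract->new;
$te->parse($html);

my @fields;
for my $table ( $te->tables ) {
    for my $row ( $table->rows ) {
        # Skip the empty spacer cells, which may come back undefined or empty.
        push @fields, grep { defined($_) && /\S/ } @$row;
    }
}

print join( ", ", @fields ), "\n";
```

The module does the tag handling, so the stray </a> and the width attribute need no special treatment in the script itself.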

Sinan Ünür