ansaurus

Question

How can I parse only part of an HTML file and ignore the rest?

Answer 1

+1 A:

Do you mean the 999th line or the 999th table row?

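The former might be the following. Closing ARGV at the end of each file resets the line counter $. so that the test applies to every file rather than only to the first 999 lines read overall:

perl -ne 'print if $. == 999; close ARGV if eof' /path/to/*.dat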

The latter would involve an HTML parser and some selection logic. A SAX-style (event-driven) parser might be better for fast processing of a large number of files. It probably depends on which version of HTML is used and whether it is "well-formed".

Perl has many XML and HTML parsers. Did you have any particular module in mind?
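As a rough illustration of the event-driven idea, here is a minimal sketch using HTML::Parser, which is callback-based rather than a true SAX interface. It assumes the wanted data is the first row of the first table that has a bgcolor attribute, and that this table contains no nested tables. The sub name first_row_of_colored_table is made up for this example. The end handler calls eof to abort parsing, so everything after the first matching row is ignored.

```perl
use strict;
use warnings;
use HTML::Parser;

# Collect the text of the first row of the first table that has a
# bgcolor attribute, then stop parsing so the rest of the input is skipped.
# Assumption: the wanted table holds no nested tables.
sub first_row_of_colored_table {
    my ($html) = @_;
    my ($in_table, $in_row, $text) = (0, 0, '');

    my $parser = HTML::Parser->new(
        api_version => 3,
        start_h => [ sub {
            my ($tag, $attr) = @_;
            $in_table = 1 if $tag eq 'table' && exists $attr->{bgcolor};
            $in_row   = 1 if $tag eq 'tr' && $in_table;
        }, 'tagname, attr' ],
        text_h => [ sub { $text .= $_[0] if $in_row }, 'dtext' ],
        end_h => [ sub {
            my ($self, $tag) = @_;
            # Calling eof inside a handler aborts the parse.
            $self->eof if $tag eq 'tr' && $in_row;
        }, 'self, tagname' ],
    );

    # parse() returns false when a handler aborted; only then skip the final eof.
    $parser->eof if $parser->parse($html);
    return $text;
}
```

Called with the content of one file, the sub returns the concatenated cell text of that row, or an empty string if no such table is found.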

EDIT:

Your problem seems to be your XPath expression. The actual HTML is much more complex than your XPath suggests. The following expression works better:

#!/usr/bin/perl
use strict;
use warnings;
use LWP::Simple;
use HTML::TreeBuilder::XPath;

#
# replace this with a loop over 5000 existing files
#
my $url = 'http://www.kultusportal-bw.de/'.
          'servlet/PB/menu/1188427/index.html'.
          '?COMPLETEHREF='.
          'http://www.kultus-bw.de/'.
          'did_abfrage/detail.php?id=04313488';
my $html = get($url) or die "Could not fetch $url\n";

my $tree = HTML::TreeBuilder::XPath->new();
#
# within the loop process the html like this
#
$tree->parse($html);
$tree->eof;
print $tree->findvalue('//table[@bgcolor]/tr[1]');

Try copying the code above into a file and running it with Perl.
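The comments in the script mention a loop over the existing files. A minimal sketch of that loop might look like the following. It assumes the pages are already saved under a hypothetical pages/ directory, and the sub name first_row_text is made up for this example. Each file gets a fresh tree, because a tree should not be parsed into again once parsing has finished, and the tree is deleted afterwards so that memory does not pile up over 5000 files.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use HTML::TreeBuilder::XPath;

# Return the text of the first row of the first table that carries a
# bgcolor attribute, or undef if the file has no such row.
sub first_row_text {
    my ($file) = @_;
    my $tree = HTML::TreeBuilder::XPath->new;    # fresh tree for every file
    $tree->parse_file($file);
    my ($row) = $tree->findnodes('//table[@bgcolor]/tr[1]');
    my $text = $row ? $row->as_text : undef;
    $tree->delete;                               # release the tree before the next file
    return $text;
}

# Hypothetical location of the saved pages; adjust to your own layout.
for my $file (glob 'pages/*.html') {
    my $text = first_row_text($file);
    print "$file: ", (defined $text ? $text : '(no match)'), "\n";
}
```

Returning undef for a file without a match keeps the loop running, so a single odd file does not stop the whole batch.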

RedGrittyBrick 2010-10-16 00:02:06

use HTML::TreeBuilder::XPath;
my $tree = HTML::TreeBuilder::XPath->new;
# use real file name here
open(my $fh, "<", "file.html") or die $!;
$tree->parse_file($fh);
my ($name) = $tree->findnodes(qq{/html/body/table/tr[1]/td[2]});
print $name->as_text;

The example site: http://www.kultusportal-bw.de/servlet/PB/menu/1188427/index.html?COMPLETEHREF=http://www.kultus-bw.de/did_abfrage/detail.php?id=04313488

In the grey shaded block you see the wanted information: 17 lines. Note: I have 5000 HTML files.

thebutcher 2010-10-16 09:40:32

Hello RedGrittyBrick: I think I now understand your code. You did the trick with the color! You solved the issue by working with the color of the grey shaded block. Is that right? Great job! I am overwhelmed. Congrats. Greetings, Martin

thebutcher 2010-10-16 17:19:03

If you're going to show code, please update your question. Forcing people to read code in comments is cruel.

brian d foy 2010-10-16 19:17:52

Hello brian d foy, thanks for the post. I agree. Being a novice, I have a lot to learn! Greetings

thebutcher 2010-10-16 20:16:12

@Martin, yes. The HTML has several tables, so you need to specify a table attribute that uniquely identifies the table you are interested in. I found it worth reading the W3C tutorials on XPath expressions.

RedGrittyBrick 2010-10-16 22:37:02
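To make the point about attributes concrete, here is a small sketch on made-up HTML. It assumes the wanted table sits inside a layout table, which is only a guess at how the real page is built. The absolute path from the earlier comment lands on a cell of the outer layout table, while the attribute predicate goes straight to the table with the bgcolor attribute.

```perl
use strict;
use warnings;
use HTML::TreeBuilder::XPath;

# Made-up page: a layout table whose second cell holds the wanted table.
my $html = <<'HTML';
<html><body>
<table><tr><td>menu</td><td>
  <table bgcolor="#dddddd">
    <tr><td>Name</td><td>Example School</td></tr>
    <tr><td>City</td><td>Example Town</td></tr>
  </table>
</td></tr></table>
</body></html>
HTML

my $tree = HTML::TreeBuilder::XPath->new;
$tree->parse($html);
$tree->eof;

# Absolute path: matches the second cell of the outer layout table,
# so the value is the text of the whole nested table.
my $by_path = $tree->findvalue('/html/body/table/tr[1]/td[2]');

# Attribute predicate: selects the table by a property that identifies it.
my $by_attribute = $tree->findvalue('//table[@bgcolor]/tr[1]/td[2]');

print "by path:      $by_path\n";
print "by attribute: $by_attribute\n";
$tree->delete;
```

If several tables on a page have a bgcolor attribute, a stricter predicate such as one that compares the attribute value would be needed.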

Answer 2

A:

Hello RedGrittyBrick, many thanks for writing. Great to hear from you again!

I am trying to verify your posting and to understand it. I understand the usage of the modules and the idea of the loop over the 5000 files.

I am still working through the code and will try it out at the weekend. I will come back and report my findings.

Best regards

#!/usr/bin/perl
use strict;
use warnings;
use LWP::Simple;
use HTML::TreeBuilder::XPath;

#
# replace this with a loop over 5000 existing files
#
my $url = 'http://www.kultusportal-bw.de/'.
          'servlet/PB/menu/1188427/index.html'.
          '?COMPLETEHREF='.
          'http://www.kultus-bw.de/'.
          'did_abfrage/detail.php?id=04313488';
my $html = get $url;

my $tree = HTML::TreeBuilder::XPath->new();
#
# within the loop process the html like this
#
$tree->parse($html);
$tree->eof;
print $tree->findvalue('//table[@bgcolor]/tr[1]');

thebutcher 2010-10-16 17:00:17

This should be edited into your question, not posted as an answer.

Ether 2010-10-16 20:44:12
