In each of 5,000 HTML files I have to get only one line of text, which is line 999. How can I tell the HTML::Parser that I only have to get line 999?

</p><h1>dataset 1:</h1>

&nbsp;<table border="0" bgcolor="#EFEFEF"  leftmargin="15" topmargin="5"><tr>  
<td><strong>name:</strong>&nbsp;</td>  <td width=500> myname one         </td></tr><tr>  
<td><strong>type:</strong>&nbsp;</td>  <td width=500>       type_one  (04313488)        </td></tr><tr>
<td><strong>aresss:</strong>&nbsp;</td><td>Friedrichstr. 70,&nbsp;73430&nbsp;Madrid</td></tr><tr>  
<td><strong>adresse_two:</strong>&nbsp;</td>  <td>          no_value        </td></tr><tr>  
<td><strong>telefone:</strong>&nbsp;</td>  <td>         0000736111/680040        </td></tr><tr>  
<td><strong>Fax:</strong>&nbsp;</td>  <td>          0000736111/680040        </td></tr><tr>  
<td><strong>E-Mail:</strong>&nbsp;</td>  <td>       Keine Angabe        </td></tr><tr>      
<td><strong>Internet:</strong>&nbsp;</td><td><a href="http://www.mysite.es" target="_blank">www.mysite.es</a><br></td></tr><tr> <td><strong>the office:</strong>&nbsp;</td>   
<td><a href="http://www.mysite_two" target="_blank">mysite_two </a><br></td></tr><tr> 
<td><strong>:</strong>&nbsp;</td><td> no_value </td></tr><tr> 
<td><strong>officer:</strong>&nbsp;</td>  <td> no_value        </td>  </td></tr><tr>
<td><strong>employees:</strong>&nbsp;</td>  <td> 259        </td></tr><tr>  
<td><strong>offices:</strong>&nbsp;</td>  <td>     8        </td></tr><tr>  
<td><strong>worker:</strong>&nbsp;</td>  <td>     no_value        </td></tr><tr>  
<td><strong>country:</strong>&nbsp;</td>  <td>    contryname        </td></tr><tr>  
<td><strong>the_council:</strong>&nbsp;</td>  <td> 

So the question is: can I search the 5,000 files knowing that only line 999 is of interest? In other words, can I tell the HTML parser to look at (and extract) exactly line 999?


Hello dear RedGrittyBrick - I have little experience with HTML::TokeParser. This is what I have tried so far:

use HTML::TreeBuilder::XPath;

my $tree = HTML::TreeBuilder::XPath->new;

#use real file name here
open(my $fh, "<", "file.html") or die $!;

$tree->parse_file($fh);

my ($name) = $tree->findnodes(qq{/html/body/table/tr[1]/td[2]});

print $name->as_text;

BTW, RedGrittyBrick: see one of the example sites: http://www.kultusportal-bw.de/servlet/PB/menu/1188427/index.html?COMPLETEHREF=http://www.kultus-bw.de/did_abfrage/detail.php?id=04313488 The grey shaded block contains the wanted information: 17 lines. Note that I have 5,000 different HTML files, and all of them are structured in exactly the same way.

That means I would be happy to have a template that can be run with HTML::TokeParser::Simple and DBI.

I would love to get some hints.

+1  A: 

Do you mean the 999th line or the 999th table row?

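The former might be

perl -ne 'print if $. == 999; close ARGV if eof' /path/to/*.dat

(The "close ARGV if eof" resets the line counter $. at the end of each file; without it, $. keeps counting across all the files and only one line would be printed in total.)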

The latter would involve an HTML parser and some selection logic. A SAX parser might be better for fast processing of a large number of files. Which approach fits best probably depends on which version of HTML is used and whether it is well-formed.

Perl has many XML and HTML parsers - did you have any particular module in mind?
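If the line-based reading is the right one, the one-liner can also be written out as a small script. The following is only a minimal sketch using core Perl; the helper name nth_line and taking the file names from the command line are assumptions made for illustration, not something from the question.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: return the given (1-based) line of a file,
# or nothing (undef in scalar context) if the file is shorter.
sub nth_line {
    my ($path, $wanted) = @_;
    open(my $fh, '<', $path) or die "Cannot open $path: $!";
    while (my $line = <$fh>) {
        # $. is the line number of the handle read last, so it
        # starts again from 1 for every newly opened file.
        if ($. == $wanted) {
            close $fh;
            return $line;
        }
    }
    close $fh;
    return;
}

# Assumed usage: the files to scan are passed on the command line.
for my $path (@ARGV) {
    my $line = nth_line($path, 999);
    print "$path: $line" if defined $line;
}
```

Stopping at the wanted line avoids reading the rest of each file, which can add up over 5,000 files.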


EDIT:

Your problem seems to be your XPath expression. The actual HTML is much more complex than your XPath suggests. The following expression works better:

#!/usr/bin/perl
use strict;
use warnings;
use LWP::Simple;
use HTML::TreeBuilder::XPath;

#
# replace this with a loop over 5000 existing files
#
my $url = 'http://www.kultusportal-bw.de/'.
          'servlet/PB/menu/1188427/index.html'.
          '?COMPLETEHREF='.
          'http://www.kultus-bw.de/'.
          'did_abfrage/detail.php?id=04313488';
my $html = get($url) or die "Could not fetch $url\n";

my $tree = HTML::TreeBuilder::XPath->new();
#
# within the loop process the html like this
#
$tree->parse($html);
$tree->eof;
print $tree->findvalue('//table[@bgcolor]/tr[1]');

Try cutting the above and pasting into a file then running it with Perl.
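The "loop over 5000 existing files" mentioned in the comments could look roughly like the sketch below. It is not a tested solution for the real pages: the directory name pages/ and the helper name extract_rows are hypothetical, and it assumes every file contains a table with a bgcolor attribute whose rows each hold a label cell and a value cell, as in the sample from the question.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use HTML::TreeBuilder::XPath;

# Hypothetical helper: return a list of [label, value] pairs taken
# from the rows of the table that carries a bgcolor attribute.
sub extract_rows {
    my ($file) = @_;
    open(my $fh, '<', $file) or die "Cannot open $file: $!";
    my $tree = HTML::TreeBuilder::XPath->new;
    $tree->parse_file($fh);
    close $fh;

    my @rows;
    for my $row ($tree->findnodes('//table[@bgcolor]/tr')) {
        my @cells = $row->findnodes('./td');
        next unless @cells >= 2;    # skip empty or incomplete rows
        # extra_chars also trims the non-breaking space left by &nbsp;
        push @rows, [ map { $_->as_trimmed_text(extra_chars => '\xA0') }
                      @cells[0, 1] ];
    }
    $tree->delete;                  # free the tree before the next file
    return @rows;
}

# Assumed location of the saved pages; adjust to the real directory.
for my $file (glob('pages/*.html')) {
    for my $pair (extract_rows($file)) {
        print join("\t", $file, @$pair), "\n";
    }
}
```

Each output line then holds the file name, the label and the value separated by tabs, which should be straightforward to feed into a database later.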

RedGrittyBrick
use HTML::TreeBuilder::XPath;
my $tree = HTML::TreeBuilder::XPath->new;
#use real file name here
open(my $fh, "<", "file.html") or die $!;
$tree->parse_file($fh);
my ($name) = $tree->findnodes(qq{/html/body/table/tr[1]/td[2]});
print $name->as_text;
the example sites: http://www.kultusportal-bw.de/servlet/PB/menu/1188427/index.html?COMPLETEHREF=http://www.kultus-bw.de/did_abfrage/detail.php?id=04313488 in the grey shaded block you see the wanted information: 17 lines that are wanted. Note - I have 5000 HTML files -
thebutcher
Hello RedGrittyBrick: I think I now understand your code. You did the trick with the color! You solved the problem by selecting the table through its grey background color. Is that right? Great job! I am overwhelmed. Congrats. Greetings, Martin
thebutcher
If you're going to show code, please update your question. Forcing people to read code in comments is cruel.
brian d foy
Hello brian d foy - thanks for the post. I agree. Being a novice, I have a lot to learn! - Greetings
thebutcher
@Martin, yes - the HTML has several tables, therefore specify a table attribute that uniquely identifies which table you are interested in. I found it worth reading the W3C tutorials on XPath expressions.
RedGrittyBrick
A: 

Hello RedGrittyBrick, many thanks for writing. Great to hear from you again!

I am trying to verify your post and to understand it. I understand the usage of the modules and the idea of the loop over the 5,000 files.

I am still working through the code and will try it out at the weekend. I will come back and report all my findings.

Best regards

thebutcher
This should be edited into your question, not posted as an answer.
Ether