HTML::TreeBuilder::XPath: identifying the XPath expression

Hello and good evening, dear coders on Stack Overflow!

Finally back again!

I am currently working on a parser script: I have to parse all the detail pages of this site: [link text][1]. Note: it is a very big and powerful Swiss site, a governmental server with lots of power! There are several ways to do it. I have to get rid of a lot of clutter and use only the text data from the page. The page itself is very simple. Take this example page:

Altes Schulhaus Ossingen
Guntibachstrasse 10
8475  Ossingen
[email protected]
Tel:052 317 15 45
Fax:052 317 04 42

So I need a little Perl script to get these [B]six lines[/B] of text out of the HTML page. How do we do that? Personally, I like HTML::TreeBuilder::XPath, which has to be installed from CPAN. Here is how we would extract the name from one of the files with it:

[B]Note:[/B] I am not sure which arguments (XPath expressions) I have to use! See my attempts below:

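use strict;
use warnings;
use HTML::TreeBuilder::XPath;

my $tree = HTML::TreeBuilder::XPath->new;

# use the real file name here
open(my $fh, "<", "file.html") or die "Cannot open file.html: $!";

# build the tree from the saved HTML page
$tree->parse_file($fh);
close($fh);

# the XPath expression is the argument I am unsure about
my ($name) = $tree->findnodes(qq{/html/body/table/tr[1]/td[2]});

# findnodes returns an empty list when nothing matches
print $name->as_text, "\n" if defined $name;

# free the memory used by the tree
$tree->delete;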

As you can see, I have some problems with the arguments. We simply use an XPath expression to identify the node we want. [B]So how do we determine it?[/B] I tried a Firefox plugin called XPather, which lets us click on an HTML element and extract the corresponding XPath. So we load the file we want to parse in Firefox, click on the element we want, get the XPath and use it in the Perl script. I am not sure that I did the job with XPather very well. I tried to find the arguments for the detail page of a search result on the Swiss site linked above, which is a very simple page.

See [B]my attempts[/B] below: these are the arguments that I found with XPather. Are they really the arguments that will help me parse the detail result page mentioned above?

 /html/body/div[3]/text()
 /html/body/div[4]/text()
 /html/body/div[6]/text()
 /html/body/div[7]/text()
 /html/body/div[9]/a/text()
 /html/body/div[10]/text()
 /html/body/div[11]/text()[1]
 /html/body/div[11]/text()[2]
 /html/body/div[12]/text()[1]
 /html/body/div[12]/text()[2]
 /html/body/div[13]/text()
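One way to check these expressions is to run each of them against the page and print what comes back. The following is only a minimal sketch: it rebuilds the 13 div elements of the page body by hand (class attributes and the header table are left out, and the image attributes are shortened), so the real page may behave differently. The entry for div[2] is not from XPather; it is added for comparison because, counting the div elements in the page source, the school name appears to sit in the second div.

```perl
use strict;
use warnings;
use HTML::TreeBuilder::XPath;

# Hand-made copy of the page body: one entry per <div>, in page order.
# Assumption: simplified markup, only the div order and text are kept.
my $img  = '<img src="/0.gif" alt="">';
my @divs = (
    ' ',                                        # div[1], spacer
    $img . 'Altes Schulhaus Ossingen    ',      # div[2]
    ' ',                                        # div[3], spacer
    $img . 'Guntibachstrasse 10',               # div[4]
    $img,                                       # div[5]
    $img . '8475  Ossingen',                    # div[6]
    ' ',                                        # div[7], spacer
    $img . '<a href="" target="_blank"></a>',   # div[8]
    $img . '<a href="mailto:">[email protected]</a>',  # div[9]
    ' ',                                        # div[10], spacer
    $img . 'Tel:' . $img . '052 317 15 45 ',    # div[11]
    $img . 'Fax:' . $img . '052 317 04 42 ',    # div[12]
    ' ',                                        # div[13]
);
my $html = '<html><body>'
         . join('', map { "<div>$_</div>" } @divs)
         . '</body></html>';

my $tree = HTML::TreeBuilder::XPath->new;
$tree->parse_content($html);

my @expressions = (
    '/html/body/div[2]/text()',      # added for comparison, not from XPather
    '/html/body/div[3]/text()',
    '/html/body/div[4]/text()',
    '/html/body/div[6]/text()',
    '/html/body/div[7]/text()',
    '/html/body/div[9]/a/text()',
    '/html/body/div[10]/text()',
    '/html/body/div[11]/text()[1]',
    '/html/body/div[11]/text()[2]',
    '/html/body/div[12]/text()[1]',
    '/html/body/div[12]/text()[2]',
    '/html/body/div[13]/text()',
);

# Evaluate every expression and remember the trimmed string value.
my %value_for;
for my $xpath (@expressions) {
    my $value = $tree->findvalue($xpath);
    $value =~ s/^\s+|\s+$//g;
    $value_for{$xpath} = $value;
    printf "%-30s => '%s'\n", $xpath, $value;
}

$tree->delete;
```

Expressions that print an empty value point at spacer divs and can probably be dropped from the list.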

Here is the HTML code of that page:

<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN" "http://www.w3.org/TR/html4/loose.dtd"><html><head><meta name="generator" content="DigiOnline GmbH - WebWeaver 3.4 CMS - http://www.webweaver.de"><title>educa.ch</title><meta http-equiv="Content-Type" content="text/html; charset=iso-8859-1"><link rel="stylesheet" href="101.htm"><script src="102.htm"></script><script language="JavaScript"><!--
var did='d79376';
var root=new Array('d200','d205','d73137','d1566','d79376','d');
var usefocus = 1;
function check() {
if ((self.focus) && (usefocus)) {
self.focus();
}
}
// --></script></head><body bgcolor="#FFFFFF" leftmargin="0" topmargin="0" marginwidth="0" marginheight="0" onload="check();"><table cellspacing="0" cellpadding="0" border="0" width="100%"><tr><td width="15" class="popuphead"><img src="/0.gif" alt="" width="15" height="16"></td><td width="99%" class="popuphead">Adresse - Schulen in der Schweiz</td><td width="20" class="popuphead" valign="middle"><a href="#" title="Print" onclick="window.print(); return false;"><img src="../pics/print16x13.gif" alt="Drucken" width="16" height="13"></a></td><td width="20" class="popuphead" valign="middle"><a href="#" title="close" onclick="window.close(); return false;"><img src="../pics/close21x13.gif" alt="Schliessen" width="21" height="13"></a></td></tr>
<tr bgcolor="#B2B2B2"><td colspan="4"><img src="/0.gif" alt="" width="1" height="1"></td></tr></table><div class="leerzeile"> </div><div class="leerzeile"><img src="/0.gif" alt="" width="15"height="8">Altes Schulhaus Ossingen    </div><div class="leerzeile"> </div><div><img src="/0.gif" alt="" width="15" height="8">Guntibachstrasse 10</div><div><img src="/0.gif" alt="" width="15" height="8"></div><div><img src="/0.gif" alt="" width="15" height="8">8475  Ossingen</div><div class="leerzeile"> </div><div><img src="/0.gif" alt="" width="15" height="8"><a href="" target="_blank"></a></div><div><img src="/0.gif" alt="" width="15" height="8"><a href="mailto: [email protected]">[email protected]</a></div><div class="leerzeile"> </div><div><img src="/0.gif" alt="" width="15" height="8">Tel:<img src="/0.gif" alt="" width="6" height="8">052 317 15 45 </div><div><img src="/0.gif" alt="" width="15" height="8">Fax:<img src="/0.gif" alt="" width="4" height="8">052 317 04 42 </div><div> </div></body></html>
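Looking at this markup, every wanted line seems to live in its own div directly under body, with empty spacer divs in between. So instead of one positional expression per line, it might be enough to select all of these divs and keep the ones that contain text. This is a minimal sketch under that assumption, using a shortened hand-made copy of the body rather than the real file:

```perl
use strict;
use warnings;
use HTML::TreeBuilder::XPath;

# Shortened copy of the page body (assumption: the real page follows the
# same pattern of one <div> per line plus empty spacer divs).
my $img  = '<img src="/0.gif" alt="">';
my $html = '<html><body>'
    . '<div class="leerzeile"> </div>'
    . '<div class="leerzeile">' . $img . 'Altes Schulhaus Ossingen    </div>'
    . '<div>' . $img . 'Guntibachstrasse 10</div>'
    . '<div>' . $img . '</div>'
    . '<div>' . $img . 'Tel:' . $img . '052 317 15 45 </div>'
    . '</body></html>';

my $tree = HTML::TreeBuilder::XPath->new;
$tree->parse_content($html);

# Collect the text of every top-level div, skipping the empty ones.
my @lines;
for my $div ($tree->findnodes('/html/body/div')) {
    my $text = $div->as_text;
    $text =~ s/^[\s\xA0]+|[\s\xA0]+$//g;   # trim spaces and non-breaking spaces
    push @lines, $text if length $text;
}

print "$_\n" for @lines;

$tree->delete;
```

The advantage of this variant is that it does not depend on the exact position of each div, which may differ from one detail page to the next.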

So I hope someone would like to review my little Perl script and help me find the right arguments for it!

Love to hear from you.

[B]BTW[/B]: the other tasks are important as well.

How should I fetch the pages: with LWP, Mechanize or something like that?
How do I store the data in a MySQL database?
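
For these two tasks, one possible shape is LWP::UserAgent for fetching and DBI (with DBD::mysql) for storing. This is only a rough sketch: the sub names fetch_page and store_school, the table and column names, the DSN and the credentials are all hypothetical, and the URL has to be passed on the command line.

```perl
use strict;
use warnings;
use LWP::UserAgent;
use DBI;

# Fetch one page and return its decoded HTML; dies on HTTP errors.
sub fetch_page {
    my ($url) = @_;
    my $ua       = LWP::UserAgent->new(timeout => 30);
    my $response = $ua->get($url);
    die "GET $url failed: " . $response->status_line . "\n"
        unless $response->is_success;
    return $response->decoded_content;
}

# Insert one record with placeholders.
# The table "schools" and its columns are hypothetical.
sub store_school {
    my ($dbh, $school) = @_;
    my $sth = $dbh->prepare(
        'INSERT INTO schools (name, street, city, email, phone, fax) '
      . 'VALUES (?, ?, ?, ?, ?, ?)'
    );
    $sth->execute(@{$school}{qw(name street city email phone fax)});
    return;
}

# Run only when a URL is given, e.g.: perl fetch.pl <detail-page-url>
if (@ARGV) {
    my $html = fetch_page($ARGV[0]);
    print length($html), " characters fetched\n";

    # Hypothetical DSN and credentials.
    my $dbh = DBI->connect('DBI:mysql:database=schools;host=localhost',
                           'user', 'password', { RaiseError => 1 });

    # In a real script this record would come from the XPath extraction.
    store_school($dbh, {
        name   => 'Altes Schulhaus Ossingen',
        street => 'Guntibachstrasse 10',
        city   => '8475 Ossingen',
        email  => undef,               # obfuscated in the example above
        phone  => '052 317 15 45',
        fax    => '052 317 04 42',
    });
    $dbh->disconnect;
}
```

WWW::Mechanize builds on LWP::UserAgent, so it would probably only be needed if the detail pages have to be reached by following links or submitting forms.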

Greetings
