I have to parse 5000 files, which all look pretty much identical.
I would like to use HTML::TokeParser::Simple and DBI to do the parsing job and store the results.
I have a little experience with HTML::TokeParser::Simple, but this task goes over my head.
Note: I have also looked at the XPath-based ideas, which seem to be an appropriate way too. But at the moment I have trouble getting the corresponding XPath expressions: I tried to determine the expressions that need to be filled into the Perl program.
This is what I have right now:
use strict;
use warnings;
use HTML::TreeBuilder::XPath;
my $tree = HTML::TreeBuilder::XPath->new;
# use the real file name here
open(my $fh, "<", "file.html") or die $!;
$tree->parse_file($fh);
# NOTE: every expression below is still the same placeholder; each field needs its own XPath
my ($name) = $tree->findnodes(qq{/html/body/table/tr[1]/td[2]});
my ($type) = $tree->findnodes(qq{/html/body/table/tr[1]/td[2]});
my ($adress) = $tree->findnodes(qq{/html/body/table/tr[1]/td[2]});
my ($adress_two) = $tree->findnodes(qq{/html/body/table/tr[1]/td[2]});
my ($telephone) = $tree->findnodes(qq{/html/body/table/tr[1]/td[2]});
my ($fax) = $tree->findnodes(qq{/html/body/table/tr[1]/td[2]});
my ($internet) = $tree->findnodes(qq{/html/body/table/tr[1]/td[2]});
my ($officer) = $tree->findnodes(qq{/html/body/table/tr[1]/td[2]});
my ($employees) = $tree->findnodes(qq{/html/body/table/tr[1]/td[2]});
my ($offices) = $tree->findnodes(qq{/html/body/table/tr[1]/td[2]});
my ($worker) = $tree->findnodes(qq{/html/body/table/tr[1]/td[2]});
my ($country) = $tree->findnodes(qq{/html/body/table/tr[1]/td[2]});
my ($the_council)= $tree->findnodes(qq{/html/body/table/tr[1]/td[2]});
# Print each extracted field on its own line
print $_->as_text, "\n"
    for ($name, $type, $adress, $adress_two, $telephone, $fax, $internet,
         $officer, $employees, $offices, $worker, $country, $the_council);
Is this all right? Note: I want to store the results in a database.
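Since all thirteen expressions above are still identical, one possible direction is to generate one expression per field. The following is only a minimal sketch: it assumes (hypothetically) that the fields sit in consecutive rows of the first table, with the value in the second cell of each row. The row layout and the helper names build_xpaths and extract_fields are assumptions for illustration, not something confirmed by the actual files.

```perl
use strict;
use warnings;

# Field names in the order they are ASSUMED to appear, one per table row.
my @FIELDS = qw(name type adress adress_two telephone fax internet
                officer employees offices worker country the_council);

# Build a hash of field name => XPath, assuming row N holds field N
# and the value sits in the second cell (td[2]) of that row.
sub build_xpaths {
    my @fields = @_;
    my %xpath;
    for my $i (0 .. $#fields) {
        my $row = $i + 1;    # XPath positions start at 1
        $xpath{ $fields[$i] } = "/html/body/table/tr[$row]/td[2]";
    }
    return %xpath;
}

# Extract all fields from one file; returns a hash of field name => text.
sub extract_fields {
    my ($file) = @_;
    require HTML::TreeBuilder::XPath;    # loaded only when a file is parsed
    my $tree = HTML::TreeBuilder::XPath->new;
    $tree->parse_file($file);
    my %xpath = build_xpaths(@FIELDS);
    my %record;
    for my $field (@FIELDS) {
        # findvalue returns the text content of whatever the expression matches
        $record{$field} = $tree->findvalue( $xpath{$field} );
    }
    $tree->delete;    # free the tree before parsing the next file
    return %record;
}
```

Calling extract_fields("file.html") would then give one hash per file, which could be repeated in a loop over all 5000 files once the real expressions are known.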
BTW: See one of the example sites:
In the grey-shaded block you can see the wanted information: 17 lines. Note: I have 5000 different HTML files, all structured in exactly the same way!
That means I would be happy to have a template that can be run with HTML::TokeParser::Simple and DBI.
Can I make use of the above-mentioned code, or do I have to change it?
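For the database side, a rough sketch of what the storing step might look like with DBI placeholders. The table name, the column names and the helpers build_insert and store_record are hypothetical, and the table is assumed to exist already with matching columns.

```perl
use strict;
use warnings;

# Build a parameterised INSERT for the given table and column names.
sub build_insert {
    my ($table, @cols) = @_;
    my $col_list     = join ', ', @cols;
    my $placeholders = join ', ', ('?') x @cols;
    return "INSERT INTO $table ($col_list) VALUES ($placeholders)";
}

# Store one record (hash ref of column => value) through an open DBI handle.
sub store_record {
    my ($dbh, $table, $record) = @_;
    my @cols = sort keys %$record;          # fixed column order
    my $sth  = $dbh->prepare( build_insert($table, @cols) );
    $sth->execute( @{$record}{@cols} );     # values bound to the placeholders
    return;
}

# Usage sketch (needs DBI plus a driver such as DBD::SQLite; names are hypothetical):
#   require DBI;
#   my $dbh = DBI->connect("dbi:SQLite:dbname=records.db", "", "", { RaiseError => 1 });
#   store_record($dbh, 'records', { name => 'Example', fax => '123' });
```

Using placeholders rather than interpolating the values into the SQL string means the extracted text does not need to be quoted by hand.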
Love to hear from you! That would be great!!