I have to parse 5000 files, which all look nearly identical.

I would like to use HTML::TokeParser::Simple and DBI to do the parsing and store the results.

I have little experience with HTML::TokeParser::Simple, and this task goes over my head. Note: I have also looked at the XPath approach, which seems to be an appropriate way as well. At the moment, though, I have trouble working out the corresponding XPath expressions that need to be filled into the Perl program.

This is what I have right now:

use strict;

use HTML::TreeBuilder::XPath;

my $tree = HTML::TreeBuilder::XPath->new;

#use real file name here
open(my $fh, "<", "file.html") or die $!;

$tree->parse_file($fh);

my ($name)       = $tree->findnodes(qq{/html/body/table/tr[1]/td[2]});
my ($type)       = $tree->findnodes(qq{/html/body/table/tr[1]/td[2]});
my ($adress)     = $tree->findnodes(qq{/html/body/table/tr[1]/td[2]});
my ($adress_two) = $tree->findnodes(qq{/html/body/table/tr[1]/td[2]});
my ($telephone)  = $tree->findnodes(qq{/html/body/table/tr[1]/td[2]});
my ($fax)        = $tree->findnodes(qq{/html/body/table/tr[1]/td[2]});
my ($internet)   = $tree->findnodes(qq{/html/body/table/tr[1]/td[2]});
my ($officer)    = $tree->findnodes(qq{/html/body/table/tr[1]/td[2]});
my ($employees)  = $tree->findnodes(qq{/html/body/table/tr[1]/td[2]});
my ($offices)    = $tree->findnodes(qq{/html/body/table/tr[1]/td[2]});
my ($worker)     = $tree->findnodes(qq{/html/body/table/tr[1]/td[2]});
my ($country)    = $tree->findnodes(qq{/html/body/table/tr[1]/td[2]});
my ($the_council)= $tree->findnodes(qq{/html/body/table/tr[1]/td[2]});


print $name->as_text;
print $type->as_text;
print $adress->as_text;
print $adress_two->as_text;
print $telephone->as_text;
print $fax->as_text;
print $internet->as_text;
print $officer->as_text;
print $employees->as_text;
print $offices->as_text;
print $worker->as_text;
print $country->as_text;
print $the_council->as_text;

Is this all right? Note: I want to store this in a database.

BTW: See one of the example sites:

http://www.kultusportal-bw.de/servlet/PB/menu/1188427/index.html?COMPLETEHREF=http://www.kultus-bw.de/did_abfrage/detail.php?id=04313488

The grey shaded block contains the information I want: 17 lines. Note: I have 5000 different HTML files, and they are all structured in exactly the same way!

That means I would be happy to have a template that can be run with HTML::TokeParser::Simple and DBI.

Can I make use of the code above, or do I have to change it?

Love to hear from you! That would be great!

+3  A: 

Use some HTML::TableExtract magic:

#!/usr/bin/perl

use strict; use warnings;
use HTML::TableExtract;
use YAML;

my $te = HTML::TableExtract->new( attribs => {
    border => 0,
    bgcolor => '#EFEFEF',
    leftmargin => 15,
    topmargin => 5,
});

$te->parse_file('kultus-bw.html');
my ($table) = $te->tables;

for my $row ( $table->rows ) {
    cleanup(@$row);
    print "@$row\n";
}

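sub cleanup {
    # @_ aliases the caller's array, so the cells are edited in place.
    for ( @_ ) {
        s/\s+//;          # delete the first whitespace run (not anchored; \A would limit it to leading whitespace)
        s/[\xa0 ]+\z//;   # strip trailing spaces and non-breaking spaces
        s/\s+/ /g;        # collapse each remaining whitespace run to one space
    }
}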

Output:

Schul-/Behördenname: Abendgymnasium Ostwürttemberg
Schulart: Privatschule (04313488)
Hausadressse: Friedrichstr.70, 73430 Aalen
Postfachadresse: Keine Angabe
Telefon: 07361/680040
Fax: 07361/680040
E-Mail: Keine Angabe
Internet: www.abendgymnasium-ostwuerttemberg.de 
ÜbergeordneteDienststelle: Regierungspräsidium Stuttgart Abteilung 7 Schule und Bildung 
Schulleitung: Keine Angabe
Stellv.Schulleitung: Keine Angabe
AnzahlSchüler: 259
AnzahlKlassen: 8
AnzahlLehrer: Keine Angabe
Kreis: Ostalbkreis
Schulträger: <Verband/Verein> (Verband/Verein) 

Of course, I saved a local copy of the page before running the script.
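
To cover the database part of the question, the same extraction could feed DBI. The following is only a minimal sketch under several assumptions that do not come from the thread: SQLite via DBD::SQLite, a database file called schools.db, a hypothetical table school_field, and the saved pages living under html/*.html. It stores one (file, label, value) record per table row, and its clean() helper trims both ends of a cell rather than reusing cleanup() from above.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use DBI;
use HTML::TableExtract;

# Assumptions (not from the thread): SQLite driver, schools.db,
# table name school_field, saved pages under html/*.html.
my $dbh = DBI->connect('dbi:SQLite:dbname=schools.db', '', '',
    { RaiseError => 1, AutoCommit => 0 });

$dbh->do(<<'SQL');
CREATE TABLE IF NOT EXISTS school_field (
    file  TEXT NOT NULL,
    label TEXT NOT NULL,
    value TEXT
)
SQL

my $sth = $dbh->prepare(
    'INSERT INTO school_field (file, label, value) VALUES (?, ?, ?)');

for my $file (glob 'html/*.html') {
    # Use a fresh extractor for every document.
    my $te = HTML::TableExtract->new(attribs => {
        border => 0, bgcolor => '#EFEFEF', leftmargin => 15, topmargin => 5,
    });
    $te->parse_file($file);

    my ($table) = $te->tables;
    unless ($table) {
        warn "no matching table in $file\n";
        next;
    }

    for my $row ($table->rows) {
        # First cell is taken as the label, second as the value.
        my ($label, $value) = map { clean($_) } @$row;
        next unless defined $label && length $label;
        $label =~ s/:\z//;    # drop the trailing colon of the label
        $sth->execute($file, $label, $value);
    }
}

$dbh->commit;
$dbh->disconnect;

# Collapse whitespace (including non-breaking spaces) and trim both ends.
sub clean {
    my ($text) = @_;
    return '' unless defined $text;
    $text =~ s/[\s\xa0]+/ /g;
    $text =~ s/\A | \z//g;
    return $text;
}
```

Keeping the labels as data instead of one column per field means the script does not have to know the 16 or 17 labels in advance; a wide table with one column per label would work just as well if the set of labels is fixed.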

Sinan Ünür
Hello Sinan Ünür, that is great. You did more than expected. I am very happy. This is exactly the result I want. I am overwhelmed. The code you suggest does the whole trick!
thebutcher
Again, Sinan Ünür, I am excited. I still have to understand the code. You get great output, and I would love to understand the code that produces it! Great job. I will come back later today; right now I have to leave the house for two hours, but I will be back for sure. Many, many thanks again! Greetings, Martin
thebutcher
Me again: I tried to correct the thread title, but the system denied that since I am a new user. Martin
thebutcher
@martin Post the correct title as a comment to your question and someone else will fix it for you. Also, the documentation for `HTML::TableExtract` is available. It does what it says it does: Extracts specific tables from HTML source code. And it does that *really* well.
Sinan Ünür
Many thanks, Sinan!
thebutcher
A: 

Hello Sinan Ünür

[Since this is an important question, I am doing the unexpected and reposting in my own thread, which is not usual. Sorry for doing that! But above all, I really love this great page!]

Many, many thanks to you for the great help. This site is really overwhelming. I love it for its great, supportive users. You helped me a lot!

I also read the documentation for HTML::TableExtract which you pointed to. The script does what it says it does: it extracts specific tables from HTML source code. And it does that really well.

BTW: I want (need) to do this with another table/site:

See this page: http://www.schulministerium.nrw.de/BP/SchuleSuchen?action=672.8924536341191

Note: tick all the checkboxes at the bottom of the page. You then get a result page with more than 6400 schools. On the right-hand side of each result there is a link, "Weitere Informationen anzeigen"; clicking it shows the detailed information.

The detail view has nine (or ten) lines under "Schuldaten": Schulnummer, Amtliche Bezeichnung, Strasse, Plz und Ort, Telefon, Fax, E-Mail-Adresse, Schuldaten ändern (is this UTF-8 encoded, or what?), and Schülergesamtzahl (is this UTF-8 encoded, or what?).

Question: can HTML::TableExtract be applied here too, to the result page with more than 6400 schools (see above)?

Love to hear from you

thebutcher
Please consider adding the bulk of this text as a comment to Sinan's answer.
Bart J
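
On the follow-up question about the second site: whether HTML::TableExtract fits depends on whether the detail data is actually marked up as an HTML table, which the thread does not confirm. One way to find out might be to save a detail page locally and list every table the module sees, together with its depth and count coordinates. This is a rough sketch; the helper name table_report and the idea of passing the saved file on the command line are illustrative assumptions.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use HTML::TableExtract;

# Hypothetical helper: returns one report line per table and per row.
sub table_report {
    my ($html) = @_;

    # No constraints given, so every table in the document is collected.
    my $te = HTML::TableExtract->new;
    $te->parse($html);
    $te->eof;

    my @lines;
    for my $table ($te->tables) {
        my ($depth, $count) = $table->coords;
        push @lines, "Table at depth $depth, count $count";
        for my $row ($table->rows) {
            # Empty cells come back as undef; print them as empty strings.
            push @lines, '  ' . join(' | ', map { defined($_) ? $_ : '' } @$row);
        }
    }
    return @lines;
}

# Usage: perl report.pl saved-page.html
if (@ARGV) {
    open my $fh, '<', $ARGV[0] or die "cannot open $ARGV[0]: $!\n";
    local $/;    # slurp the whole file
    print "$_\n" for table_report(<$fh>);
}
```

Once the right table shows up in the report, its coordinates can be passed back as constraints, for example HTML::TableExtract->new(depth => $depth, count => $count), in place of the attribs used for the first site.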