ansaurus

Question

How can I extract the contents of a specific table from HTML source using Perl?

Answer 1

+3 A:

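#!/usr/bin/perl

use strict; use warnings;
use HTML::TableExtract;

# Select only tables whose attributes match all of these values.
my $te = HTML::TableExtract->new( attribs => {
    border => 0,
    bgcolor => '#EFEFEF',
    leftmargin => 15,
    topmargin => 5,
});

# Parse the locally saved copy of the page.
$te->parse_file('kultus-bw.html');

# Take the first table that matched the attributes.
my ($table) = $te->tables;

for my $row ( $table->rows ) {
    cleanup(@$row);    # modifies the cells in place via @_ aliasing
    print "@$row\n";
}

sub cleanup {
    for ( @_ ) {
        s/\s+//;           # remove the first run of whitespace (unanchored)
        s/[\xa0 ]+\z//;    # remove trailing spaces and non-breaking spaces
        s/\s+/ /g;         # collapse remaining whitespace runs to one space
    }
}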

Output:

Schul-/Behördenname: Abendgymnasium Ostwürttemberg
Schulart: Privatschule (04313488)
Hausadressse: Friedrichstr.70, 73430 Aalen
Postfachadresse: Keine Angabe
Telefon: 07361/680040
Fax: 07361/680040
E-Mail: Keine Angabe
Internet: www.abendgymnasium-ostwuerttemberg.de 
ÜbergeordneteDienststelle: Regierungspräsidium Stuttgart Abteilung 7 Schule und Bildung 
Schulleitung: Keine Angabe
Stellv.Schulleitung: Keine Angabe
AnzahlSchüler: 259
AnzahlKlassen: 8
AnzahlLehrer: Keine Angabe
Kreis: Ostalbkreis
Schulträger: <Verband/Verein> (Verband/Verein)
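
The joined labels in the output above (for example AnzahlKlassen) look like an effect of the first substitution in cleanup, which is not anchored to the start of the string. Here is a minimal sketch of that behaviour using core Perl only. The sub names and the sample cell texts are hypothetical, and it assumes the label cell on the page read "Anzahl Klassen:" with no leading whitespace.

```perl
#!/usr/bin/perl
use strict; use warnings;

# Same three substitutions as cleanup() in the script above.
sub cleanup_unanchored {
    my ($s) = @_;
    $s =~ s/\s+//;          # first whitespace run, wherever it occurs
    $s =~ s/[\xa0 ]+\z//;   # trailing spaces and non-breaking spaces
    $s =~ s/\s+/ /g;        # collapse remaining whitespace runs
    return $s;
}

# Variant with the first substitution anchored to the start of the string.
sub cleanup_anchored {
    my ($s) = @_;
    $s =~ s/\A\s+//;        # leading whitespace only
    $s =~ s/[\xa0 ]+\z//;
    $s =~ s/\s+/ /g;
    return $s;
}

# Hypothetical cell texts: one without and one with leading whitespace.
my $label = "Anzahl Klassen:";
my $value = "\n  Keine   Angabe ";

print cleanup_unanchored($label), "\n";   # AnzahlKlassen:
print cleanup_anchored($label), "\n";     # Anzahl Klassen:
print cleanup_unanchored($value), "\n";   # Keine Angabe
print cleanup_anchored($value), "\n";     # Keine Angabe
```

If the space inside the labels should be kept, the anchored variant is the one to use. For cells that start with whitespace, both variants give the same result.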

Of course, I saved a local copy of the page before running the script.
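
To try the same approach without a saved page, here is a rough sketch that parses an inline string instead of a file. The markup is a hypothetical stand-in and not the real page. It assumes HTML::TableExtract is installed from CPAN, and it matches on bgcolor only, to keep the example short.

```perl
#!/usr/bin/perl
use strict; use warnings;
use HTML::TableExtract;

# Hypothetical sample markup: two tables, only the second has bgcolor="#EFEFEF".
my $html = <<'HTML';
<html><body>
<table border="1"><tr><td>navigation</td></tr></table>
<table border="0" bgcolor="#EFEFEF">
<tr><td>Telefon:</td><td>07361/680040</td></tr>
<tr><td>Kreis:</td><td>Ostalbkreis</td></tr>
</table>
</body></html>
HTML

# Select tables by attribute, as in the script above.
my $te = HTML::TableExtract->new( attribs => { bgcolor => '#EFEFEF' } );
$te->parse($html);
$te->eof;    # tell the underlying HTML::Parser that the input is complete

my ($table) = $te->tables;
die "no matching table found\n" unless $table;

# Join the cells of each row with a space; empty cells become ''.
my @rows;
for my $row ( $table->rows ) {
    push @rows, join ' ', map { defined $_ ? $_ : '' } @$row;
}
print "$_\n" for @rows;
```

The first table is skipped because it has no matching bgcolor attribute. On a real page it is worth checking which attributes actually identify the table you want.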

Sinan Ünür 2010-10-16 13:56:11

Hello Sinan Ünür, that is great. You did more than expected. I am very happy. This is exactly the result I want. I am overwhelmed. The code you suggest does the whole trick!

thebutcher 2010-10-16 16:47:48

Again, Sinan Ünür, I am excited. I have to understand the code. You get great output, and I would love to understand the code that produces it. Great job. I will come back later today. I have to leave the house for two hours now, but I will certainly be back. Many, many thanks again! Greetings, Martin

thebutcher 2010-10-16 17:06:26

Me again. I tried to correct the thread title, but the system denied that since I am a new user. Martin

thebutcher 2010-10-16 17:09:32

@martin Post the correct title as a comment to your question and someone else will fix it for you. Also, the documentation for `HTML::TableExtract` is available. It does what it says it does: Extracts specific tables from HTML source code. And it does that *really* well.

Sinan Ünür 2010-10-16 17:29:51

Many thanks, Sinan!

thebutcher 2010-10-18 19:05:49

Answer 2

A:

Hello Sinan Ünür

[Since this is an important question, I am doing the unexpected and reposting in my own thread, which is not usual. Sorry for doing that! But above all, I really love this great page!]

Many, many thanks to you for the great help. This site is really overwhelming. I love it for its great, supportive users. You helped me a lot!

I also read the documentation for HTML::TableExtract which you pointed to. The module does what it says it does: it extracts specific tables from HTML source code. And it does that really well.

BTW: I want (need) to do this with another table on another site.

See this page: http://www.schulministerium.nrw.de/BP/SchuleSuchen?action=672.8924536341191

Note: tick all the checkboxes at the bottom of the page. You then see a result page with more than 6400 schools. At the right of each result is a link, "Weitere Informationen anzeigen" ("show further information"). Clicking it gives detailed information.

The detail view has nine (or ten) lines under the heading Schuldaten: Schulnummer, Amtliche Bezeichnung, Strasse, Plz und Ort, Telefon, Fax, E-Mail-Adresse, Schuldaten ändern (is this UTF-8 encoded, or what?), Schülergesamtzahl (is this UTF-8 encoded, or what?)

Question: can HTML::TableExtract be applied here too, on the result page with more than 6400 schools (see above)?

I would love to hear from you.

thebutcher 2010-10-16 19:43:14

Please consider adding the bulk of this text as a comment to Sinan's answer.

Bart J 2010-10-18 06:26:57
