I tried to run the following Perl script on the HTML shown further below. My problem is how to define the correct hash reference for attribs, which specifies the attributes of interest within the HTML <table> tag itself.

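#!/usr/bin/perl

use strict; use warnings;
use HTML::TableExtract;
use YAML;

# Strip leading whitespace and trailing non-breaking spaces,
# then collapse internal runs of whitespace to a single space.
sub cleanup {
    for ( @_ ) {
        s/\A\s+//;
        s/[\xa0 ]+\z//;
        s/\s+/ /g;
    }
}

my $table = HTML::TableExtract->new(keep_html => 0, depth => 1, count => 1, br_translate => 0);

# $html is assumed to hold the HTML shown further below
$table->parse($html);
foreach my $row ($table->rows) {
    print join("\t", @$row), "\n";
}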

I want to apply this code to the HTML document shown further below.

My first approach is to use the columns method, but I cannot figure out how to use it on the HTML file below. My intuition says it should be something like the following (but my intuition is wrong):

foreach my $column ($table->columns) { 
    print join("\t", @$column), "\n"; 
}

The HTML::TableExtract documentation doesn't shed much light (for me anyway).

I can see in the code of the module that the columns method belongs to HTML::TableExtract::Table, but I can't figure out how to use it. I appreciate any help.

Background:

I am trying to extract a table. I have a very small document of tables that I want to parse with the HTML::TableExtract module. I am trying to search for keywords in the HTML so that I can use them for the attribs option, because I need to print only the necessary data.

I tried searching CPAN but could not really find out how to search through it for particular keywords. One way to do it would be HTML::TableExtract; the other would be to parse with HTML::TokeParser, with which I have very little experience.

Either way, I need to do this parsing: I want to write the result of the parsed tables to a text file or, even better, store it in a database. The problem is that I can't find any way to search through the resulting parsed table and get the necessary data.

The HTML

<html>
<head>
<meta http-equiv="Content-Type" content="text/html; charset=ISO-8859-1">
<meta name="GENERATOR" content="Microsoft FrontPage 3.0">

<link rel="stylesheet" href="jspsrc/css/bp_style.css" type="text/css">

<title>Weitere Schulinformationen</title>
</head>

<body class="bodyclass">
<div style="text-align:center;"><center>
<!-- <fieldset><legend> general information  </legend>
-->
<br/>

<table border="1" cellspacing="0" bordercolordark="white" bordercolorlight="black" width="80%" class='bp_result_tab_info'>
<!-- <table border="0" cellspacing="0" bordercolordark="white" bordercolorlight="black" width="80%" class='bp_search_info'>
-->  
  <tr>
    <td width="100%" colspan="2" class="ldstabTitel"><strong>data_one </strong></td>
  </tr>
  <tr>
    <td width="27%"><strong>data_two</strong></td>
    <td width="73%">&nbsp;116439
  </td>
  </tr>
  <tr>
    <td width="27%"><strong>official_description</strong></td>
    <td width="73%">the name </td>
  </tr>
  <tr>
    <td width="27%"><strong>name of the street</strong></td>
    <td width="73%">champs elysee</td>
  </tr>
  <tr>
    <td width="27%"><strong>number and town</strong></td>
    <td width="73%"> 75000 paris </td>
  </tr>
  <tr>
    <td width="27%"><strong>telefon</strong></td>

    <td width="73%">&nbsp;000241 49321
</td>
  </tr>
  <tr>
    <td width="27%"><strong>fax</strong></td>
    <td width="73%">&nbsp;000241 4093287
</td>
  </tr>
  <tr>
  <td width="27%"><strong>e-mail-adresse</strong></td>
  <td width="73%">&nbsp;<a href=mailto:1111116439@my_domain.org>[email protected]</a>
</td>
  </tr>
  <tr>
    <td width="27%"><strong>internet-site</strong></td>
    <td width="73%">&nbsp;<a href=http://www.thesite.org>http://www.thesite.org</td>
 </tr>
<!--  
<tr>
    <td width="27%">&nbsp;</td>
    <td width="73%" align="right"><a href="schule_aeinfo.php?SNR=<? print $SCHULNR ?>" target="_blank">
    [Schuldaten &auml;ndern]&nbsp;&nbsp;</a>
</tr>
</td> -->
<tr>
  <td width="27%">&nbsp;</td>
  <td width="73%">the department</td>
 </tr> 

  <tr>
    <td width="100%" colspan=2><strong>&nbsp;</strong></td>
 </tr> 
 <tr>
    <td width="27%"><strong>number of indidviduals</strong></td>
    <td width="73%">&nbsp;192</td>
<tr>
    <td width="100%" colspan=2><strong>&nbsp;</strong></td>
   </tr>
  <!-- if (!fsp.isEmpty()){
 ztext = "&nbsp;";

 int i = 0;
 Iterator it = fsp.iterator();
 while (it.hasNext()){
  String[] zwert = new String[2];
  zwert = (String[])it.next();

  if (i==0){
   if (zwert[1].equals("0")){
    ztext = ztext+zwert[0];
   }else{
    ztext = ztext+zwert[0]+" mit "+zwert[1];
    if (zwert[1].equals("1")){
     ztext = ztext+" Sch&uuml;ler";
    }else{
     ztext = ztext+" Sch&uuml;lern";
    }
   } 
   i++;
  }else{
   if (zwert[1].equals("0")){
    ztext = ztext+"<br>&nbsp;"+zwert[0];
   }else{
    ztext = ztext+"<br>&nbsp;"+zwert[0]+" mit "+zwert[1];
    if (zwert[1].equals("1")){
     ztext = ztext+" Sch&uuml;ler";
    }else{
     ztext = ztext+" Sch&uuml;lern";
    }
   } 
  }  
 } 

-->





</table>
<!--  </fieldset>  -->
<br>

</body>
</html>

Thanks for any and all help.

A: 

You need to provide something that uniquely identifies the table in question. This can be the content of its headers or the HTML attributes. In this case, there is only one table in the document, so you don't even need to do that. But, if I were to provide anything to the constructor, I would provide the class of the table.

Also, I do not think you want the columns of the table. The first column of this table consists of labels and the second column consists of values. To get the labels and values at the same time, you should process the table row-by-row.
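As a rough illustration of the row-by-row approach, here is a minimal sketch. The helper name label_value_rows and the two-row HTML excerpt are made up for this example; only the class name comes from the table in the question.

```perl
#!/usr/bin/perl

use strict; use warnings;
use HTML::TableExtract;

# Hypothetical helper: return one [label, value] pair per row of every
# table whose class attribute is 'bp_result_tab_info'.
sub label_value_rows {
    my ($html) = @_;

    my $te = HTML::TableExtract->new(
        attribs => { class => 'bp_result_tab_info' },
    );
    $te->parse($html);

    my @pairs;
    for my $table ( $te->tables ) {
        for my $row ( $table->rows ) {
            # Take the first two cells; cells covered by a colspan
            # come back undefined, so map them to empty strings.
            my ($label, $value) = map { defined $_ ? $_ : '' } @{$row}[0, 1];
            push @pairs, [ $label, $value ];
        }
    }
    return @pairs;
}

# Made-up two-row excerpt of the table in the question.
my $html = <<'HTML';
<table class='bp_result_tab_info'>
  <tr><td><strong>name of the street</strong></td><td>champs elysee</td></tr>
  <tr><td><strong>fax</strong></td><td>000241 4093287</td></tr>
</table>
HTML

print join("\t", @$_), "\n" for label_value_rows($html);
```

Each row then carries its label and its value together, which is what you want when writing to a text file or a database. For comparison, the script below dumps the same table column by column.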

#!/usr/bin/perl

use strict; use warnings;
use HTML::TableExtract;
use YAML;

my $te = HTML::TableExtract->new(
    attribs => { class => 'bp_result_tab_info' },
);

$te->parse_file('t.html');

for my $table ( $te->tables ) {
    print Dump $table->columns;
}

Output:

---
- 'data_one '
- data_two
- official_description
- name of the street
- number and town
- telefon
- fax
- e-mail-adresse
- internet-site
- á
- á
- number of indidviduals
- á
---
- ~
- "á116439\r\n  "
- 'the name '
- champs elysee
- ' 75000 paris '
- "á000241 49321\r\n"
- "á000241 4093287\r\n"
- "á[email protected]\r\n"
- áhttp://www.thesite.org
- the department
- ~
- á192
- ~
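
The á characters in the dump are most likely the non-breaking spaces (&nbsp;, character 0xA0) from the HTML as rendered by the console's code page, and the \r\n sequences are line breaks inside the cells. A small cleanup routine along the following lines could remove both. This is only a sketch: the sample values are copied from the dump above, with "\x{a0}" standing in for the character shown as á, and this variant returns cleaned copies instead of modifying its arguments.

```perl
#!/usr/bin/perl

use strict; use warnings;

# Strip non-breaking spaces (0xA0) and ordinary whitespace from both ends
# of each value, then collapse internal runs to a single space.
# Returns cleaned copies; the arguments are left untouched.
sub cleanup {
    my @cleaned = @_;
    for ( @cleaned ) {
        next unless defined;      # colspan gaps come back as undef
        s/\A[\xa0\s]+//;
        s/[\xa0\s]+\z//;
        s/[\xa0\s]+/ /g;
    }
    return @cleaned;
}

# Sample cell values taken from the dump above.
my @raw = (
    "\x{a0}116439\r\n  ",
    'the name ',
    ' 75000 paris ',
    "\x{a0}000241 49321\r\n",
    undef,
);

# Prints: 116439|the name|75000 paris|000241 49321|
print join( '|', map { defined $_ ? $_ : '' } cleanup(@raw) ), "\n";
```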

Finally, a word of advice: It is clear that you do not have much of an understanding of Perl (or HTML for that matter). It would be better for you to try to learn some of the basics first. This way, all you are doing is incorrectly copying and pasting code from one answer into another and not learning anything.

Sinan Ünür
Hello Sinan. Again, many, many thanks for all you did. You are a true Perl expert. Your advice is a great place to learn! And yes, I am willing to learn the basics! My problem is that I have to solve some real-life problems, which makes me try to learn in real life. But that's another story. Many thanks again - Martin
thebutcher