Hi Guys,

I need to retrieve some data from a web page. After analysing the page's HTML, I found that the data I need is embedded in a table with a unique table id. I don't know whether unique ids are an HTML rule or not, but it should make parsing easier.

The data in the table is arranged as below (various attributes and tags have been omitted to show the data structure clearly):

<table .... id = "tablename" .... >
    <tr>
         <td .... >field1</td>
             ....
         <td .... >fieldn</td>
    </tr>
         #several "trs" here
    <tr>
         <td .... >field1</td>
             ....
         <td .... >fieldn</td>
    </tr>
</table>

So my question is: how can I use one of Perl's HTML parsing modules to extract the fields from this table?

Thanks in advance.

A: 

Look at Ken MacFarlane's "Parsing HTML with HTML::Parser" in The Perl Journal. I'm not sure if that's the parser you're referring to, but it looks like it can do what you want, or at least point you in the right direction.

Chris Thompson
You shouldn't have to reach down into HTML::Parser for this. There are many tools built on top of it that should be able to handle the job.
brian d foy
A: 

You can try something like this:

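my $html = '<html code....';

# Keep only the table whose id is "tablename", allowing other attributes on the tag.
# The non-greedy .*? stops at the first </table>, so nested tables will break this.
$html =~ s/^.*?(<table\b[^>]*\sid\s*=\s*"tablename"[^>]*>.*?<\/table>).*/$1/si;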
sitemap
http://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags/1732454#1732454
Leonardo Herrera
+10  A: 

HTML::TableExtract sounds exactly like what you are looking for.
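
For illustration, here is a minimal sketch of how that could look for the table in the question. The sample HTML and the variable names are assumptions made up for the example; only the id "tablename" comes from the question. It relies on the module's attribs option to pick the table by its id, then walks the rows.

```perl
use strict;
use warnings;
use HTML::TableExtract;

# Hypothetical sample page; in practice $html would hold the fetched document.
my $html = <<'HTML';
<html><body>
<table id="other"><tr><td>skip</td></tr></table>
<table id="tablename">
  <tr><td>field1</td><td>field2</td></tr>
  <tr><td>a</td><td>b</td></tr>
</table>
</body></html>
HTML

# Select only the table whose id attribute is "tablename".
my $te = HTML::TableExtract->new( attribs => { id => 'tablename' } );
$te->parse($html);
$te->eof;

# Collect every row as an array ref of cell texts (empty cells become '').
my @rows;
for my $table ( $te->tables ) {
    for my $row ( $table->rows ) {
        push @rows, [ map { defined $_ ? $_ : '' } @$row ];
    }
}

# Print one tab-separated line per table row.
print join( "\t", @$_ ), "\n" for @rows;
```

If the page has to be fetched first, something like LWP::Simple's get() could supply $html; that part is left out here to keep the sketch focused on the extraction.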

Leon Timmermans
+2  A: 

Use HTML::Table.

Pradeep