I have HTML content stored in a variable. How do I extract data that is found between a set of common tags in the page? For example, I am interested in the data (represented by DATA) kept between a set of tags that appear one line after the other:

...
<td class="jumlah">*DATA_1*</td>
<td class="ud"><a href="">*DATA_2*</a></td>
...

I would then like to store the mapping DATA_2 => DATA_1 in a hash.

+1  A: 

Since it is HTML I think this could work for you?

http://search.cpan.org/~msergeant/XML-XPath-1.13/XPath.pm

XPath is the way.

dierre
Isn't XPATH limited to XML (and thus XHTML only)? I have very limited experience with it, but never saw it used to handle non-X HTML
DVK
@DVK: I wouldn't put it past an XPath module developed in Perl to try to be a little more clever.
Axeman
@Axeman - touche :)
DVK
I've always used the HTML::TreeBuilder::XPath library when using XPath to query HTML documents (http://search.cpan.org/~mirod/HTML-TreeBuilder-XPath-0.11/lib/HTML/TreeBuilder/XPath.pm). It's been pretty robust as far as I can tell (I've scraped tens of thousands of business locations from certain sites using it).
jasonmp85
I wanted to link you the HTML::TreeBuilder::XPath but I got it wrong when copying the link from google. I'm sorry.
dierre
@dierre: +1 for trying. :)
Axeman
A: 

Use an HTML parsing module such as HTML::TreeBuilder or HTML::Parser, as described in the answers to this question.

In theory you could try doing this with regular expressions, but as noted in the linked question's answers and countless other times on SO, parsing HTML with a regex is a Bad Idea with capital letters: too easy to get wrong, too hard to do well, and impossible to get 100% right since HTML is not a regular language.
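To make the trade-off concrete, here is a minimal regex sketch for markup exactly as rigid as the snippet in the question. The sample rows and the %data hash name are made up for illustration, and the pattern assumes the two cells always appear back to back with the same attributes in the same order; any change to the markup will silently break it.

```perl
use strict;
use warnings;

# Hypothetical sample input; the values are invented for illustration.
my $content = <<'HTML';
<tr>
<td class="jumlah">10</td>
<td class="ud"><a href="">apples</a></td>
</tr>
<tr>
<td class="jumlah">25</td>
<td class="ud"><a href="">pears</a></td>
</tr>
HTML

my %data;
# /x lets the pattern be spaced out, /s lets . cross newlines,
# /g walks through every pair of cells in the string.
while ($content =~ m{
        <td \s+ class="jumlah"> (.*?) </td> \s*
        <td \s+ class="ud"> \s* <a [^>]*> (.*?) </a> \s* </td>
    }gsx) {
    $data{$2} = $1;    # DATA_2 => DATA_1
}

print "$_ => $data{$_}\n" for sort keys %data;
```

This works only as long as the page never adds an attribute, reorders the cells, or nests another tag inside them, which is exactly why a real parser is the safer choice.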

DVK
It might be theoretically impossible in general, since HTML isn't a regular language. But if his particular query is "regular", it would be possible.
Paul Nathan
A: 

You might try this module: HTML::TreeBuilder::XPath. The doc says:

This module adds typical XPath methods to HTML::TreeBuilder, to make it easy to query a document.

Axeman
+1  A: 

Since it's HTML, you probably want the XPath module made for working with HTML, HTML::TreeBuilder::XPath.

First you'll need to parse your string using the HTML::TreeBuilder methods. Assuming your webpage's content is in a variable named $content, do it like this:

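use HTML::TreeBuilder::XPath;
my $tree = HTML::TreeBuilder::XPath->new;
$tree->parse($content);   # parse the string held in $content
$tree->eof;               # signal that there is no more input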

Now you can use XPath expressions to get iterators over the nodes you care about. This first expression gets all td nodes that are in a tr in a table in the body in the html element:

my $tdNodes = $tree->findnodes('/html/body/table/tr/td');

Finally you can just iterate over all the nodes in a loop to find what you want:

foreach my $node ($tdNodes->get_nodelist) {
  my $data = $node->findvalue('.'); # the content of the node
  print "$data\n";
}

See the HTML::TreeBuilder documentation for more on its methods and the NodeSet documentation for how to use the NodeSet result object. w3schools has a passable XPath tutorial here.

With all this, you should be able to do pretty robust HTML parsing to grab out any element you want. You can even specify classes, ids, and more in your XPath queries to be really specific about which nodes you want. In my opinion, parsing HTML using this modified XPath library is a lot faster and more maintainable than dealing with a bunch of one-off regexes.
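Putting it together for the question's markup, a minimal sketch might look like the following. The sample table and the %data hash name are assumptions made up for illustration, and it assumes each pair of cells shares a tr; adjust the XPath expressions to match the real page.

```perl
use strict;
use warnings;
use HTML::TreeBuilder::XPath;

# Hypothetical sample standing in for $content; the values are invented.
my $content = <<'HTML';
<html><body><table>
<tr>
<td class="jumlah">10</td>
<td class="ud"><a href="">apples</a></td>
</tr>
<tr>
<td class="jumlah">25</td>
<td class="ud"><a href="">pears</a></td>
</tr>
</table></body></html>
HTML

my $tree = HTML::TreeBuilder::XPath->new;
$tree->parse($content);
$tree->eof;

my %data;
for my $row ($tree->findnodes('//tr')) {
    # Select the cells by class, relative to the current row.
    my $value = $row->findvalue('./td[@class="jumlah"]');   # DATA_1
    my $key   = $row->findvalue('./td[@class="ud"]/a');     # DATA_2
    next unless length $key;    # skip rows without the cells we want
    $data{$key} = $value;
}
$tree->delete;    # free the parse tree when done

print "$_ => $data{$_}\n" for sort keys %data;
```

Walking row by row keeps each DATA_2 paired with the DATA_1 from the same tr, so a row missing one of the cells cannot shift the mapping.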

jasonmp85
+1 for having example code.
Kinopiko