ansaurus

Question

Answer 1

+1 A:

Since it is HTML I think this could work for you?

http://search.cpan.org/~msergeant/XML-XPath-1.13/XPath.pm

XPath is the way.

dierre 2010-05-21 23:23:05

Isn't XPath limited to XML (and thus XHTML only)? I have very limited experience with it, but I've never seen it used to handle non-X HTML.

DVK 2010-05-21 23:31:48

@DVK: I wouldn't put it past an XPath module developed in Perl to try to be a little more clever.

Axeman 2010-05-21 23:33:35

@Axeman - touché :)

DVK 2010-05-21 23:34:50

I've always used the HTML::TreeBuilder::XPath library when using XPath to query HTML documents (http://search.cpan.org/~mirod/HTML-TreeBuilder-XPath-0.11/lib/HTML/TreeBuilder/XPath.pm). It's been pretty robust as far as I can tell (I've scraped tens of thousands of business locations from certain sites using it).

jasonmp85 2010-05-21 23:44:00

I meant to link you to HTML::TreeBuilder::XPath, but I copied the wrong link from Google. Sorry about that.

dierre 2010-05-21 23:57:57

@dierre: +1 for trying. :)

Axeman 2010-05-22 00:42:12

Answer 2

A:

Use an HTML parsing module, such as HTML::TreeBuilder or HTML::Parser, as described in the answers to this Q.

In theory you could try to do this with regular expressions, but as noted in the linked question's answers and countless other times on SO, parsing HTML with regexes is a Bad Idea with capital letters: too easy to get wrong, too hard to get right, and impossible to get 100% right, since HTML is not a regular language.
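To make the fragility concrete, here is a minimal sketch in plain Perl. The sample markup and the pattern are invented for illustration and are not taken from the question. A pattern that looks reasonable for plain td cells silently skips any cell that carries an attribute:

```perl
use strict;
use warnings;

# Hypothetical sample markup, invented for this illustration.
my $html = '<table><tr><td>alpha</td><td class="num">42</td></tr></table>';

# A naive pattern that only matches <td> tags written without attributes.
my @cells = $html =~ m{<td>(.*?)</td>}g;

# The second cell has a class attribute, so the pattern never matches it
# and no error or warning is raised.
print scalar(@cells), " cell(s) found: @cells\n";
```

Loosening the pattern to allow attributes fixes this one case, but nested tables, comments, and unclosed tags each need yet another patch, which is the maintenance problem described above.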

DVK 2010-05-21 23:31:28

In the general case it is theoretically impossible, since HTML isn't a regular language. But if his particular query is "regular", it would be possible.

Paul Nathan 2010-05-21 23:48:38

Answer 3

A:

You might try this module: HTML::TreeBuilder::XPath. The doc says:

This module adds typical XPath methods to HTML::TreeBuilder, to make it easy to query a document.

Axeman 2010-05-21 23:38:36

Answer 4

+1 A:

Since it's HTML, you probably want the XPath module made for working with HTML, HTML::TreeBuilder::XPath.

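First you'll need to parse your string using the HTML::TreeBuilder methods, which HTML::TreeBuilder::XPath inherits. Create the tree from the XPath subclass, otherwise findnodes won't be available. Assuming your webpage's content is in a variable named $content, do it like this:

use HTML::TreeBuilder::XPath;
my $tree = HTML::TreeBuilder::XPath->new;
$tree->parse($content);
$tree->eof;  # signal that there is no more input to parse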

Now you can use XPath expressions to get the set of nodes you care about. Called in scalar context, findnodes returns a NodeSet object. This first expression gets all td nodes that are in a tr in a table in the body in the html element:

my $tdNodes = $tree->findnodes('/html/body/table/tr/td');

Finally you can just iterate over all the nodes in a loop to find what you want:

foreach my $node ($tdNodes->get_nodelist) {
  my $data = $node->findvalue('.');  # the content of the node
  print "$data\n";
}

See the HTML::TreeBuilder documentation for more on its methods and the NodeSet documentation for how to use the NodeSet result object. w3schools has a passable XPath tutorial here.

With all this, you should be able to do pretty robust HTML parsing to grab out any element you want. You can even specify classes, ids, and more in your XPath queries to be really specific about which nodes you want. In my opinion, parsing HTML using this modified XPath library is a lot faster and more maintainable than dealing with a bunch of one-off regexes.
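As a rough sketch of what an id- and class-specific query can look like, the example below selects only certain cells from one table. The markup, the id "prices", and the class names are all made up for illustration, and the script assumes HTML::TreeBuilder::XPath is installed from CPAN:

```perl
use strict;
use warnings;
use HTML::TreeBuilder::XPath;

# Hypothetical sample markup, invented for this illustration.
my $content = <<'HTML';
<html><body>
<table id="prices">
  <tr><td class="name">Widget</td><td class="price">9.99</td></tr>
  <tr><td class="name">Gadget</td><td class="price">19.50</td></tr>
</table>
</body></html>
HTML

my $tree = HTML::TreeBuilder::XPath->new;
$tree->parse($content);
$tree->eof;

# Select only td cells with class "price" inside the table with id "prices".
# In list context findnodes returns the matching nodes directly.
my @prices = map { $_->as_text }
    $tree->findnodes('//table[@id="prices"]//td[@class="price"]');

print "$_\n" for @prices;

$tree->delete;  # free the tree once you are done with it
```

Note that @class="price" compares the whole attribute value, so a cell with class="price sale" would need a contains() test instead.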

jasonmp85 2010-05-21 23:42:43

+1 for having example code.

Kinopiko 2010-05-21 23:53:24


Grep and Extract Data in Perl
