I am trying to use XPath to extract some HTML tags and their data, and for that I need the XML::LibXML module.

I tried installing it from the CPAN shell, but it doesn't install.

I followed the installation instructions on the CPAN site, which say to install the libxml2, iconv and zlib wrappers before installing XML::LibXML, but that didn't work out either.

Also, if there is a simpler module that gets my task done, please let me know.

The task at hand:

I am searching for a specific <dd> tag on a really big HTML page (around 5000 - 10000 <dd> and <dt> tags). So I am writing a script that matches the content within a <dd> tag and fetches the content within the corresponding (next) <dt> tag.

I wish I could have been a little clearer. Any help is greatly appreciated.

+1  A: 

If you just want XPath queries: I wrote a script yesterday that uses XML::XPath::XMLParser to run XPath queries against an XML file.

I have tested it with both ActiveState's Perl installation and Strawberry Perl on Windows.

I don't remember having to go to CPAN to install any modules (though I may have done so earlier and forgotten :)), so perhaps you can use the XML::XPath module instead?

Here is the sample from the documentation:

use XML::XPath;
use XML::XPath::XMLParser;

my $xp = XML::XPath->new(filename => 'test.xhtml');

my $nodeset = $xp->find('/html/body/p'); # find all paragraphs

foreach my $node ($nodeset->get_nodelist) {
    print "FOUND\n\n", 
        XML::XPath::XMLParser::as_string($node),
        "\n\n";
}
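
For the <dd>/<dt> lookup described in the question, a minimal sketch along the same lines might look like this. It assumes the page is well-formed XHTML (XML::XPath will not accept sloppy HTML) and that each <dd> is followed by its <dt>, as the question describes; the sample markup and the searched-for text "2" are made up for illustration. If the term comes before the definition in your page, swap following-sibling for preceding-sibling.

```perl
use strict;
use warnings;
use XML::XPath;

# Made-up, well-formed sample standing in for the real page.
my $xhtml = <<'XHTML';
<html><body><dl>
<dd>1</dd><dt>One</dt>
<dd>2</dd><dt>Two</dt>
</dl></body></html>
XHTML

my $xp = XML::XPath->new(xml => $xhtml);

# Select the first <dt> sibling after the <dd> whose trimmed text is "2".
my $query = '//dd[normalize-space(.) = "2"]/following-sibling::dt[1]';

my @terms = map { $_->string_value } $xp->findnodes($query);
print "$_\n" for @terms;
```

The whole match is done inside the XPath expression, so no Perl loop over thousands of tags is needed.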
chollida
Since it is unlikely you will get the Win32 versions of libxml2, iconv, and zlib to work with the XML::LibXML module (although they exist; see http://gnuwin32.sourceforge.net/packages.html for example), I think chollida's approach sounds better.
ewall
@ewall - give some context. chollida's approach is better than what?
ysth
A: 

Assuming that you are using ActiveState Perl, you can get XML::LibXML working just fine. You can get XML::LibXML from Randy Kobes' site, and libxslt, libxml, etc. from zlatkovic.com.

I just install libxml first and then use ppm to install XML::LibXML. It works just fine.

If you are using Strawberry Perl, CPAN should work for you, as I believe libxml2 etc. are part of the Strawberry Perl distribution.
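
Once the module is installed, a rough sketch of the lookup from the question could look like the following. It assumes a reasonably recent XML::LibXML (one that provides load_html) and the <dd>-then-<dt> pairing the question describes; the sample markup and the searched-for text "2" are invented for illustration.

```perl
use strict;
use warnings;
use XML::LibXML;

# Invented sample standing in for the real page.
my $html = '<html><body><dl>'
         . '<dd>1</dd><dt>One</dt>'
         . '<dd>2</dd><dt>Two</dt>'
         . '</dl></body></html>';

# load_html uses libxml2's HTML parser; recover => 1 asks it to
# carry on past markup errors instead of dying.
my $doc = XML::LibXML->load_html(string => $html, recover => 1);

# Text of the first <dt> sibling after the <dd> whose trimmed text is "2".
my $term = $doc->findvalue(
    '//dd[normalize-space(.) = "2"]/following-sibling::dt[1]'
);
print "$term\n";
```

Unlike a strict XML parser, the HTML parser tolerates real-world markup, which is the main reason to go through the trouble of installing libxml2.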

Nic Gibson
+4  A: 

If you are using ActiveState Perl, you should add the repositories listed at ActivePerl 10xx Win32 PPM packages to ppm and then use

ppm install XML::LibXML

Trying to parse HTML as XML is generally not a pleasant task. I think HTML::TokeParser is better suited to this one.

#!/usr/bin/perl

use strict;
use warnings;

use HTML::TokeParser;

# Parse the sample document that follows __DATA__
my $p = HTML::TokeParser->new(\*DATA);

my @definitions;

# For each <dl>, collect every <dt> term together with the <dd> that follows it
while ( my $dl_tag = $p->get_tag('dl') ) {
    while ( my $dt_tag = $p->get_tag('dt') ) {
        my $term = $p->get_trimmed_text('/dt');
        my $dd_tag = $p->get_tag('dd');
        my $defn = $p->get_trimmed_text('/dd');
        push @definitions, [$term, $defn];
    }
}

use Data::Dumper;
print Dumper \@definitions;

__DATA__
<dl>
<dt>One</dt>
<dd>1</dd>
<dt>Two</dt>
<dd>2</dd>
</dl>

Output:

$VAR1 = [
          [
            'One',
            '1'
          ],
          [
            'Two',
            '2'
          ]
        ];
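
To look up a single entry rather than dump everything, the same loop can stop at the first match. This is only a sketch: term_for_definition is a hypothetical helper name, the sample string is invented, and it assumes the <dt>-then-<dd> order used in the example above (reverse the two get_tag calls if your page has them the other way round).

```perl
#!/usr/bin/perl

use strict;
use warnings;

use HTML::TokeParser;

# Hypothetical helper: return the <dt> text paired with the first <dd>
# whose trimmed text equals $wanted, or undef if nothing matches.
sub term_for_definition {
    my ($html, $wanted) = @_;

    # A scalar reference makes HTML::TokeParser parse the string itself.
    my $p = HTML::TokeParser->new(\$html) or die "Cannot parse HTML";

    while ( $p->get_tag('dt') ) {
        my $term = $p->get_trimmed_text('/dt');
        $p->get_tag('dd') or last;    # no <dd> left after this <dt>
        my $defn = $p->get_trimmed_text('/dd');
        return $term if $defn eq $wanted;
    }
    return;
}

my $html = '<dl><dt>One</dt><dd>1</dd><dt>Two</dt><dd>2</dd></dl>';
my $term = term_for_definition($html, '2');
print defined $term ? "$term\n" : "not found\n";    # prints "Two"
```

Because the parser works token by token and returns as soon as it finds the match, a page with thousands of <dt>/<dd> pairs does not have to be held in memory as a tree.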
Sinan Ünür