ansaurus

Question

Answer 1

+12 A:

Never ever use Regex to handle markup languages.

The original version of this answer (see below) used XML::XPath. Grant McLean said in the comments:

XML::XPath is an old and unmaintained module. XML::LibXML is a modern, maintained module with an almost identical API and it's faster too.

so I made a new version that uses XML::LibXML (thanks, Grant):

use warnings;
use strict;
use XML::LibXML;

my $doc   = XML::LibXML->load_xml(location => 'articles.xml');
my $xp    = XML::LibXML::XPathContext->new($doc->documentElement);
my $xpath = '/articles/article[position() < 4]';

foreach my $article ( $xp->findnodes($xpath) ) {
  # now do something with $article
  print $article.": ".$article->getName."\n";
}

For me this prints:

XML::LibXML::Element=SCALAR(0x346ef90): article
XML::LibXML::Element=SCALAR(0x346ef30): article
XML::LibXML::Element=SCALAR(0x346efa8): article

The types involved (each has its own documentation page):

The type of $doc will be XML::LibXML::Document.
The type of $xp is XML::LibXML::XPathContext.
The return type of $xp->findnodes() is XML::LibXML::NodeList.
The type of $article is XML::LibXML::Element.
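If the goal is to copy each matched <article> together with its tags and child elements (rather than only its text), a minimal sketch along these lines may help. It relies on toString() and importNode() from XML::LibXML; the inline sample XML and the variable names @serialized, $out and $root are assumptions made for illustration, not part of the original question.

```perl
use warnings;
use strict;
use XML::LibXML;

# Hypothetical sample input standing in for articles.xml.
my $xml = <<'XML';
<articles>
  <article><title>One</title><body>first</body></article>
  <article><title>Two</title><body>second</body></article>
  <article><title>Three</title><body>third</body></article>
  <article><title>Four</title><body>fourth</body></article>
</articles>
XML

my $doc = XML::LibXML->load_xml(string => $xml);

# A new, empty document that will hold only the first three articles.
my $out  = XML::LibXML::Document->new('1.0', 'UTF-8');
my $root = $out->createElement('articles');
$out->setDocumentElement($root);

my @serialized;
foreach my $article ( $doc->findnodes('/articles/article[position() < 4]') ) {
  # toString() serializes the element including its tags and children;
  # textContent() would return only the text inside it.
  push @serialized, $article->toString;

  # importNode() makes a deep copy that belongs to the new document.
  $root->appendChild( $out->importNode($article) );
}

print "$_\n" for @serialized;
print $out->toString(1);
```

Printing $article->textContent instead of $article->toString would drop the markup and leave only the text, which may explain output that "omits all the tags".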

Original version of the answer, based on the XML::XPath package:

use warnings;
use strict;
use XML::XPath;

my $xp    = XML::XPath->new(filename => 'articles.xml');
my $xpath = '/articles/article[position() < 4]';

foreach my $article ( $xp->findnodes($xpath)->get_nodelist ) {
  # now do something with $article
  print $article.": ".$article->getName ."\n";
}

which prints this for me:

XML::XPath::Node::Element=REF(0x38067b8): article
XML::XPath::Node::Element=REF(0x38097e8): article
XML::XPath::Node::Element=REF(0x3809ae8): article

The type of $xp is XML::XPath, obviously.
The return type of $xp->findnodes() is XML::XPath::NodeSet.
The type of $article will be XML::XPath::Node::Element in this case.

Have a look at the docs to find out what you can do with them.

Tomalak 2010-06-03 09:32:32

This is one case where a regex could easily do the job though.

Snake Plissken 2010-06-03 11:15:58

@Snake Plissken: No, it isn't. Regex is *never* the right tool for that kind of job, no matter how "easy" it seems. XPath+Programming Language X (Perl in this case) is, or XSLT is. Regex is not.

Tomalak 2010-06-03 11:20:18

You're being silly. In this case a regex can easily do the job. What are you going to do in the case that someone asks you to copy a non-XML file until something has been seen three times?

Snake Plissken 2010-06-03 11:26:24

I guess there are exceptions to the rule. This will be just a simple job, so I guess XML will handle it; I won't use regex for any hardcore HTML/XML parsing, though.

dusker 2010-06-03 13:20:39

BTW, I tried printing $article in the foreach loop, but it doesn't print anything.

dusker 2010-06-03 13:28:37

@Snake Plissken: I'm not being silly. I'm just trying to avoid being smart about when to use a proper parser. There is a nice XML parser built into Perl, there is absolutely no reason not to use it. (It's not "oh damn, I have to use a parser because this is too complex for regex", it's "oh damn, I can't use a parser because the language I use does not supply one". And the latter is almost never true.)

Tomalak 2010-06-03 13:49:56

Now it's kind of working: when I try to print the contents of $article, it prints but omits all the tags in between. I'd like it to copy everything inside the <article> tag, along with the values and other tags.

dusker 2010-06-03 16:40:10

Agreed here with Tomalak. Regexps are fine for some cases. Parsing XML is not one of them.

Robert P 2010-06-03 23:49:53

FYI, XML::XPath is an old and unmaintained module. XML::LibXML is a modern, maintained module with an almost identical API and it's faster too.

Grant McLean 2010-06-04 01:03:38

XML::Twig is another good XML module I can recommend.

Snake Plissken 2010-06-04 03:14:40

@Grant McLean: I've made a new version that uses `XML::LibXML`. Please have a look and comment on anything I could improve.

Tomalak 2010-06-04 11:37:55

Answer 2

A:

Here:

open my $input,  "<", "file.xml"           or die $!;
open my $output, ">", "truncated-file.xml" or die $!;
my $n_articles = 0;
# Copy the input line by line.
while (<$input>) {
  print $output $_;
  # Count the lines that contain a closing </article> tag ...
  if (m:</article>:) {
    $n_articles++;
    # ... and stop copying after the third one.
    if ($n_articles >= 3) {
      last;
    }
  }
}
close $input  or die $!;
close $output or die $!;

You really don't need an XML parser to do such a simple job.
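The loop above reads the input line by line, so it only stops early when the closing </article> tags fall on separate lines. If the whole document sits on one line, the first read returns everything and the entire file gets copied. A possible variation, shown here only as a sketch, sets the input record separator $/ to the closing tag so that each read ends at an </article>. The sample string and the in-memory filehandles are assumptions used to keep the example self-contained.

```perl
use warnings;
use strict;

# Hypothetical input: the whole document on a single line.
my $xml = '<articles><article>a</article><article>b</article>'
        . '<article>c</article><article>d</article></articles>';

# In-memory filehandles stand in for file.xml and truncated-file.xml.
my $truncated = '';
open my $input,  '<', \$xml       or die $!;
open my $output, '>', \$truncated or die $!;

my $n_articles = 0;
{
  # Read up to and including each closing tag instead of each newline.
  local $/ = '</article>';
  while (my $chunk = <$input>) {
    print $output $chunk;
    # The final chunk may not end in the tag, so test before counting.
    last if $chunk =~ m{</article>\z} && ++$n_articles >= 3;
  }
}
close $input  or die $!;
close $output or die $!;

print "$truncated\n";
```

As with the original loop, the copy ends right after the third </article>, so the closing </articles> tag is missing and the result is not well-formed XML on its own.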

Snake Plissken 2010-06-03 11:24:06

What that script did was copy all the contents of file.xml into truncated-file.xml.

dusker 2010-06-03 13:19:24

Then it's debugging time for you. Anyway there is another answer you can use if this doesn't work.

Snake Plissken 2010-06-04 03:16:40

Would you mind sharing that other solution? Thanks.

dusker 2010-06-04 05:33:27

I was referring to the other answer on this thread: http://stackoverflow.com/questions/2964637/parsing-xml-file-with-perl-regex/2964681#2964681

Snake Plissken 2010-06-04 07:35:28


Parsing XML file with perl - regex
