Hi everyone, I'm just a beginner in Perl and urgently need to prepare a small script that takes the first 3 items from an XML file and puts them in a new one. Here's an example of the XML file:

<article>
  {lot of other stuff here}
</article>
<article>
  {lot of other stuff here}
</article>
<article>
  {lot of other stuff here}
</article>
<article>
  {lot of other stuff here}
</article>

What I'd like to do is get the first 3 items, along with all the tags in between, and put them into another file. Thanks in advance for all the help. Regards, Peter

+12  A: 

Never ever use Regex to handle markup languages.

The original version of this answer (see below) used XML::XPath. Grant McLean said in the comments:

XML::XPath is an old and unmaintained module. XML::LibXML is a modern, maintained module with an almost identical API and it's faster too.

so I made a new version that uses XML::LibXML (thanks, Grant):

use warnings;
use strict;
use XML::LibXML;

my $doc   = XML::LibXML->load_xml(location => 'articles.xml');
my $xp    = XML::LibXML::XPathContext->new($doc->documentElement);
my $xpath = '/articles/article[position() < 4]';

foreach my $article ( $xp->findnodes($xpath) ) {
  # now do something with $article
  print $article.": ".$article->getName."\n";
}

For me this prints:

XML::LibXML::Element=SCALAR(0x346ef90): article
XML::LibXML::Element=SCALAR(0x346ef30): article
XML::LibXML::Element=SCALAR(0x346efa8): article
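
To copy the matched elements together with their markup rather than just naming them, serializing the nodes is one option. The following is a minimal sketch, not a definitive implementation. It assumes the input has an `<articles>` root element (as the XPath expression above implies, although the question's snippet shows none), it uses inline sample data with hypothetical `<title>` children, and it builds a new document from deep copies of the first three `<article>` nodes.

```perl
use strict;
use warnings;
use XML::LibXML;

# Sample input. The <articles> root and the <title> children are
# assumptions made for illustration only.
my $xml = <<'XML';
<articles>
  <article><title>One</title></article>
  <article><title>Two</title></article>
  <article><title>Three</title></article>
  <article><title>Four</title></article>
</articles>
XML

my $doc = XML::LibXML->load_xml(string => $xml);

# Create an empty output document whose root has the same name as the input root.
my $out  = XML::LibXML::Document->new('1.0', 'UTF-8');
my $root = $out->createElement($doc->documentElement->nodeName);
$out->setDocumentElement($root);

# importNode() makes a deep copy, so child tags and text come along
# and the source document is left unchanged.
for my $article ($doc->findnodes('/articles/article[position() < 4]')) {
    $root->appendChild($out->importNode($article));
}

# Serialize the new document, tags included (1 = indented output).
print $out->toString(1);
```

To write to a file instead, `$out->toFile('top3.xml', 1)` can replace the `print`; the filename is only an example. For a single node, `$article->toString` returns that element with all of its tags.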


Original version of the answer, based on the XML::XPath package:

use warnings;
use strict;
use XML::XPath;

my $xp    = XML::XPath->new(filename => 'articles.xml');
my $xpath = '/articles/article[position() < 4]';

foreach my $article ( $xp->findnodes($xpath)->get_nodelist ) {
  # now do something with $article
  print $article.": ".$article->getName ."\n";
}

which prints this for me:

XML::XPath::Node::Element=REF(0x38067b8): article
XML::XPath::Node::Element=REF(0x38097e8): article
XML::XPath::Node::Element=REF(0x3809ae8): article

Have a look at the docs to find out what you can do with them.

Tomalak
This is one case where a regex could easily do the job though.
Snake Plissken
@Snake Plissken: No, it isn't. Regex is *never* the right tool for that kind of job, no matter how "easy" it seems. XPath+Programming Language X (Perl in this case) is, or XSLT is. Regex is not.
Tomalak
You're being silly. In this case a regex can easily do the job. What are you going to do in the case that someone asks you to copy a non-XML file until something has been seen three times?
Snake Plissken
I guess there are exceptions to the rule. This will be just a simple job, so I guess XML will handle it; I won't use regex for hardcore HTML/XML parsing, though.
dusker
BTW, I tried printing $article in the foreach loop, but it doesn't print anything.
dusker
@Snake Plissken: I'm not being silly. I'm just trying to avoid being smart about when to use a proper parser. There are good XML parsers available for Perl as CPAN modules, so there is absolutely no reason not to use one. (It's not "oh damn, I have to use a parser because this is too complex for regex", it's "oh damn, I can't use a parser because the language I use does not supply one". And the latter is almost never true.)
Tomalak
Now it's kind of working: when I try to print the contents of $article, it prints but omits all the tags in between. I'd like it to copy everything inside the <article> tag, along with the values and other tags.
dusker
Agreed here with Tomalak. Regexps are fine for some cases. Parsing XML is not one of them.
Robert P
FYI, XML::XPath is an old and unmaintained module. XML::LibXML is a modern, maintained module with an almost identical API and it's faster too.
Grant McLean
XML::Twig is another good XML module I can recommend.
Snake Plissken
@Grant McLean: I've made a new version that uses `XML::LibXML`. Please have a look and comment on anything I could improve.
Tomalak
A: 

Here:

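# Copy file.xml line by line, stopping after the third closing </article> tag.
open my $input, "<", "file.xml" or die $!;
open my $output, ">", "truncated-file.xml" or die $!;
my $n_articles = 0;
while (<$input>) {
    print $output $_;
    # This counts lines that contain a closing tag, not the tags themselves.
    if (m:</article>:) {
        $n_articles++;
        last if $n_articles >= 3;
    }
}
close $input or die $!;
close $output or die $!;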

You really don't need an XML parser to do such a simple job.
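
The loop above works line by line, so it stops early only when the closing tags sit on separate lines. If the input has no line breaks, the first read returns the entire file and everything is copied. Here is a sketch for that case, assuming a single-line input: it sets the input record separator `$/` to the closing tag so that each read ends at one `</article>`. The helper name `copy_first_articles` and the in-memory sample data are made up for illustration.

```perl
use strict;
use warnings;

# Copy text from $in to $out until $limit closing </article> tags have been
# written. copy_first_articles is a hypothetical helper, not a module function.
sub copy_first_articles {
    my ($in, $out, $limit) = @_;
    local $/ = '</article>';    # each read ends at a closing tag
    my $seen = 0;
    while (my $chunk = <$in>) {
        # The final read may hold trailing text with no closing tag; skip it.
        last unless $chunk =~ m{</article>\z};
        print {$out} $chunk;
        last if ++$seen >= $limit;
    }
    return $seen;
}

# Demo on an in-memory document that has no line breaks at all.
my $xml = '<articles>'
        . join('', map { "<article>$_</article>" } 1 .. 4)
        . '</articles>';
my $result = '';
open my $in,  '<', \$xml    or die $!;
open my $out, '>', \$result or die $!;
my $count = copy_first_articles($in, $out, 3);
close $in  or die $!;
close $out or die $!;
print "$result\n";
```

Like the loop above, this copies raw text, so the closing tag of any enclosing root element is not written and the result may need one appended to be well-formed.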

Snake Plissken
That script copied the entire contents of file.xml into truncated-file.xml.
dusker
Then it's debugging time for you. Anyway there is another answer you can use if this doesn't work.
Snake Plissken
Would you mind sharing that other solution? Thanks.
dusker
I was referring to the other answer on this thread: http://stackoverflow.com/questions/2964637/parsing-xml-file-with-perl-regex/2964681#2964681
Snake Plissken