The XML structure is as follows:

<Entities>
    <Entity>
        <EntityName>.... </EntityName>
        <EntityType>.... </EntityType>
        <Tables>
            <DataTables>
                <DataTable>1</DataTable>
                <DataTable>2</DataTable>
                <DataTable>3</DataTable>
                <DataTable>4</DataTable>
            </DataTables>
            <OtherTables>
                <OtherTable>5</OtherTable>
                <OtherTable>6</OtherTable>
            </OtherTables>
        </Tables>
    </Entity>
.
.
.
</Entities>

I need to parse the file based on the Entity name selected and retrieve all the tables specifically in the order mentioned. How do I do this in Perl and which module should be used?

A: 

See XML::Simple.

Before using it, keep in mind these points from its documentation:

XML::Simple is able to present a simple API because it makes some assumptions on your behalf. These include:

  • You're not interested in text content consisting only of whitespace
  • You don't mind that when things get slurped into a hash the order is lost
  • You don't want fine-grained control of the formatting of generated XML
  • You would never use a hash key that was not a legal XML element name
  • You don't need help converting between different encodings

For event based parsing, use SAX (do not set out to write any new code for XML::Parser's handler API - it is obsolete).

For tree-based parsing, you could choose between the 'Perlish' approach of XML::Twig and more standards based DOM implementations - preferably one with XPath support.

source: XML-Simple

For more detail about Perl-XML, see Perl-XML

Nikhil Jain
Thanks, but I have already tried XML::Simple. A reference said "the elements are in a different order, since hashes don't preserve the order of items they contain". So I doubt that the order of the tables will be maintained.
Abhi
@Abhi: That's true. XML::Simple assumes you don't mind that the order is lost when things get slurped into a hash.
Nikhil Jain
Bad about this answer: the advice to use XML::Simple. Very good: the explanation why it's bad.
reinierpost
+6  A: 

My favourite module to parse XML in Perl is XML::Twig (tutorial).

Code Sample:

use strict;
use warnings;
use XML::Twig;

my $xml_file = shift @ARGV;                     # path to the XML file

my $twig = XML::Twig->new(
    twig_handlers => {
        # calls get_tables() for each Entity element
        Entity    => sub { get_tables($_); },
    },
    pretty_print  => 'indented',                # output will be nicely formatted
    empty_tags    => 'html',                    # outputs <empty_tag />
    keep_encoding => 1,
);

$twig->parsefile($xml_file);
$twig->flush;                                   # prints whatever is still in memory

sub get_tables {
    my $entity = shift;
    my $tables = $entity->first_child('Tables');

    # Retrieves the DataTable sub-elements of DataTables
    my @data_tables = $tables->first_child('DataTables')->children('DataTable');
    # Do stuff with the DataTables

    # Retrieves the OtherTable sub-elements of OtherTables
    my @other_tables = $tables->first_child('OtherTables')->children('OtherTable');
    # Do stuff with the OtherTables

    # Frees the memory used by the elements parsed so far
    $entity->purge;
}
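
Since the question asks for the tables of one selected Entity, here is a minimal sketch of how the handler above might filter by name and collect the values in document order. The helper name tables_for_entity and its arguments are made up for illustration, and it parses an in-memory string instead of a file.

```perl
use strict;
use warnings;
use XML::Twig;

# Hypothetical helper (not part of XML::Twig): returns the text of every
# table under the Entity whose EntityName equals $wanted, in document order.
sub tables_for_entity {
    my ($xml, $wanted) = @_;
    my @tables;

    my $twig = XML::Twig->new(
        twig_handlers => {
            Entity => sub {
                my ($t, $entity) = @_;
                if ($entity->first_child_text('EntityName') eq $wanted) {
                    my $tables = $entity->first_child('Tables');
                    if ($tables) {
                        # children() returns elements in document order;
                        # '#ELT' skips text nodes and comments
                        for my $group ($tables->children('#ELT')) {
                            push @tables, map { $_->text } $group->children('#ELT');
                        }
                    }
                }
                $t->purge;    # free the memory used by this Entity
            },
        },
    );

    $twig->parse($xml);
    return @tables;
}
```

Called as tables_for_entity($xml_string, 'foo'), this should return the DataTable values followed by the OtherTable values, because that is the order in which they appear in the file.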
Bart J
Also, the list of children of any element will be in document order, i.e. the same order as in the XML file.
Bart J
+2  A: 

Document order is defined as

There is an ordering, document order, defined on all the nodes in the document corresponding to the order in which the first character of the XML representation of each node occurs in the XML representation of the document after expansion of general entities. Thus, the root node will be the first node. Element nodes occur before their children. Thus, document order orders element nodes in order of the occurrence of their start-tag in the XML (after expansion of entities).

In other words, the order in which things occur in the XML document. The XML::XPath module produces results in document order. For example:

#! /usr/bin/perl

use warnings;
use strict;

use XML::XPath;

my $entity_template = "/Entities"
                    . "/Entity"
                    .   "[EntityName='!!NAME!!']"
                    ;

my $tables_path = join "|" =>
                  qw( ./Tables/DataTables/DataTable
                      ./Tables/OtherTables/OtherTable );

my $xp = XML::XPath->new(ioref => *DATA);

foreach my $ename (qw/ foo bar /) {
  print "$ename:\n";
  (my $path = $entity_template) =~ s/!!NAME!!/$ename/g;
  foreach my $n ($xp->findnodes($path)) {
    foreach my $t ($xp->findnodes($tables_path, $n)) {
      print $t->toString, "\n";
    }
  }
}

__DATA__

The first expression searches for <Entity> elements that have an <EntityName> child whose string-value is the selected Entity name. From there, we look for <DataTable> or <OtherTable> elements.

Given input of

<Entities>
    <Entity>
        <EntityName>foo</EntityName>
        <EntityType>type1</EntityType>
        <Tables>
            <DataTables>
                <DataTable>1</DataTable>
                <DataTable>2</DataTable>
            </DataTables>
            <OtherTables>
                <OtherTable>3</OtherTable>
                <OtherTable>4</OtherTable>
            </OtherTables>
        </Tables>
    </Entity>
    <Entity>
        <EntityName>bar</EntityName>
        <EntityType>type2</EntityType>
        <Tables>
            <DataTables>
                <DataTable>5</DataTable>
                <DataTable>6</DataTable>
            </DataTables>
            <OtherTables>
                <OtherTable>7</OtherTable>
                <OtherTable>8</OtherTable>
            </OtherTables>
        </Tables>
    </Entity>
</Entities>

the output is

foo:
<DataTable>1</DataTable>
<DataTable>2</DataTable>
<OtherTable>3</OtherTable>
<OtherTable>4</OtherTable>
bar:
<DataTable>5</DataTable>
<DataTable>6</DataTable>
<OtherTable>7</OtherTable>
<OtherTable>8</OtherTable>

To extract the string-values (the “inner text”), change $tables_path to

my $tables_path = ". / Tables / DataTables  / DataTable  / text() |
                   . / Tables / OtherTables / OtherTable / text()";

Yes, that's repetitive—because XML::XPath implements XPath 1.0.

Output:

foo:
1
2
3
4
bar:
5
6
7
8
Greg Bacon
Hi, how can I get only the values using XPath? For example: 1 2 3 4
Abhi
@Abhi See updated answer.
Greg Bacon
A: 

I prefer XML::LibXML, which allows you (and me) to use XPath to select elements.

You may wish to look at a script I wrote with it.
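
As a rough illustration (this is not the script mentioned above), a minimal sketch with XML::LibXML could look like the following. The helper name tables_via_libxml is hypothetical, and the Entity name is compared in Perl rather than interpolated into the XPath expression, which avoids quoting problems.

```perl
use strict;
use warnings;
use XML::LibXML;

# Hypothetical helper: returns the text of every DataTable and OtherTable
# under the Entity named $wanted, in document order.
sub tables_via_libxml {
    my ($xml, $wanted) = @_;
    my $doc = XML::LibXML->load_xml(string => $xml);

    my @tables;
    for my $entity ($doc->findnodes('/Entities/Entity')) {
        # findvalue() returns the string-value of the EntityName child
        next unless $entity->findvalue('EntityName') eq $wanted;

        # The union of both paths comes back in document order
        push @tables, map { $_->textContent } $entity->findnodes(
            'Tables/DataTables/DataTable | Tables/OtherTables/OtherTable'
        );
    }
    return @tables;
}
```

To read from a file instead of a string, load_xml also accepts location => $path.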

reinierpost