Hi, I am parsing an XML file using XML::LibXML in Perl. The problem is that the whitespace between elements is treated as a text node. For instance, given the following input:

<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE books [
    <!ELEMENT title  (#PCDATA)>
    <!ELEMENT author (#PCDATA)>
    <!ELEMENT year   (#PCDATA)>
    <!ELEMENT price  (#PCDATA)>
    <!ELEMENT book   (title, author, year, price)>
    <!ELEMENT books  (book*)>
]>
<books>
<book>
<title>Everyday Italian</title>
<author>Giada De Laurentiis</author>
<year>2005</year>
<price>30.00</price>
</book>
</books>

The parser reports that the "books" node has 3 children:

  • text node (containing the whitespace between <books> and <book>)
  • element node of <book>
  • text node (containing the whitespace between </book> and </books>)

The question is: how do I tell LibXML to ignore whitespace? I tried no_blanks (that is, $parser = XML::LibXML->new(no_blanks => 1) when constructing the parser), but it seems to have no effect.

Thanks in advance

+1  A: 

XML::LibXML::Parser has $parser->keep_blanks(0). It is the inverse of no_blanks, so keep_blanks(0) should behave like no_blanks => 1. See if that works.
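A minimal sketch of that suggestion, assuming the document is held in a string ($xml here is just the sample from the question) and is parsed through the same $parser object the option was set on. Options set on a parser object are not picked up by a separate class-method call such as XML::LibXML->load_xml(...), so that is worth checking if the setting appears to do nothing.

```perl
use strict;
use warnings;
use XML::LibXML;

# Sample document from the question, inlined so the sketch is self-contained.
my $xml = <<'XML';
<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE books [
    <!ELEMENT title  (#PCDATA)>
    <!ELEMENT author (#PCDATA)>
    <!ELEMENT year   (#PCDATA)>
    <!ELEMENT price  (#PCDATA)>
    <!ELEMENT book   (title, author, year, price)>
    <!ELEMENT books  (book*)>
]>
<books>
<book>
<title>Everyday Italian</title>
<author>Giada De Laurentiis</author>
<year>2005</year>
<price>30.00</price>
</book>
</books>
XML

# Set the option on the parser object ...
my $parser = XML::LibXML->new;
$parser->keep_blanks(0);

# ... and parse with that same object, so the option actually applies.
my $document = $parser->parse_string($xml);

my @children = $document->documentElement->childNodes;
printf "children of <books>: %d\n", scalar @children;
```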

DVK
Thanks for the suggestion but it does not help. I tried it on Linux and Cygwin.
Gilbeg
A: 

Strictly speaking, XML::LibXML is doing the correct thing: there are three child nodes of the <books> element. The question is, how are you processing the content, and why is this a problem?

Assuming you've parsed your content and assigned the result to $document, you now have an instance of the XML::LibXML::Document class. Using this, you can get the <books> element by using documentElement():

$books = $document->documentElement();

This returns an instance of XML::LibXML::Element. From this, you can get just the <book> child-elements using getChildrenByTagName():

@book_elements = $books->getChildrenByTagName('book');
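Put together, the two calls above might look like the following sketch. It assumes the document is parsed from a string with default settings (so the whitespace text nodes are still in the tree); $xml is a shortened, DTD-free stand-in for the sample in the question, and nonBlankChildNodes() is shown only as an alternative way to skip whitespace-only nodes.

```perl
use strict;
use warnings;
use XML::LibXML;

# Shortened stand-in for the question's sample (no DTD needed for this part).
my $xml = <<'XML';
<?xml version="1.0" encoding="UTF-8"?>
<books>
<book>
<title>Everyday Italian</title>
<author>Giada De Laurentiis</author>
</book>
</books>
XML

# Default parsing keeps whitespace between elements as text nodes.
my $document = XML::LibXML->load_xml(string => $xml);
my $books    = $document->documentElement();

# All children, including the whitespace-only text nodes.
my @all_children = $books->childNodes;

# Only the <book> child elements; text nodes are never returned here.
my @book_elements = $books->getChildrenByTagName('book');

# Alternative: every child that is not a whitespace-only text node.
my @non_blank = $books->nonBlankChildNodes;

printf "all: %d, book elements: %d, non-blank: %d\n",
    scalar @all_children, scalar @book_elements, scalar @non_blank;
```

Filtering at access time like this leaves the tree untouched, which can matter if the document is later serialized and the original layout should be kept.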

Does this help?

rjray
Hi, I pretty much did what you mentioned. In snippet form it is: $dom = XML::LibXML->load_xml(location => "books.xml"); $dom->validate(); $root = $dom->documentElement(); @x = $root->childNodes; The size of @x is 3. It seems that LibXML is broken. The call to validate() does validate the DOM against the DTD. I know this because if I swap the order of title and author, the parser complains. However, the parser fails to understand from the DTD that the children of books can only be elements (book, which holds title, author, year and price), no PCDATA at all. So where does this text node come from?
Gilbeg
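
On the snippet in the comment above: load_xml is called there as a class method, so it builds its own parser from the options passed in that call, and a no_blanks or keep_blanks setting made on a separate $parser object would not apply. Validation only checks the tree against the DTD; it does not remove whitespace text nodes that the parser kept. A hedged sketch of the same snippet with no_blanks passed directly, using a string source in place of books.xml (an assumption, to keep it self-contained):

```perl
use strict;
use warnings;
use XML::LibXML;

# Same sample as in the question; a string stands in for books.xml.
my $xml = <<'XML';
<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE books [
    <!ELEMENT title  (#PCDATA)>
    <!ELEMENT author (#PCDATA)>
    <!ELEMENT year   (#PCDATA)>
    <!ELEMENT price  (#PCDATA)>
    <!ELEMENT book   (title, author, year, price)>
    <!ELEMENT books  (book*)>
]>
<books>
<book>
<title>Everyday Italian</title>
<author>Giada De Laurentiis</author>
<year>2005</year>
<price>30.00</price>
</book>
</books>
XML

# Pass no_blanks to the call that actually does the parsing.
my $dom = XML::LibXML->load_xml(string => $xml, no_blanks => 1);

# validate() dies if the document does not conform to its DTD.
$dom->validate();

my $root = $dom->documentElement();
my @x    = $root->childNodes;
printf "children of <books>: %d\n", scalar @x;
```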