I am new to XML and the DOM. I think I need to use the DOM API to go through every non-text node once and output the node name.

Say I have this example XML from W3C:

<bookstore>

<book category="cooking">
 <title lang="en">Everyday Italian</title>
 <author>Giada De Laurentiis</author>
 <year>2005</year>
 <price>30.00</price>
 <page pagenumber="550"/>
</book>

<book category="children">
 <title lang="en">Harry Potter</title>
 <author>J K. Rowling</author>
 <year>2005</year>
 <price>29.99</price>
 <page pagenumber="500"/>
</book>
</bookstore>

I need to find nodes such as <page pagenumber="500"/>, which is a non-text node.

How can I do that? Pseudo-code would be fine too. Thanks.

Can I say:

 while (x.nodeValue == NULL) {
   read the next node ?
}

I guess I should make myself clear: I make no assumptions about the documents. This should work on any XML as long as it contains a non-text node. I guess it should be done top-down and left to right across all the nodes. :(

A: 

What do you know about the node you need to find? If you know exactly that it's:

  • A page element
  • It has a pagenumber attribute with value 500

then XPath is the way forward (assuming it's available on your platform - you haven't specified beyond "DOM"; most DOM implementations include XPath as far as I've seen).

In this case you'd use an XPath of:

//page[@pagenumber='500']

If you can't use XPath, please explain which DOM API you're using and we can try to come up with the best solution. Basically you'll probably end up iterating over every element node, checking whether its name is page and then checking whether it has an appropriate pagenumber attribute value.
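As a rough illustration of the iteration described above, here is a minimal sketch using Python's standard xml.dom.minidom. The choice of Python, the shortened SAMPLE document, and the helper name find_elements are assumptions made for this example; the question does not say which DOM API is in use.

```python
from xml.dom import minidom

# Shortened copy of the sample document (assumption: trimmed for brevity).
SAMPLE = """<bookstore>
<book category="cooking">
 <title lang="en">Everyday Italian</title>
 <page pagenumber="550"/>
</book>
<book category="children">
 <title lang="en">Harry Potter</title>
 <page pagenumber="500"/>
</book>
</bookstore>"""


def find_elements(xml_text, name, attr, value):
    """Hypothetical helper: elements called `name` whose `attr` equals `value`."""
    doc = minidom.parseString(xml_text)
    matches = []
    # getElementsByTagName("*") visits every element in document order.
    for element in doc.getElementsByTagName("*"):
        if element.tagName == name and element.getAttribute(attr) == value:
            matches.append(element)
    return matches


pages = find_elements(SAMPLE, "page", "pagenumber", "500")
for page in pages:
    print(page.toxml())
```

This does the same job as the XPath //page[@pagenumber='500'], only by hand.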

Jon Skeet
Well, I guess I don't know what the node and its attribute are yet in this case. So what can I do then?
Jonathan
So what *do* you know? What differentiates "the node that you want" from one that you don't?
Jon Skeet
I don't know yet. This is supposed to work on all XML documents that have a non-text node, so I have no assumptions about which node will come next. As long as it's a non-text node inside an XML document, I want to find it.
Jonathan
@Jon: The property that differentiates the page node is that it contains no textual child nodes. Please see my answer. I do, however, agree that knowing which DOM API is being used is possibly important for an accurate answer.
Cerebrus
Ah. There's a huge difference between "node which doesn't contain any text nodes" and "non-text nodes". An element is a non-text node, even if it *contains* text. Deleting this answer as it's pointless now...
Jon Skeet
A: 

Looks like you'll be needing XPath. The W3Schools site has a good reference, but, assuming the page node always appears under a book node, the XPath /bookstore/book/page will return a node set with each page node in it. /bookstore/book/page[@pagenumber='500'] will get each page node where the pagenumber attribute has a value of 500.

The // syntax will find the node anywhere in the document without worrying about structure - this can be easier but is slower, especially with large documents. If you have a document with a known structure, it's best to use the explicit XPath.
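To make the difference between the explicit path and the // form concrete, here is a small sketch with Python's standard xml.etree.ElementTree, which supports only a limited XPath subset. Python itself and the shortened SAMPLE document are assumptions for this example. ElementTree paths are evaluated relative to the root element, so /bookstore/book/page is written as ./book/page and //page as .//page.

```python
import xml.etree.ElementTree as ET

# Shortened copy of the sample document (assumption: trimmed for brevity).
SAMPLE = """<bookstore>
<book category="cooking">
 <title lang="en">Everyday Italian</title>
 <page pagenumber="550"/>
</book>
<book category="children">
 <title lang="en">Harry Potter</title>
 <page pagenumber="500"/>
</book>
</bookstore>"""

root = ET.fromstring(SAMPLE)  # root is the <bookstore> element

# Explicit structure: only <page> elements that are children of a <book>
# directly under the root.
explicit = root.findall("./book/page[@pagenumber='500']")

# Descendant search: <page> elements at any depth below the root.
anywhere = root.findall(".//page[@pagenumber='500']")

print([e.attrib for e in explicit])
print([e.attrib for e in anywhere])
```

On this document both queries find the same single element. They would differ only if a page element appeared somewhere other than directly under a book.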

Graham Clark
Thanks, but I don't know what the node is going to look like. I guess that's why I need to use the DOM.
Jonathan
+1  A: 

Your question basically seems to be: given an XML document, how do I find child nodes that do not have any text content?

A simple XPath expression such as:

/bookstore/book/*[count(child::text()) = 0]

or

/bookstore/book/*[not(text())]

will do it for you. Applying this XPath expression on the sample document will return a node-set containing both the page elements. You do not have to know the name of the page element beforehand, or even the names of all possible child elements of the book element, as you can see.

To explain: You need to query for child elements of the book element that do not contain ANY text-node children. The * step (short for child::*) selects all child elements of the current node, and the text() node test matches text-node children, so the predicate keeps only those elements for which text() selects nothing.
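For readers without an XPath engine, the same check can be sketched with plain DOM calls. This is a minimal sketch using Python's xml.dom.minidom. The language, the shortened SAMPLE document, and the helper name children_without_text are assumptions for illustration. It also matches any book element rather than only /bookstore/book, so it is an approximation of the expression above.

```python
from xml.dom import minidom, Node

# Shortened copy of the sample document (assumption: trimmed for brevity).
SAMPLE = """<bookstore>
<book category="cooking">
 <title lang="en">Everyday Italian</title>
 <price>30.00</price>
 <page pagenumber="550"/>
</book>
<book category="children">
 <title lang="en">Harry Potter</title>
 <price>29.99</price>
 <page pagenumber="500"/>
</book>
</bookstore>"""

TEXT_TYPES = (Node.TEXT_NODE, Node.CDATA_SECTION_NODE)


def children_without_text(xml_text, parent_name):
    """Hypothetical helper: child elements of `parent_name` with no text children."""
    doc = minidom.parseString(xml_text)
    result = []
    for parent in doc.getElementsByTagName(parent_name):
        for child in parent.childNodes:
            if child.nodeType != Node.ELEMENT_NODE:
                continue  # skip the whitespace text nodes between elements
            if not any(n.nodeType in TEXT_TYPES for n in child.childNodes):
                result.append(child)
    return result


names = [e.tagName for e in children_without_text(SAMPLE, "book")]
print(names)
```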

Edit: Note that if you want to query for non-text nodes in any XML document (as per your latest edit to the question), you should choose the answer provided by nils_gate. My answer was given prior to your edit and illustrates the concept, rather than providing a generic solution.

Cerebrus
This seems to do the job. However, I can't figure out why would you have '/bookstore/book' in your XPath expression. FJ states the bookstore XML is an example XML doc from W3C and that he wants to do this for ANY xml. So wouldn't the solution be something like //*[count(child::text()) = 0] ?
Peter Perháč
Good point, @Master. Apparently, FJ made that edit to his post *after* I had posted my answer.
Cerebrus
+2  A: 

XPath = "//*[not(text())]"
This will select all elements that have no text-node children of their own.
In the given example, bookstore and book also qualify, as they do not have any text of their own (assuming the whitespace between their child elements is stripped or ignored), though their children do have text.
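If XPath is not available, a top-down, left-to-right DOM walk gives a comparable result. The sketch below uses Python's xml.dom.minidom. The language, the function name elements_without_text, and the ignore_whitespace flag are assumptions for illustration. The flag matters because minidom keeps the whitespace between elements as text nodes, so whether bookstore and book count as having no text depends on how that whitespace is treated.

```python
from xml.dom import minidom, Node

SAMPLE = """<bookstore>

<book category="cooking">
 <title lang="en">Everyday Italian</title>
 <author>Giada De Laurentiis</author>
 <year>2005</year>
 <price>30.00</price>
 <page pagenumber="550"/>
</book>

<book category="children">
 <title lang="en">Harry Potter</title>
 <author>J K. Rowling</author>
 <year>2005</year>
 <price>29.99</price>
 <page pagenumber="500"/>
</book>
</bookstore>"""

TEXT_TYPES = (Node.TEXT_NODE, Node.CDATA_SECTION_NODE)


def elements_without_text(node, ignore_whitespace=True):
    """Yield names of elements with no text of their own, in document order."""
    for child in node.childNodes:
        if child.nodeType != Node.ELEMENT_NODE:
            continue
        texts = [n.data for n in child.childNodes if n.nodeType in TEXT_TYPES]
        if ignore_whitespace:
            # Drop text nodes that hold only indentation or newlines.
            texts = [t for t in texts if t.strip()]
        if not texts:
            yield child.tagName
        # Recurse so every descendant is visited exactly once.
        yield from elements_without_text(child, ignore_whitespace)


doc = minidom.parseString(SAMPLE)
print(list(elements_without_text(doc)))
print(list(elements_without_text(doc, ignore_whitespace=False)))
```

With whitespace ignored, the walk reports bookstore, book and page elements. With whitespace kept, only the two page elements remain.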

nils_gate
Good one. That's another (possibly more straightforward) way of writing the XPath.
Cerebrus