ansaurus

Question

Answer 1

+1 A:

getElementsByTagName is recursive: it returns all descendants with a matching tagName. Because your Topics contain other Topics that also have Titles, calling it on each Topic returns the lower-down Titles many times.

If you want only the matching direct children, and you don't have XPath available, you can write a simple filter, e.g.:

def getChildrenByTagName(node, tagName):
    for child in node.childNodes:
        if child.nodeType==child.ELEMENT_NODE and (tagName=='*' or child.tagName==tagName):
            yield child

for topic in document.getElementsByTagName('Topic'):
    title= list(getChildrenByTagName(topic, 'Title'))[0]         # or just next(get(...))
    print title.firstChild.data
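
To see the difference concretely, here is a small self-contained sketch written for Python 3. The XML string is a made-up stand-in for the question's file, which isn't shown here, and get_children_by_tag_name is just a snake_case spelling of the filter above:

```python
import xml.dom.minidom as minidom

# Hypothetical sample; the real docmap.xml from the question is not shown.
SAMPLE = """<DocMap>
  <Topic><Title>Overview</Title>
    <Topic><Title>Basic Features</Title></Topic>
    <Topic><Title>About This Software</Title>
      <Topic><Title>Platforms Supported</Title></Topic>
    </Topic>
  </Topic>
</DocMap>"""


def get_children_by_tag_name(node, tag_name):
    """Yield only the direct element children of node whose tag matches."""
    for child in node.childNodes:
        if child.nodeType == child.ELEMENT_NODE and tag_name in ('*', child.tagName):
            yield child


doc = minidom.parseString(SAMPLE)
overview = doc.getElementsByTagName('Topic')[0]

# Recursive lookup: every Title anywhere below the first Topic.
all_titles = [t.firstChild.data for t in overview.getElementsByTagName('Title')]

# Direct children only: just the Topic's own Title.
own_titles = [t.firstChild.data for t in get_children_by_tag_name(overview, 'Title')]

print(all_titles)
print(own_titles)
```

The first list contains the nested Titles as well, which is where the repeats come from once you loop over every Topic. The second list holds only the Topic's own Title.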

bobince 2009-10-20 22:17:38

Thanks for the attempt. It didn't work but it gave me some ideas. The following works (the same general idea; FWIW, the nodeType is ELEMENT_NODE):

import xml.dom.minidom
from xml.dom.minidom import Node

dom = xml.dom.minidom.parse("docmap.xml")

def getChildrenByTitle(node):
    for child in node.childNodes:
        if child.localName=='Title':
            yield child

Topic=dom.getElementsByTagName('Topic')
for node in Topic:
    alist=getChildrenByTitle(node)
    for a in alist:
#        Title= a.firstChild.data
        Title= a.childNodes[0].nodeValue
        print Title

hWorks 2009-10-21 00:03:13

Oops yes, I meant ELEMENT not TEXT of course! doh, fixed

bobince 2009-10-21 02:12:54

Answer 2

+1 A:

Let me put that comment here ...

Thanks for the attempt. It didn't work but it gave me some ideas. The following works (the same general idea; FWIW, the nodeType is ELEMENT_NODE):

import xml.dom.minidom
from xml.dom.minidom import Node

dom = xml.dom.minidom.parse("docmap.xml")

def getChildrenByTitle(node):
    for child in node.childNodes:
        if child.localName=='Title':
            yield child

Topic=dom.getElementsByTagName('Topic')
for node in Topic:
    alist=getChildrenByTitle(node)
    for a in alist:
#        Title= a.firstChild.data
        Title= a.childNodes[0].nodeValue
        print Title

hWorks 2009-10-21 00:04:10

I would call the function getTitle (or `get_title`), and have it not return all immediate child Title elements, but just the first one (as there should be just one title per child, anyway).
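
Something along these lines, as a rough Python 3 sketch. The sample XML is invented for illustration, and returning an empty string for an empty Title (and None when there is no Title at all) is my own assumption about what you would want:

```python
import xml.dom.minidom as minidom


def get_title(topic):
    """Return the text of the first direct Title child of topic, or None."""
    for child in topic.childNodes:
        if child.nodeType == child.ELEMENT_NODE and child.tagName == 'Title':
            # An empty <Title/> has no text node, so guard against that.
            return child.firstChild.data if child.firstChild is not None else ''
    return None


# Hypothetical sample document, not the one from the question.
doc = minidom.parseString(
    '<DocMap><Topic><Title>Overview</Title>'
    '<Topic><Title>Basic Features</Title></Topic></Topic></DocMap>'
)
titles = [get_title(topic) for topic in doc.getElementsByTagName('Topic')]
print(titles)
```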

Martin v. Löwis 2009-10-21 03:52:27

Maybe this is what I'm not getting. I want the titles of all immediate children. Maybe a better name would be getTitlesOfChildren.

hWorks 2009-10-21 16:37:46

Answer 3

+3 A:

You could use the following generator to run through the list and get titles with indentation levels:

def f(elem, level=-1):
    if elem.nodeName == "Title":
        yield elem.childNodes[0].nodeValue, level
    elif elem.nodeType == elem.ELEMENT_NODE:
        for child in elem.childNodes:
            for e, l in f(child, level + 1):
                yield e, l

If you test it with your file:

import xml.dom.minidom as minidom
doc = minidom.parse("test.xml")
list(f(doc.documentElement))  # start at the root element; the Document node itself is not an ELEMENT_NODE

you will get a list with the following tuples:

(u'My Document', 1), 
(u'Overview', 1), 
(u'Basic Features', 2), 
(u'About This Software', 2), 
(u'Platforms Supported', 3)

It is only a basic idea to be fine-tuned of course. If you just want spaces at the beginning you can code that directly in the generator, though with the level you have more flexibility. You could also detect the first level automatically (here it's just a poor job of initializing the level to -1...).
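
As a rough sketch of that fine-tuning (Python 3, since it uses yield from): the generator is the same idea as above, and a small wrapper subtracts the smallest level it sees so the first level is detected automatically. The sample XML and the names titles_with_levels and indented_titles are made up for illustration, since the original file isn't shown here.

```python
import xml.dom.minidom as minidom

# Hypothetical sample; the real test.xml is not shown.
SAMPLE = (
    '<DocMap>'
    '<Topic><Title>My Document</Title></Topic>'
    '<Topic><Title>Overview</Title>'
    '<Topic><Title>Basic Features</Title></Topic>'
    '<Topic><Title>About This Software</Title>'
    '<Topic><Title>Platforms Supported</Title></Topic>'
    '</Topic>'
    '</Topic>'
    '</DocMap>'
)


def titles_with_levels(elem, level=0):
    """Yield (title_text, nesting_level) for every Title below elem."""
    if elem.nodeName == "Title":
        yield elem.childNodes[0].nodeValue, level
    elif elem.nodeType == elem.ELEMENT_NODE:
        for child in elem.childNodes:
            yield from titles_with_levels(child, level + 1)


def indented_titles(root, indent="  "):
    """Return the titles as strings, indented relative to the shallowest one."""
    pairs = list(titles_with_levels(root))
    base = min(level for _, level in pairs) if pairs else 0
    return [indent * (level - base) + title for title, level in pairs]


doc = minidom.parseString(SAMPLE)
lines = indented_titles(doc.documentElement)
print("\n".join(lines))
```

Keeping the level in the generator and doing the indentation in a separate step means the same pairs can be reused for other output formats.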

RedGlyph 2009-10-21 18:45:23

Exactly what I've been trying to do all day before coming upon generators. Many thanks.

hWorks 2009-10-21 21:42:20


XML Parsing with Python and minidom
