I'm able to get the value of the Image tag (see XML below), but not the Category tag. The difference is that Category holds a CDATA section and Image holds plain text. Any help would be appreciated.

from xml.dom import minidom

xml = """<?xml version="1.0" ?>
<ProductData>
    <ITEM Id="0471195">
     <Category>
            <![CDATA[Homogenizers]]>     
        </Category>
     <Image>
      0471195.jpg
     </Image>
    </ITEM>
    <ITEM Id="0471195">
     <Category>
            <![CDATA[Homogenizers]]>     
        </Category>
     <Image>
      0471196.jpg
     </Image>
    </ITEM>
</ProductData>
"""

bad_xml_item_count = 0
data = {}
xml_data = minidom.parseString(xml).getElementsByTagName('ProductData')
parts = xml_data[0].getElementsByTagName('ITEM')
for p in parts:
    try:
        part_id = p.attributes['Id'].value.strip()
    except KeyError:
        bad_xml_item_count += 1
        continue
    if not part_id:
        bad_xml_item_count += 1
        continue
    part_image = p.getElementsByTagName('Image')[0].firstChild.nodeValue.strip()
    part_category = p.getElementsByTagName('Category')[0].firstChild.data.strip()
    print('\t'.join([part_id, part_category, part_image]))
+4  A: 

p.getElementsByTagName('Category')[0].firstChild

minidom does not flatten <![CDATA[ sections into plain text; it leaves them as DOM CDATASection nodes. (Arguably it should, at least optionally. DOM Level 3 LS defaults to flattening them, for what it's worth, but minidom is much older than DOM L3.)

So the firstChild of Category is a Text node representing the whitespace between the <Category> open tag and the start of the CDATA section. It has two siblings: the CDATASection node, and another trailing whitespace Text node.
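
To see the three siblings for yourself, here is a minimal sketch that lists the children of one Category element. The trimmed-down XML string and the `names` lookup table are only for illustration and are not part of the question's code.

```python
from xml.dom import minidom

# A cut-down version of the question's XML, with whitespace around the CDATA.
doc = minidom.parseString(
    "<ITEM><Category>\n    <![CDATA[Homogenizers]]>\n</Category></ITEM>"
)
category = doc.getElementsByTagName('Category')[0]

# Map numeric node types to readable names (illustrative helper only).
names = {
    category.TEXT_NODE: 'Text',
    category.CDATA_SECTION_NODE: 'CDATASection',
}

# Collect (type name, data) for every child of <Category>.
children = [(names[n.nodeType], n.data) for n in category.childNodes]
for kind, data in children:
    print(kind, repr(data))
```

The first and last entries should be whitespace-only Text nodes, which is why calling firstChild.data.strip() on Category gives an empty string.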

What you probably want is the textual data of all children of Category. In DOM Level 3 Core you'd just call:

p.getElementsByTagName('Category')[0].textContent

but minidom doesn't support that yet. Recent versions do, however, support another Level 3 method you can use to do the same thing in a more roundabout way:

p.getElementsByTagName('Category')[0].firstChild.wholeText
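
As a rough sketch of how that looks in practice (again on a trimmed-down XML string of my own), note that wholeText still includes the surrounding whitespace, so you will probably want to strip it. The second variant, which joins the data of the text-like children by hand, is an assumption about what you might fall back to on a minidom without wholeText.

```python
from xml.dom import minidom

doc = minidom.parseString(
    "<ITEM><Category>\n    <![CDATA[Homogenizers]]>\n</Category></ITEM>"
)
category = doc.getElementsByTagName('Category')[0]

# wholeText joins the node with its adjacent Text and CDATASection siblings.
via_whole_text = category.firstChild.wholeText.strip()

# Possible fallback: join the data of every Text/CDATASection child manually.
text_types = (category.TEXT_NODE, category.CDATA_SECTION_NODE)
via_children = ''.join(
    n.data for n in category.childNodes if n.nodeType in text_types
).strip()

print(via_whole_text)
print(via_children)
```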
bobince
+1  A: 

CDATA is its own node, so the Category elements here actually have three children: a whitespace text node, the CDATA node, and another whitespace text node. You're just looking at the wrong one, is all. I don't see any more obvious way to query for the CDATA node, but you can pull it out like this:

[n for n in category.childNodes if n.nodeType==category.CDATA_SECTION_NODE][0]
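
If some elements might not contain a CDATA section at all, the [0] above raises IndexError. A small sketch of a more forgiving version follows; the helper name `cdata_text` and its `default` parameter are made up for this example.

```python
from xml.dom import minidom

def cdata_text(element, default=None):
    """Return the data of the first CDATASection child, or default if none."""
    for node in element.childNodes:
        if node.nodeType == node.CDATA_SECTION_NODE:
            return node.data
    return default

doc = minidom.parseString(
    "<ITEM><Category>\n <![CDATA[Homogenizers]]>\n</Category>"
    "<Image>0471195.jpg</Image></ITEM>"
)
item = doc.documentElement

# <Category> has a CDATA child; <Image> has only plain text, so it falls back.
category_text = cdata_text(item.getElementsByTagName('Category')[0])
image_text = cdata_text(item.getElementsByTagName('Image')[0], default='')

print(repr(category_text))
print(repr(image_text))
```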
ironfroggy