ansaurus

Question

Answer 1

A:

Your line

Text[docno[i]] = node3.data

replaces the value of the mapping instead of appending the new one. Your <TEXT> node has both text and comment children, interleaved with each other.
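To see the interleaving, here is a minimal sketch using xml.dom.minidom. The sample document and the names `SAMPLE`, `overwritten` and `appended` are made up for illustration and are not taken from your file:

```python
from xml.dom import minidom
from xml.dom.minidom import Node

# Hypothetical sample: the comments split the TEXT content into two text nodes.
SAMPLE = (
    "<DOC><DOCNO>D1</DOCNO>"
    "<TEXT><!-- c1 -->first <!-- c2 -->second</TEXT></DOC>"
)

doc = minidom.parseString(SAMPLE)
text_elem = doc.getElementsByTagName("TEXT")[0]

# Record the kind of each child node in document order.
kinds = ["comment" if n.nodeType == Node.COMMENT_NODE else "text"
         for n in text_elem.childNodes]

overwritten = {}
appended = {}
for n in text_elem.childNodes:
    if n.nodeType == Node.TEXT_NODE:
        overwritten["D1"] = n.data                        # assignment keeps only the last piece
        appended["D1"] = appended.get("D1", "") + n.data  # concatenation keeps every piece

print(kinds)
print(overwritten)
print(appended)
```

With assignment only the last text piece survives, while concatenation keeps all of them.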

Ignacio Vazquez-Abrams 2010-09-25 23:49:51

Answer 2

A:

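The DOM parser does not strip the comments: they split the content of <TEXT> into several text nodes, and each piece is a separate Node. Checking for Node.TEXT_NODE skips the comment nodes.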

So you need to append instead of assign:

Text[docno[i]] += node3.data

For that to work, every key must already exist in the dictionary with an empty string as its value, so add Text[node3.data] = '' to your first block of code.

So, your code becomes:

L = doc.getElementsByTagName("DOCNO")
for node2 in L:
    for node3 in node2.childNodes:
        if node3.nodeType == Node.TEXT_NODE:
            docno.append(node3.data)
            Text[node3.data] = ''  # start every key with an empty string
        #print node2.data

L = doc.getElementsByTagName("TEXT")
i = 0
for node2 in L:
    for node3 in node2.childNodes:
        if node3.nodeType == Node.TEXT_NODE:
            Text[docno[i]] += node3.data  # append each text piece instead of replacing
    i = i+1

claws 2010-09-26 00:03:04

Answer 3

+4 A:

You could avoid looping through the doc twice by using xml.sax.handler:

import xml.sax.handler
import collections


class DocBuilder(xml.sax.handler.ContentHandler):
    def __init__(self):
        self.state=''
        self.docno=''
        self.text=collections.defaultdict(list)
    def startElement(self, name, attrs):
        self.state=name
    def endElement(self, name):
        if name==u'TEXT':
            self.docno=''
    def characters(self,content):        
        content=content.strip()
        if content:
            if self.state==u'DOCNO':
                self.docno+=content
            elif self.state==u'TEXT':
                self.text[self.docno].append(content)


with open('test.xml') as f:
    data=f.read()            
builder = DocBuilder()
xml.sax.parseString(data, builder)
for key,value in builder.text.items():
    print('{k}: {v}'.format(k=key,v=' '.join(value)))
# FR940104-2-00001: Federal Register / Vol. 59, No. 2 / Tuesday, January 4, 1994 / Notices Vol. 59, No. 2 Tuesday, January 4, 1994

unutbu 2010-09-26 00:20:12

Can we use lxml as the SAX parser? How does lxml.sax differ from xml.sax?

Tumbleweed 2010-09-27 13:21:15

@Tumbleweed: Yes, one could use lxml.sax.saxify instead. The syntax is almost exactly the same as for xml.sax, though you'd have to change `startElement` to `startElementNS` since lxml.sax supports namespace-aware processing only. See http://codespeak.net/lxml/sax.html
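A rough sketch of what the namespace-aware callbacks look like. To keep it runnable without lxml, this uses the standard library's xml.sax with `feature_namespaces` switched on instead of `lxml.sax.saxify`. With lxml you would parse the file into a tree and hand the tree and the handler to `lxml.sax.saxify`. The class name `NSDocBuilder` and the sample document are made up for illustration:

```python
import collections
import io
import xml.sax
import xml.sax.handler


class NSDocBuilder(xml.sax.handler.ContentHandler):
    """Same idea as DocBuilder, written against the *NS callbacks."""

    def __init__(self):
        super().__init__()
        self.state = ''
        self.docno = ''
        self.text = collections.defaultdict(list)

    def startElementNS(self, name, qname, attrs):
        # name is a (namespace_uri, local_name) pair
        uri, localname = name
        self.state = localname

    def endElementNS(self, name, qname):
        uri, localname = name
        if localname == 'TEXT':
            self.docno = ''

    def characters(self, content):
        content = content.strip()
        if not content:
            return
        if self.state == 'DOCNO':
            self.docno += content
        elif self.state == 'TEXT':
            self.text[self.docno].append(content)


# Hypothetical sample document, not the asker's real file.
SAMPLE = b"<DOC><DOCNO> D1 </DOCNO><TEXT><!-- c -->first <!-- c -->second</TEXT></DOC>"

parser = xml.sax.make_parser()
parser.setFeature(xml.sax.handler.feature_namespaces, True)  # use the *NS callbacks
handler = NSDocBuilder()
parser.setContentHandler(handler)
parser.parse(io.BytesIO(SAMPLE))

result = {key: ' '.join(value) for key, value in handler.text.items()}
print(result)
```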

unutbu 2010-09-27 19:05:23

@Tumbleweed: Another option would be to use `lxml.etree.iterparse` or `lxml.etree.XMLParser` with a custom target. See Liza Daly's excellent article http://www.ibm.com/developerworks/xml/library/x-hiperfparse/#ibm-pcon for an example of how to do fast iterative parsing without building an entire parse tree in memory.
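A minimal sketch of the iterative approach. It uses the standard library's `xml.etree.ElementTree.iterparse` rather than `lxml.etree.iterparse` so that it runs without lxml. lxml's version accepts a similar `events` argument. The `ROOT` wrapper element and the sample data are assumptions:

```python
import io
import xml.etree.ElementTree as ET

# Hypothetical sample with a made-up ROOT wrapper around the DOC records.
SAMPLE = b"""<ROOT>
<DOC><DOCNO> D1 </DOCNO><TEXT><!-- c -->first <!-- c -->second</TEXT></DOC>
<DOC><DOCNO> D2 </DOCNO><TEXT>third</TEXT></DOC>
</ROOT>"""

texts = {}
# An "end" event fires once an element and all of its children have been parsed.
for event, elem in ET.iterparse(io.BytesIO(SAMPLE), events=("end",)):
    if elem.tag != "DOC":
        continue
    key = elem.findtext("DOCNO", default="").strip()
    text_elem = elem.find("TEXT")
    if key and text_elem is not None:
        # Join every text piece, then collapse runs of whitespace.
        texts[key] = " ".join("".join(text_elem.itertext()).split())
    elem.clear()  # drop the finished record's children to keep memory use low

print(texts)
```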

unutbu 2010-09-27 19:17:24

Answer 4

+1 A:

Using lxml:

import lxml.etree as le
with open('test.xml') as f:
    doc=le.parse(f)

texts={}
for docno in doc.xpath('DOCNO'):
    docno_text=docno.text.strip()    
    text=' '.join([t.strip() 
          for t in  docno.xpath('following-sibling::TEXT[1]/text()')
          if t.strip()])
    texts[docno_text]=text

print(texts)
# {'FR940104-2-00001': 'Federal Register / Vol. 59, No. 2 / Tuesday, January 4, 1994 / Notices Vol. 59, No. 2 Tuesday, January 4, 1994'}

This version is a tad simpler than my first lxml solution. It handles multiple DOCNO and TEXT nodes. The DOCNO and TEXT nodes should alternate, but in any case each DOCNO is associated with the closest TEXT node that follows it.
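To illustrate the pairing rule only, here is a sketch that mimics `following-sibling::TEXT[1]` with a plain loop over the children. It uses the standard library's xml.etree.ElementTree (which lacks the `following-sibling` axis) rather than lxml, and the sample data is made up:

```python
import xml.etree.ElementTree as ET

# Hypothetical sample: B and C are adjacent DOCNOs with no TEXT in between.
SAMPLE = (
    "<DOC><DOCNO> A </DOCNO><TEXT>one</TEXT>"
    "<DOCNO>B</DOCNO><DOCNO>C</DOCNO><TEXT> three </TEXT></DOC>"
)

root = ET.fromstring(SAMPLE)
children = list(root)

pairs = {}
for i, child in enumerate(children):
    if child.tag != "DOCNO":
        continue
    # The first TEXT sibling that comes after this DOCNO, if any.
    following = next((c for c in children[i + 1:] if c.tag == "TEXT"), None)
    if following is not None:
        key = (child.text or "").strip()
        pairs[key] = " ".join("".join(following.itertext()).split())

print(pairs)
```

Under this rule, adjacent DOCNO nodes end up sharing the same TEXT node.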

unutbu 2010-09-26 02:07:27

Answer 5

+2 A:

Similar to unutbu's answer, though I think simpler:

from lxml import etree
with open('test.xml') as f:
    doc=etree.parse(f)

result={}
for elm in doc.xpath("/DOC[DOCNO]"):
    key = elm.xpath("DOCNO")[0].text.strip()
    value = "".join(t.strip() for t in elm.xpath("TEXT/text()") if t.strip())
    result[key] = value

The XPath that finds the DOC element in this example needs to be changed to be appropriate for your real document - e.g. if there's a single top-level element that all the DOC elements are children of, you'd change it to /*/DOC. The predicate on that XPath skips any DOC element that doesn't have a DOCNO child, which would otherwise cause an exception when setting the key.
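As a sketch of the /*/DOC case, assuming a made-up `ROOT` wrapper element: this version uses the standard library's xml.etree.ElementTree instead of lxml, whose `findall` supports the same `[DOCNO]` child predicate. Unlike the code above, it joins the text pieces with single spaces:

```python
import xml.etree.ElementTree as ET

# Hypothetical sample: three DOC records under one wrapper, one without a DOCNO.
SAMPLE = """<ROOT>
  <DOC><DOCNO> D1 </DOCNO><TEXT>first <!-- c -->second</TEXT></DOC>
  <DOC><TEXT>no key here</TEXT></DOC>
  <DOC><DOCNO>D2</DOCNO><TEXT>third</TEXT></DOC>
</ROOT>"""

root = ET.fromstring(SAMPLE)

result = {}
# "DOC[DOCNO]" selects only the DOC children that have a DOCNO child.
for doc_elem in root.findall("DOC[DOCNO]"):
    key = (doc_elem.find("DOCNO").text or "").strip()
    words = []
    for text_elem in doc_elem.findall("TEXT"):
        words.extend("".join(text_elem.itertext()).split())
    result[key] = " ".join(words)

print(result)
```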

Robert Rossney 2010-09-26 19:39:16

@Robert: Thanks for this. I think your version is not only simpler, it also (unlike my now deleted lxml-based answer) correctly handles adjacent `DOCNO`s with no `TEXT` in between.

unutbu 2010-09-26 19:55:50

+1 for [lxml](http://codespeak.net/lxml). Much better than Python's XML support in the standard library.

ma3 2010-09-26 20:19:08

@unutbu: it actually doesn't handle adjacent `DOCNO`s at all. It finds `DOC` elements that have at least one `DOCNO` child. For each, it looks in the first `DOCNO` element to find the key. If there are multiple `DOCNO`s, it ignores all but the first. Also, if there are multiple `TEXT` children, it concatenates their text nodes together.

Robert Rossney 2010-09-26 21:55:20

@Robert: Suppose the xml had more than one pair of DOCNO and TEXT nodes. Do you see a way to modify your code to handle this case?

unutbu 2010-09-26 23:17:26


Help with XML parsing in Python
