tags:

views:

374

answers:

4

I'm trying to parse* a large file (> 5GB) of structured markup data. The data format is essentially XML but there is no explicit root element. What's the most efficient way to do that?

The problem with SAX parsers is that they require a root element, so either I've to add a pseudo element to the data stream (is there an equivalent to Java's SequenceInputStream in Python?) or I've to switch to a non-SAX conform event-based parser (is there a successor of sgmllib?)

The structure of the data is quite simple. Basically a listing of elements:

<Document>
  <docid>1</docid>
  <text>foo</text>
</Document>
<Document>
  <docid>2</docid>
  <text>bar</text>
</Document>

*actually to iterate

+9  A: 

http://docs.python.org/library/xml.sax.html

Note, that you can pass a 'stream' object to xml.sax.parse. This means you can probably pass any object that has file-like methods (like read) to the parse call... Make your own object, which will firstly put your virtual root start-tag, then the contents of file, then virtual root end-tag. I guess that you only need to implement read method... but this might depend on the sax parser you'll use.

Example that works for me:

import xml.sax
import xml.sax.handler

class PseudoStream(object):
    def read_iterator(self):
        yield '<foo>'
        yield '<bar>'
        for line in open('test.xml'):
            yield line
        yield '</bar>'
        yield '</foo>'

    def __init__(self):
        self.ri = self.read_iterator()

    def read(self, *foo):
        try:
            return self.ri.next()
        except StopIteration:
            return ''

class SAXHandler(xml.sax.handler.ContentHandler):
    def startElement(self, name, attrs):
        print name, attrs

d = xml.sax.parse(PseudoStream(), SAXHandler())
liori
Is `return ''` really the right thing to do on `StopIteration`? How would a client of that code notice the EOF if it only used `read()` then?
Joachim Sauer
One of properties of stream-like objects in python is that a read() call either blocks and returns at least one byte, or in case of EOF, returns empty string. That's how the original file.read method works.
liori
You might want to use this in conjunction with PullDOM - it combines the streaming nature of SAX with the hierarchical nature of DOM.
RichieHindle
`ElementTree.iterparse()` could be used as more convenient alternative to SAX.
Denis Otkidach
+1  A: 

Hi,

The quick and dirty answer would be adding a root element (as String) so it would be a valid XML.

Regards.

ATorras
+1  A: 

Add root element and use SAX, STax or VTD-XML ..

vtd-xml-author
Mr. Zhang - good answer. I've upvoted it.
John Saunders
I linked the meta account with this one, where is the 100 points that you promised?
vtd-xml-author
A: 

xml.parsers.expat -- Fast XML parsing using Expat The xml.parsers.expat module is a Python interface to the Expat non-validating XML parser. The module provides a single extension type, xmlparser, that represents the current state of an XML parser. After an xmlparser object has been created, various attributes of the object can be set to handler functions. When an XML document is then fed to the parser, the handler functions are called for the character data and markup in the XML document.

More info : http://www.python.org/doc/2.5/lib/module-xml.parsers.expat.html

hemanth.hm