ansaurus

Question

Parsing large pseudo-xml files in python

Answer 1

+9 A:

http://docs.python.org/library/xml.sax.html

Note that you can pass a 'stream' object to xml.sax.parse. This means you can probably pass any object that has file-like methods (like read) to the parse call. Make your own object that first emits a virtual root start tag, then the contents of the file, then the virtual root end tag. I guess you only need to implement the read method, but this might depend on the SAX parser you use.

Example that works for me:

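import xml.sax
import xml.sax.handler

class PseudoStream(object):
    """File-like object that wraps test.xml in virtual root elements."""

    def read_iterator(self):
        yield '<foo>'
        yield '<bar>'
        with open('test.xml') as f:  # the file is closed once exhausted
            for line in f:
                yield line
        yield '</bar>'
        yield '</foo>'

    def __init__(self):
        self.ri = self.read_iterator()

    def read(self, size=-1):
        # Python 3's SAX code calls read(0) to check whether the stream
        # yields str or bytes; that probe must not consume a chunk.
        if size == 0:
            return ''
        # Otherwise the size hint is ignored. An empty string signals EOF.
        return next(self.ri, '')

    def close(self):
        # Recent Python 3 versions close the stream when parsing ends.
        self.ri.close()

class SAXHandler(xml.sax.handler.ContentHandler):
    def startElement(self, name, attrs):
        print(name, attrs)

xml.sax.parse(PseudoStream(), SAXHandler())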

liori 2009-10-02 11:26:55

Is `return ''` really the right thing to do on `StopIteration`? How would a client of that code notice the EOF if it only used `read()` then?

Joachim Sauer 2009-10-02 11:38:45

One of the properties of stream-like objects in Python is that a read() call either blocks until it can return at least one byte or, at EOF, returns an empty string. That's how the built-in file.read method works.
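
A small sketch of that convention, using io.StringIO as a stand-in for a real file (the sample content is made up for illustration):

```python
import io

# io.StringIO stands in for a real file; the content is illustrative.
stream = io.StringIO("<item/>")

first = stream.read(1024)   # returns what is available, up to 1024 chars
second = stream.read(1024)  # nothing left: an empty string signals EOF

print(repr(first))
print(repr(second))

# A typical consumer loop relies on exactly this convention.
stream.seek(0)
chunks = []
while True:
    chunk = stream.read(4)
    if not chunk:  # '' means end of stream
        break
    chunks.append(chunk)
```

This is why returning '' from read() once the iterator is exhausted is enough for a parser to notice the end of input.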

liori 2009-10-02 11:43:42

You might want to use this in conjunction with PullDOM - it combines the streaming nature of SAX with the hierarchical nature of DOM.
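
A minimal sketch of the PullDOM idea, assuming the input already has a root element (for example, via a wrapper stream). The element names and the in-memory io.StringIO source are made up for illustration:

```python
import io
from xml.dom import pulldom

# Illustrative stand-in for the wrapped stream; element names are made up.
source = io.StringIO(
    "<root><item id='1'><name>a</name></item>"
    "<item id='2'><name>b</name></item></root>"
)

names = []
events = pulldom.parse(source)
for event, node in events:
    # Stream until an <item> starts, then build only that subtree as DOM.
    if event == pulldom.START_ELEMENT and node.tagName == "item":
        events.expandNode(node)
        name = node.getElementsByTagName("name")[0]
        names.append((node.getAttribute("id"), name.firstChild.data))

print(names)
```

Only the expanded subtree is held as a DOM at any time, so memory use should stay proportional to one record rather than to the whole file.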

RichieHindle 2009-10-02 11:52:29

`ElementTree.iterparse()` could be used as a more convenient alternative to SAX.
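
A rough sketch of how that might look for a rootless file. `RootWrapped` is a hypothetical helper (not part of any library) that supplies the virtual root, and the `rec` records and the io.StringIO source are made-up stand-ins for the real file:

```python
import io
import xml.etree.ElementTree as ET

class RootWrapped:
    """Hypothetical helper: serves <root>, the file content, then </root>."""

    def __init__(self, fileobj, tag="root"):
        self._parts = iter([
            io.StringIO("<%s>" % tag),
            fileobj,
            io.StringIO("</%s>" % tag),
        ])
        self._current = next(self._parts)

    def read(self, size=-1):
        if size == 0:
            return ""
        while self._current is not None:
            data = self._current.read(size)
            if data:
                return data
            # Current part is exhausted; move on to the next one.
            self._current = next(self._parts, None)
        return ""  # EOF

# Stand-in for the large rootless file.
rootless = io.StringIO("<rec n='1'/>\n<rec n='2'/>\n")

seen = []
for event, elem in ET.iterparse(RootWrapped(rootless), events=("end",)):
    if elem.tag == "rec":
        seen.append(elem.get("n"))
        elem.clear()  # drop the content of records already handled

print(seen)
```

Calling elem.clear() after each record keeps the partially built tree small, which matters for large inputs.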

Denis Otkidach 2009-10-02 14:02:04

Answer 2

+1 A:

Hi,

The quick and dirty answer would be to add a root element (as a string) around the content so that it becomes valid XML.
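
One way this could be sketched without loading the whole file into memory is to feed the root tags as strings around the streamed content. This assumes the file has no XML declaration or DOCTYPE of its own; `parse_rootless`, `TagCollector`, and the `root` tag name are hypothetical names chosen for illustration:

```python
import io
import xml.sax
import xml.sax.handler

class TagCollector(xml.sax.handler.ContentHandler):
    """Illustrative handler that records element names as they open."""

    def __init__(self):
        super().__init__()
        self.tags = []

    def startElement(self, name, attrs):
        self.tags.append(name)

def parse_rootless(fileobj, handler, root="root"):
    # 'root' is an arbitrary wrapper tag name, not something the file defines.
    parser = xml.sax.make_parser()
    parser.setContentHandler(handler)
    parser.feed("<%s>" % root)    # virtual root start tag, as a string
    for line in fileobj:          # stream the file line by line
        parser.feed(line)
    parser.feed("</%s>" % root)   # virtual root end tag
    parser.close()

handler = TagCollector()
parse_rootless(io.StringIO("<a/>\n<b><c/></b>\n"), handler)
print(handler.tags)
```

With a real file, the io.StringIO object would be replaced by an open file handle.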

Regards.

ATorras 2009-10-02 11:39:00

Answer 3

+1 A:

Add a root element and use SAX, StAX, or VTD-XML.

vtd-xml-author 2009-10-02 17:49:17

Mr. Zhang - good answer. I've upvoted it.

John Saunders 2009-10-03 17:14:18

I linked the meta account with this one. Where are the 100 points that you promised?

vtd-xml-author 2009-10-03 19:29:41

Answer 4

A:

xml.parsers.expat -- Fast XML parsing using Expat

The xml.parsers.expat module is a Python interface to the Expat non-validating XML parser. The module provides a single extension type, xmlparser, that represents the current state of an XML parser. After an xmlparser object has been created, various attributes of the object can be set to handler functions. When an XML document is then fed to the parser, the handler functions are called for the character data and markup in the XML document.

More info : http://www.python.org/doc/2.5/lib/module-xml.parsers.expat.html
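
A minimal sketch of the handler-function style described above, applied to a rootless input by feeding a virtual root around it. The `root` and `rec` names and the in-memory io.StringIO source are assumptions for illustration:

```python
import io
import xml.parsers.expat

counts = {}

def start_element(name, attrs):
    # Called by Expat for every start tag; attrs is a dict of attributes.
    counts[name] = counts.get(name, 0) + 1

parser = xml.parsers.expat.ParserCreate()
parser.StartElementHandler = start_element

# Stand-in for the large rootless file.
rootless = io.StringIO("<rec/>\n<rec/>\n<other/>\n")

parser.Parse("<root>", False)   # virtual root start tag
for line in rootless:
    parser.Parse(line, False)   # feed the content incrementally
parser.Parse("</root>", True)   # virtual root end tag; final chunk

print(counts)
```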

hemanth.hm 2009-10-02 18:17:01
