ansaurus

Question

Is there a fast XML parser in Python that allows me to get start of tag as byte offset in stream?

Answer 1

Since locators return line and column numbers rather than a byte offset, you need a small wrapper around the stream that tracks line ends. A simplified example (it could still have some off-by-one errors ;-):

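# Python 3: the parser reads bytes, so every offset below is a byte offset.
import io
import re
from xml import sax
from xml.sax import handler

relinend = re.compile(b'\n')

txt = b'''<foo>
            <tit>Bar</tit>
        <baz>whatever</baz>
     </foo>'''
stm = io.BytesIO(txt)

class LocatingWrapper(object):
    """File-like wrapper that records the byte offset at which each line starts."""
    def __init__(self, f):
        self.f = f
        self.linelocs = [0]   # line 1 starts at offset 0
        self.curoffs = 0      # number of bytes handed to the parser so far
    def read(self, *a):
        data = self.f.read(*a)
        # m.end() is the position just past a newline, i.e. where the next line starts
        linestarts = (m.end() for m in relinend.finditer(data))
        self.linelocs.extend(x + self.curoffs for x in linestarts)
        self.curoffs += len(data)
        return data
    def close(self):
        # the Python 3 expat reader closes its input when parsing finishes
        self.f.close()
    def where(self, loc):
        # line numbers are 1-based, column numbers are 0-based
        return self.linelocs[loc.getLineNumber() - 1] + loc.getColumnNumber()

locstm = LocatingWrapper(stm)

class Handler(handler.ContentHandler):
    def setDocumentLocator(self, loc):
        self.loc = loc
    def startElement(self, name, attrs):
        # prints: tag name @ line:column (byte offset)
        print('%s@%s:%s (%s)' % (name,
                                 self.loc.getLineNumber(),
                                 self.loc.getColumnNumber(),
                                 locstm.where(self.loc)))

sax.parse(locstm, Handler())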

Of course you don't need to keep all of the linelocs around -- to save memory, you can drop "old" ones (below the latest one queried) but then you need to make linelocs a dict, etc.

Alex Martelli 2010-07-06 16:30:07
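
A possible sketch of the memory-saving idea mentioned in the answer above, assuming Python 3 and that SAX events arrive in document order, so lines before the last queried one are never needed again. The names `PruningLocator`, `OffsetCollector` and `index_tags` are hypothetical and not from the original answer; line starts are kept in a dict keyed by line number and old entries are dropped on each query.

```python
import io
import re
from xml import sax
from xml.sax import handler

NEWLINE = re.compile(b"\n")


class PruningLocator:
    """Hypothetical wrapper: tracks line starts, forgets lines already passed."""

    def __init__(self, f):
        self.f = f
        self.linestarts = {1: 0}  # line number -> byte offset of its first byte
        self.nextline = 2         # number of the next line to be discovered
        self.lowest = 1           # smallest line number still stored
        self.curoffs = 0          # bytes handed to the parser so far

    def read(self, *a):
        data = self.f.read(*a)
        for m in NEWLINE.finditer(data):
            self.linestarts[self.nextline] = self.curoffs + m.end()
            self.nextline += 1
        self.curoffs += len(data)
        return data

    def close(self):
        # The Python 3 expat reader closes its input when parsing finishes.
        self.f.close()

    def where(self, loc):
        line = loc.getLineNumber()
        # Assumption: queries never go backwards, so older lines can be dropped.
        while self.lowest < line:
            self.linestarts.pop(self.lowest, None)
            self.lowest += 1
        return self.linestarts[line] + loc.getColumnNumber()


class OffsetCollector(handler.ContentHandler):
    """Collects (tag name, byte offset) pairs instead of printing them."""

    def __init__(self, stream):
        super().__init__()
        self.stream = stream
        self.offsets = []

    def setDocumentLocator(self, loc):
        self.loc = loc

    def startElement(self, name, attrs):
        self.offsets.append((name, self.stream.where(self.loc)))


def index_tags(data):
    """Parse bytes and return the tag offsets plus the line starts still stored."""
    stream = PruningLocator(io.BytesIO(data))
    collector = OffsetCollector(stream)
    sax.parse(stream, collector)
    return collector.offsets, stream.linestarts


sample = b"<foo>\n  <tit>Bar</tit>\n  <baz>whatever</baz>\n</foo>"
offsets, remaining = index_tags(sample)
print(offsets)
print(sorted(remaining))  # line numbers that were not pruned
```

Pruning only happens on queries, so line starts discovered ahead of the parser's current position still accumulate until the next start tag is reported.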

Thanks. I got my XML indexer working with this code. I am holding off on accepting to see if there are any answers that use a faster parser. Let me know if you would like to see it as an addition to the cookbook.

James Dean 2010-07-08 19:56:48

@James, your Q explicitly said you wanted to use SAX, so I'm confused that you're now looking for other parsers within the same question (?). As for the Cookbook, thanks for offering, but I'm not currently maintaining the future edition (actually I don't know who is... if anybody... besides the online stuff at activestate of course, which isn't really gatewayed by anybody in particular and never has been).

Alex Martelli 2010-07-08 20:14:31

@Alex, when I said SAX I just meant a parser that does not load the whole document into memory. By fast I meant a parser with a speed similar to cElementTree.iterparse: http://effbot.org/zone/celementtree.htm. I am working with large documents, so if I can index them 4 times faster, that would be worth it.

James Dean 2010-07-08 21:00:11

@James, if you want "any fast incremental parser" I suggest not saying "SAX" (which more or less respects a standard == overhead;-). Anyway, you can certainly use http://docs.python.org/library/xml.etree.elementtree.html?highlight=iterparse#xml.etree.ElementTree.iterparse ... but there's no `locator` concept in etree (AFAIK). http://docs.python.org/library/pyexpat.html does offer a current byte index, http://docs.python.org/library/pyexpat.html#xml.parsers.expat.xmlparser.CurrentByteIndex , so you could try that (if you hadn't said "SAX", I'd have suggested that first!-).

Alex Martelli 2010-07-08 21:49:44
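
For illustration, a minimal sketch of the pyexpat route suggested in the last comment, assuming Python 3 and an input small enough to pass as one bytes object. The function name `tag_offsets` is hypothetical; `ParserCreate`, `StartElementHandler`, `Parse` and `CurrentByteIndex` are the `xml.parsers.expat` names from the documentation linked above.

```python
from xml.parsers import expat


def tag_offsets(data):
    """Return (tag name, byte offset of its '<') for each start tag in data."""
    offsets = []
    parser = expat.ParserCreate()

    def start_element(name, attrs):
        # Inside a handler, CurrentByteIndex refers to the start of the
        # markup that triggered the current event.
        offsets.append((name, parser.CurrentByteIndex))

    parser.StartElementHandler = start_element
    parser.Parse(data, True)  # True marks this as the final chunk
    return offsets


sample = b"<foo>\n  <tit>Bar</tit>\n  <baz>whatever</baz>\n</foo>"
for name, offset in tag_offsets(sample):
    print(name, offset)
```

For large documents, `parser.ParseFile(f)` with a file opened in binary mode reads the input in chunks instead of requiring it all in memory; no line bookkeeping is needed because the parser reports the offset directly.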
