I'm trying to parse* a large file (> 5GB) of structured markup data. The data format is essentially XML but there is no explicit root element. What's the most efficient way to do that?
The problem with SAX parsers is that they require a root element, so either I've to add a pseudo element to the data stream (is there an equivalent to Java's SequenceInputStream in Python?) or I've to switch to a non-SAX conform event-based parser (is there a successor of sgmllib?)
The structure of the data is quite simple. Basically a listing of elements:
<Document>
<docid>1</docid>
<text>foo</text>
</Document>
<Document>
<docid>2</docid>
<text>bar</text>
</Document>
*actually to iterate