I'm defining an XML format of my own which supports the additional tag "insert_file". When the parser reaches this tag, it should insert the named text file at that point in the stream and then continue parsing.
Here is an example:
my.xml:
<xml>
Something
<insert_file name="foo.html"/>
or another
</xml>
I'm using an xml.sax XMLReader as follows:
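    import xml.sax
    import xml.sax.handler
    from io import StringIO

    class HtmlHandler(xml.sax.handler.ContentHandler):
        def __init__(self):
            xml.sax.handler.ContentHandler.__init__(self)

    # html holds the contents of my.xml as a string
    parser = xml.sax.make_parser()
    parser.setContentHandler(HtmlHandler())
    parser.parse(StringIO(html))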
The question is: how do I insert the included contents directly into the parsing stream? Of course I could expand the text up front by repeatedly inserting the included text until no insert_file tags remain, but that means I have to parse the XML multiple times.
I tried replacing StringIO(html) with my own stream that allows inserting contents mid-stream, but it doesn't work because the SAX parser reads the stream in large buffered chunks.
Update:
I did find a solution that is hackish at best. It is built on the following stream class:
    class InsertReader(object):
        """A reader class that supports the concept of pushing another reader in
        the middle of the use of a first reader. This may be used for supporting
        insertion commands."""

        def __init__(self):
            self.reader_stack = []

        def push(self, reader):
            self.reader_stack.append(reader)

        def pop(self):
            self.reader_stack.pop()

        def __iter__(self):
            return self

        def read(self, n=-1):
            """Read from the topmost stack element. Never transcends elements.
            Should it?

            The code below is a hack. It feeds only a single token back to the
            reader.
            """
            if n == 0:
                # Python 3's xml.sax probes the stream type with read(0);
                # do not consume a token for that call.
                return ''
            while self.reader_stack:
                # Return a single token: everything up to and including the next '>'
                chars = []
                while True:
                    c = self.reader_stack[-1].read(1)
                    if c == '':
                        break
                    chars.append(c)
                    if c == '>':
                        break
                if not chars:
                    # Topmost reader is exhausted; fall back to the one below it
                    self.reader_stack.pop()
                    continue
                return ''.join(chars)
            return ''

        def __next__(self):
            while self.reader_stack:
                try:
                    return next(self.reader_stack[-1])
                except StopIteration:
                    self.reader_stack.pop()
            raise StopIteration

        next = __next__  # Python 2 name for the iterator method
This class creates a stream that limits how many characters each read returns to the caller. That is, even if the XML parser calls read(16386), the class returns characters only up to the next '>' character. Since '>' also marks the end of a tag, the content handler gets the opportunity to push a reader for the included file onto the stack before the parser reads any further.
What is hackish about this solution is the following:
- Reading one character at a time from a stream is slow.
- It implicitly assumes how the SAX parser reads from the stream, namely that it fully processes each chunk it receives before requesting the next one.
This solves the problem for me, but I'm still interested in a more beautiful solution.