tags:

views:

71

answers:

1

I'm defining an xml schema of my own which supports the additional tag "insert_tag", which when reached should insert the text file at that point in the stream and then continue the parsing:

Here is an example:

my.xml:

<xml> Something <insert_file name="foo.html"/> or another </xml>

I'm using xmlreader as follows:

 class HtmlHandler(xml.sax.handler.ContentHandler):

    def __init__(self):
        xml.sax.handler.ContentHandler.__init__(self)

 parser = xml.sax.make_parser()
 parser.setContentHandle(HtmlHandler())

 parser.parse(StringIO(html))

The question is how do I insert the included contents directly into the parsing stream? Of course I could recursively build up the non-interpolated text by repeatedly inserting included text, but that means that I have to parse the xml multiple times.

I tried replacing StringIO(html) with my own stream that allows inserting contents mid stream, but it doesn't work because the sax parser reads the stream buffered.

Update:

I did find a solution that is hackish at the best. It is built on the following stream class:

class InsertReader():
    """A reader class that supports the concept of pushing another
    reader in the middle of the use of a first reader. This may
    be used for supporting insertion commands."""
    def __init__(self):
        self.reader_stack = []

    def push(self,reader):
        self.reader_stack += [reader]

    def pop(self):
        self.reader_stack.pop()

    def __iter__(self):
        return self

    def read(self,n=-1):
        """Read from the top most stack element. Never trancends elements.
        Should it?

        The code below is a hack. It feeds only a single token back to
        the reader.
        """
        while len(self.reader_stack)>0:
            # Return a single token
            ret_text = StringIO()
            state = 0
            while 1:
                c = self.reader_stack[-1].read(1)
                if c=='':
                    break

                ret_text.write(c)
                if c=='>':
                    break

            ret_text = ret_text.getvalue()
            if ret_text == '':
                self.reader_stack.pop()
                continue
            return ret_text
        return ''

    def next(self):
        while len(self.reader_stack)>0:
            try:
                v = self.reader_stack[-1].next()
            except StopIteration:
                self.reader_stack.pop()
                continue
            return v
        raise StopIteration

This class creates a stream structure that restricts the amount of characters that are returned to the user of the stream. I.e. even if the xml parser does read(16386) the class will only return bytes up to the next '>' character. Since the '>' character also signifies the end of tags, we have the opportunity to inject our recursive include into the stream at this point.

What is hackish about this solution is the following:

  • Reading one character at a time from a stream is slow.
  • This has an implicit assumption about how the sax stream class is reading text.

This solves the problem for me, but I'm still interested in a more beautiful solution.

+1  A: 

Have you considered using xinclude? The lxml library has builtin support for it.

Steven
Thanks, I'll check it out. I still have a lot to learn about xml. I have two use cases though. One is the include file that I described above. The second is the the definition and the use of "macros". Can the latter be supported by xinclude as well?
dov
I'm not sure what exactly you have in mind for the "macros", so I'm not sure whether it is something that can be easily supported by xinclude. Note that xinclude doesn't require that the included content is literally a "file". It might be content that is dynamically generated by a webserver, but you could also use a "resolver" (see lxml docs) to provide the content when it is requested during xinclude processing. Whether or not this will do for your macros, I can't tell.
Steven