views: 47
answers: 3
I'm trying to remove everything in an XML document between 2 tags, using Python & lxml. The problem is that the tags can be in different branches of the tree (but always at the same depth). An example document might look like this:

<root>
    <p> Hello world <start />this is a paragraph </p>
    <p> Goodbye world. <end />I'm leaving now </p>
</root>

I'd like to remove everything between the start and end tags, which would result in a single p tag:

<root>
    <p> Hello world I'm leaving now </p>
</root>

Does anyone have any idea how this might be accomplished using lxml & Python?

+1  A: 

You've got a mess on your hands and should slap the person who wrote an intentional perversion of the XML nesting rule.

You are probably best off using something like SAX to recognize the <start/> tag and begin discarding input until you hit an <end/>. SAX has the advantage over lxml here because it allows you to take arbitrary actions per lexeme, while lxml will have already divorced start and end before you get to touch them.
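For illustration only, here is a minimal sketch of that idea using the standard library's xml.sax rather than lxml; the 'start'/'end' tag names come from the question's example, and the class and variable names are made up for this sketch:

import io
from xml.sax import parseString
from xml.sax.handler import ContentHandler
from xml.sax.saxutils import XMLGenerator

class SkipBetweenMarkers(ContentHandler):
    # copies SAX events to an XMLGenerator, dropping everything
    # between <start/> and <end/> (the markers themselves included)
    def __init__(self, out):
        super().__init__()
        self.out = XMLGenerator(out, encoding='utf-8')
        self.skip = False

    def startElement(self, name, attrs):
        if name == 'start':
            self.skip = True
        if not self.skip:
            self.out.startElement(name, attrs)

    def characters(self, content):
        if not self.skip:
            self.out.characters(content)

    def endElement(self, name):
        if not self.skip:
            self.out.endElement(name)
        if name == 'end':
            self.skip = False

buf = io.StringIO()
parseString(b'<root><p>Hello <start/>gone</p><p>bye <end/>kept</p></root>',
            SkipBetweenMarkers(buf))
print(buf.getvalue())  # <root><p>Hello kept</p></root>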

While you're at it, you might want to convert those documents to usable XML.

msw
Oh, how I wish I could do something about it. This is an ODT file; they use these tags for "tracking changes". Unfortunately I'm doing a lot of other manipulation on the file using etree, so I'm not sure if I can switch over to SAX :( Glad to know it will handle it, though; I may need to look into it.
+1  A: 

I know there are some people who'll want to stone me for this, but you could just use regex:

import re
# note: flags must be passed by keyword; as a fourth positional argument
# it would be interpreted as re.sub's count parameter
new_string = re.sub(r'<start />(.*?)<end />', '', your_string, flags=re.S)

You can't use an XML parser when it's not valid XML.
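For instance, run against the sample document from the question (variable names are just illustrative), something like this should do it:

import re

your_string = '''<root>
    <p> Hello world <start />this is a paragraph </p>
    <p> Goodbye world. <end />I'm leaving now </p>
</root>'''

new_string = re.sub(r'<start />(.*?)<end />', '', your_string, flags=re.S)
print(new_string)
# <root>
#     <p> Hello world I'm leaving now </p>
# </root>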

NullUserException
I tend to agree. While this *could* be less efficient in some circumstances, it probably wouldn't pose that much of a performance issue.
William
The XML is perfectly valid. Note that start and end are complete self-closing tags. I have thought about the regex route, but the document is huge and there are many occurrences of this that I need to remove.
@user61 You are right, it is *valid* XML. I don't know a better word. It's not "proper" XML, maybe? Anyway, if you can read it in slurp mode, then you'll probably be fine regardless of size.
NullUserException
@NullUserException It looks like that's the route I'm going to have to take, but in the same script I'm doing a lot of other manipulation of the document, and I'm not sure how it will lend itself to being done in "slurp" mode, which I've never used before.
@user61 It just means reading the whole file at once.
NullUserException
@NullUserException Sorry, I thought you were referring to using SAX or something like that. The only problem I have with the regex method is that I have a list of operations (insertions and deletions) that need to be applied to the document in the correct order, so I'd have to dump the entire document to a string and re-parse it in its entirety for every single deletion, of which there could be hundreds in a single document.
A: 

You could try using the SAX-like target parser interface:

from lxml import etree

class SkipStartEndTarget:
    def __init__(self, *args, **kwargs):
        self.builder = etree.TreeBuilder()
        self.skip = False

    def start(self, tag, attrib, nsmap=None):
        if tag == 'start':
            self.skip = True
        if not self.skip:
            self.builder.start(tag, attrib, nsmap)

    def data(self, data):
        if not self.skip:
            self.builder.data(data)

    def comment(self, comment):
        if not self.skip:
            self.builder.comment(comment)

    def pi(self, target, data):
        if not self.skip:
            self.builder.pi(target, data)

    def end(self, tag):
        if not self.skip:
            self.builder.end(tag)
        if tag == 'end':
            self.skip = False

    def close(self):
        self.skip = False
        return self.builder.close()

You can then use the SkipStartEndTarget class to make a parser target, and create a custom XMLParser with that target, like this:

parser = etree.XMLParser(target=SkipStartEndTarget())

You can still provide other parser options if you need them. Then you pass this parser to whichever parsing function you are using, for example:

elem = etree.fromstring(xml_str, parser=parser)
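As a quick sanity check, here is what that might look like end-to-end, reusing the example document from the question and the SkipStartEndTarget class above (a sketch; the expected output is shown in the comment):

xml_str = '''<root>
    <p> Hello world <start />this is a paragraph </p>
    <p> Goodbye world. <end />I'm leaving now </p>
</root>'''

# use a fresh target instance per parse, to be on the safe side
parser = etree.XMLParser(target=SkipStartEndTarget())
elem = etree.fromstring(xml_str, parser=parser)
print(etree.tostring(elem).decode())
# <root>
#     <p> Hello world I'm leaving now </p>
# </root>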

This also works with etree.XML() and etree.parse(), and you can even set the parser as the default parser with etree.set_default_parser() (which is probably not a good idea). One thing that might trip you up: even with etree.parse(), this will not return an ElementTree, but always an Element (as etree.XML() and etree.fromstring() do). I don't think this can be done (yet), so if this is an issue for you, you will have to work around it somehow.
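If you do need an ElementTree (to call .write(), for example), one possible workaround is simply to wrap the returned element yourself:

tree = etree.ElementTree(elem)
tree.write('output.xml')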

Note that it is also possible to create an ElementTree from SAX events with lxml.sax, which is probably somewhat more difficult and slower. Contrary to the above example, it will return an ElementTree, but I think it doesn't provide the .docinfo you would get when using etree.parse() normally. I also believe it (currently) doesn't support comments and processing instructions. (I haven't used it yet, so I can't be more precise at the moment.)

Also note that any SAX-like approach to parsing the document requires that skipping everything between <start/> and <end/> still results in a well-formed document. That is the case in your example, but it would not be if the second <p> were a <p2>, for example, as you'd end up with <p>....</p2>.

Steven