You could try using the SAX-like target parser interface:
from lxml import etree

class SkipStartEndTarget:
    """Parser target that drops everything from <start/> through <end/>."""

    def __init__(self, *args, **kwargs):
        self.builder = etree.TreeBuilder()
        self.skip = False

    def start(self, tag, attrib, nsmap=None):
        # Start skipping as soon as <start> opens.
        if tag == 'start':
            self.skip = True
        if not self.skip:
            self.builder.start(tag, attrib, nsmap)

    def data(self, data):
        if not self.skip:
            self.builder.data(data)

    def comment(self, comment):
        if not self.skip:
            # Pass the comment text on, not the target object itself.
            self.builder.comment(comment)

    def pi(self, target, data):
        if not self.skip:
            self.builder.pi(target, data)

    def end(self, tag):
        if not self.skip:
            self.builder.end(tag)
        # Resume building only after <end> has closed.
        if tag == 'end':
            self.skip = False

    def close(self):
        self.skip = False
        return self.builder.close()
You can then use the SkipStartEndTarget class as a parser target and create a custom XMLParser with that target, like this:
parser = etree.XMLParser(target=SkipStartEndTarget())
You can still pass other parser options to XMLParser if you need them. Then pass this parser to whichever parsing function you are using, for example:
elem = etree.fromstring(xml_str, parser=parser)
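To make the behaviour concrete, here is a minimal end-to-end sketch. It repeats the target class so that it runs on its own; the sample document in xml_str is made up for illustration and is not taken from the question.

```python
from lxml import etree

class SkipStartEndTarget:
    """Parser target that drops everything from <start/> through <end/>."""

    def __init__(self):
        self.builder = etree.TreeBuilder()
        self.skip = False

    def start(self, tag, attrib, nsmap=None):
        if tag == 'start':
            self.skip = True
        if not self.skip:
            self.builder.start(tag, attrib, nsmap)

    def data(self, data):
        if not self.skip:
            self.builder.data(data)

    def comment(self, comment):
        if not self.skip:
            self.builder.comment(comment)

    def pi(self, target, data):
        if not self.skip:
            self.builder.pi(target, data)

    def end(self, tag):
        if not self.skip:
            self.builder.end(tag)
        if tag == 'end':
            self.skip = False

    def close(self):
        self.skip = False
        return self.builder.close()

# Hypothetical sample input: the <b> element and the text "drop "
# sit between <start/> and <end/> and should disappear.
xml_str = "<doc><p>keep <start/>drop <b>this</b><end/>and this</p></doc>"

parser = etree.XMLParser(target=SkipStartEndTarget())
elem = etree.fromstring(xml_str, parser=parser)

# Serialize the filtered tree to check what survived.
result = etree.tostring(elem, encoding="unicode")
print(result)
```

The text before <start/> and after <end/> ends up joined in the same <p>, because the builder never sees the elements in between.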
This also works with etree.XML() and etree.parse(), and you can even set the parser as the default parser with etree.set_default_parser() (which is probably not a good idea). One thing that might trip you up: even with etree.parse(), this will not return an ElementTree, but always an Element (as etree.XML() and etree.fromstring() do), because the parser hands back whatever the target's close() returns. I don't think this can be changed (yet), so if this is an issue for you, you will have to work around it somehow.
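One possible workaround, sketched under the assumption that you only need an ElementTree object and not the original document's metadata, is to wrap the returned element yourself with etree.ElementTree(). PassThroughTarget below is a hypothetical stand-in for any target whose close() returns the builder's root element.

```python
import io
from lxml import etree

class PassThroughTarget:
    """Hypothetical minimal target: forwards every event to a TreeBuilder."""

    def __init__(self):
        self.builder = etree.TreeBuilder()

    def start(self, tag, attrib):
        self.builder.start(tag, attrib)

    def data(self, data):
        self.builder.data(data)

    def end(self, tag):
        self.builder.end(tag)

    def close(self):
        return self.builder.close()

parser = etree.XMLParser(target=PassThroughTarget())

# With a target parser, parse() returns the result of close(),
# which here is the root element rather than an ElementTree.
root = etree.parse(io.BytesIO(b"<doc><p>text</p></doc>"), parser)

# Wrap the element so that tree-level methods such as write() are available.
tree = etree.ElementTree(root)
```

Since the tree is rebuilt from events, tree.docinfo will most likely not reflect the DOCTYPE or encoding of the original input.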
Note that it is also possible to create an ElementTree from SAX events with lxml.sax, which is probably somewhat more difficult and slower. Contrary to the above example, it will return an ElementTree, but I think it doesn't provide the .docinfo you would get when using etree.parse() normally. I also believe it (currently) doesn't support comments and PIs. (I haven't used it yet, so I can't be more precise at the moment.)
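For comparison, here is a rough sketch of what the lxml.sax route could look like. It assumes the events come from the standard library's xml.sax reader (with namespace processing switched on) and are fed into a subclass of lxml.sax.ElementTreeContentHandler. SkipStartEndHandler and the sample document are hypothetical names made up for this illustration.

```python
import io
import xml.sax
from xml.sax.handler import feature_namespaces
from lxml import etree
from lxml.sax import ElementTreeContentHandler

class SkipStartEndHandler(ElementTreeContentHandler):
    """Hypothetical handler that ignores events from <start/> through <end/>."""

    def __init__(self):
        super().__init__()
        self.skip = False

    def startElementNS(self, ns_name, qname, attributes=None):
        # ns_name is a (namespace URI, local name) tuple.
        if ns_name[1] == 'start':
            self.skip = True
        if not self.skip:
            super().startElementNS(ns_name, qname, attributes)

    def endElementNS(self, ns_name, qname):
        if not self.skip:
            super().endElementNS(ns_name, qname)
        if ns_name[1] == 'end':
            self.skip = False

    def characters(self, data):
        if not self.skip:
            super().characters(data)

reader = xml.sax.make_parser()
# Ask the reader to call the *NS callbacks overridden above.
reader.setFeature(feature_namespaces, True)

handler = SkipStartEndHandler()
reader.setContentHandler(handler)
reader.parse(io.BytesIO(b"<doc><p>keep <start/>drop<end/> this</p></doc>"))

# The handler exposes the finished document as an ElementTree.
sax_tree = handler.etree
```

Comments are not delivered through the ContentHandler callbacks used here, so they would not reach the tree with this approach.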
Also note that any SAX-like approach to parsing the document requires that skipping everything between <start/> and <end/> still results in a well-formed document. That is the case in your example, but it would not be the case if the second <p> was a <p2>, for example, as you'd end up with <p>....</p2>.