Hi,

I'm trying to parse large XML files (>3GB) like this:

import lxml.etree

context = lxml.etree.iterparse(path)
for action, el in context:
    # do sth. with el

I thought iterparse would avoid loading the whole document into RAM, but according to this article I'm wrong: http://www.ibm.com/developerworks/xml/library/x-hiperfparse/ (see Listing 4). However, when I apply that solution to my code, elements that have not been fully parsed yet (especially child elements of el) are apparently cleared away.
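For reference, the clearing pattern from that article looks roughly like this (reproduced from memory; process_element is just a placeholder for my own handler):

from lxml import etree

def fast_iter(context, process_element):
    for event, elem in context:
        process_element(elem)
        elem.clear()
        # also drop references kept by already-processed siblings
        while elem.getprevious() is not None:
            del elem.getparent()[0]
    del context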

Is there any other solution to this memory problem?

Thanks in advance!

+2  A: 

Don't forget to call clear(), and optionally also clear the root element, as explained there. As I understand it you are already doing this, but apparently you are trying to access content that you have already cleared or that has not been parsed yet. It would be helpful if you could show something more concrete than "do sth. with el". Are you using getnext() or getprevious()? XPath expressions?

Another option, if you really don't want to build a tree, is to use the target parser interface, which is like SAX for lxml/etree (but easier).
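A minimal sketch of what that can look like (the Collector class and the 'record' tag name are purely illustrative):

from lxml import etree

class Collector(object):
    """Parser target: lxml calls these methods while parsing, so no tree is kept in memory."""
    def __init__(self):
        self.count = 0

    def start(self, tag, attrib):
        pass  # an element was opened

    def end(self, tag):
        if tag == 'record':  # illustrative tag name
            self.count += 1

    def data(self, text):
        pass  # character data inside elements

    def close(self):
        return self.count  # this value is returned by the parse call

parser = etree.XMLParser(target=Collector())
result = etree.parse(path, parser)  # result is whatever close() returned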

Steven
A: 

I solved this by selecting the tag directly when creating the iterparse context:

lxml.etree.iterparse(path, tag=tag)

instead of filtering with an additional if clause inside the loop.
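Roughly like this ('item' is a placeholder for the actual element name; clearing the yielded elements is still worth doing, since the rest of the tree is still built):

from lxml import etree

context = etree.iterparse(path, tag='item')  # only 'item' end events are yielded
for action, el in context:
    # process el here
    el.clear()  # release the element's content once processed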

Thank you very much for your support!

ahojnnes