I have a large (~50 MB) file containing poorly formatted XML that describes documents and their properties between <item> </item> tags, and I want to extract the text from all English documents.
Python's standard XML parsing utilities (dom, sax, expat) choke on the bad formatting, and more forgiving libraries (sgmllib, BeautifulSoup) parse the entire file and take too long. A typical item looks like this:
<item>
<title>some title</title>
<author>john doe</author>
<lang>en</lang>
<document> .... </document>
</item>
Does anyone know a way to extract the text between <document> </document> only if lang=en, without parsing the entire document?
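To make the goal concrete, here is a rough sketch of the kind of extraction I have in mind, written with plain regular expressions rather than a real parser. It assumes things I have not verified for the whole file: that <item> blocks are never nested, that each item has at most one <lang> and one <document> tag, and that the tags carry no attributes. The names iter_documents, ITEM_RE, LANG_RE and DOC_RE are made up for this illustration.

```python
import re

# Non-greedy patterns; DOTALL lets "." match newlines inside a block.
ITEM_RE = re.compile(r"<item>(.*?)</item>", re.DOTALL)
LANG_RE = re.compile(r"<lang>\s*(.*?)\s*</lang>", re.DOTALL)
DOC_RE = re.compile(r"<document>(.*?)</document>", re.DOTALL)


def iter_documents(text, lang="en"):
    """Yield the <document> text of every <item> whose <lang> equals `lang`.

    Assumes items are not nested and tags have no attributes.
    """
    for item in ITEM_RE.finditer(text):
        body = item.group(1)
        lang_match = LANG_RE.search(body)
        if lang_match is None or lang_match.group(1) != lang:
            continue  # skip items in another language or with no <lang>
        doc_match = DOC_RE.search(body)
        if doc_match is not None:
            yield doc_match.group(1).strip()


sample = """
<item>
<title>some title</title>
<lang>en</lang>
<dc:link></dc:link>
<document> hello </document>
</item>
<item>
<lang>fr</lang>
<document> bonjour </document>
</item>
"""

english_docs = list(iter_documents(sample))
print(english_docs)
```

Because nothing here is treated as XML, the undeclared dc: prefix is simply ignored. The sketch would break on nested or attribute-bearing tags, so it is only a starting point.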
Additional information: Why it's "poorly formatted"
Some of the documents contain a <dc:link></dc:link> element whose dc prefix is not bound to any namespace, which causes problems with the parsers. Python's xml.dom.minidom complains:
ExpatError: unbound prefix: line 13, column 0
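The error can be reproduced in isolation with a small snippet along these lines. The input string is a made-up stand-in for one item from my file, not the real data, so the line and column reported will differ from the message above.

```python
from xml.dom import minidom
from xml.parsers.expat import ExpatError

# Hypothetical stand-in for one item: the "dc" prefix is used
# but no xmlns:dc declaration binds it to a namespace.
snippet = "<item><dc:link>x</dc:link></item>"

message = None
try:
    minidom.parseString(snippet)
except ExpatError as err:
    # expat rejects the undeclared prefix before any DOM is built
    message = str(err)

print(message)
```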