Python: parsing XML document while preserving entities

+1 Q:

Which Python 2.x libraries can parse an XML document with a built-in (internal) DTD without automatically expanding its entities? (The file in question, for those curious: JMdict.)

It seems lxml has an option for not resolving entities, but the last time I tried it, the entities just ended up being converted to blanks. I also just googled this and found pxdom as another alternative that I may try, but since it's pure Python, I expect it to be far slower than I'd like.

Anything else out there?

+1 A:

It seems that my use case is rather unusual: not expanding entities goes against the way parsers are generally supposed to work according to the XML spec.

So I think it's easiest to just kludge this. I've manually extracted the tags via re.finditer and built a dictionary of the mappings. From here, it's just a matter of scanning the parsed output and doing the right thing for my app. Good enough for my use case, I think.
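A minimal sketch of what such a kludge might look like, not necessarily the code actually used. It assumes the entities are declared in the internal DTD as simple double-quoted `<!ENTITY name "value">` lines and that an element's text consists of exactly one expanded entity. The sample document and the helper names `entity_mappings` and `restore_entities` are made up for illustration.

```python
import re
import xml.etree.ElementTree as ET

# Hypothetical miniature of a JMdict-style file: an internal DTD declares
# entities, and the elements below reference them.
SAMPLE = """<!DOCTYPE JMdict [
<!ELEMENT JMdict (entry*)>
<!ELEMENT entry (pos+)>
<!ELEMENT pos (#PCDATA)>
<!ENTITY n "noun (common)">
<!ENTITY adj-i "adjective (keiyoushi)">
]>
<JMdict>
<entry><pos>&n;</pos></entry>
<entry><pos>&adj-i;</pos><pos>&n;</pos></entry>
</JMdict>
"""

# Assumption: every declaration is a plain double-quoted internal entity.
ENTITY_DECL = re.compile(r'<!ENTITY\s+(\S+)\s+"([^"]*)"\s*>')


def entity_mappings(xml_text):
    """Return {expanded text: entity name} built from the DTD declarations."""
    return dict((m.group(2), m.group(1)) for m in ENTITY_DECL.finditer(xml_text))


def restore_entities(root, expansion_to_name):
    """Replace text that exactly matches an expansion with the literal '&name;'."""
    for elem in root.iter():
        name = expansion_to_name.get(elem.text)
        if name is not None:
            elem.text = "&%s;" % name
    return root


# The parser expands the entities as usual; the scan afterwards maps them back.
root = restore_entities(ET.fromstring(SAMPLE), entity_mappings(SAMPLE))
codes = [pos.text for pos in root.iter("pos")]
```

The reverse lookup is lossy if two entities share the same expansion, and the restored `&name;` strings are ordinary text, so serializing the tree again would escape the ampersand.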

Vultaire 2010-08-19 15:17:42

+1 A:

For one, BeautifulStoneSoup from BeautifulSoup won't expand entities by default.

It probably won't be fast or efficient for your use case, though, since it is geared towards a different kind of usage (handling all sorts of ill-formed and broken markup).

Jukka Matilainen 2010-08-19 22:21:10

It's good to know, though. Thanks.

Vultaire 2010-08-22 05:41:55
