Which existing Python 2.x libraries can parse an XML document with a built-in DTD without automatically expanding the entities? (File in question, for those curious: JMdict.)

It seems lxml has an option for not resolving entities, but the last time I tried it, the entities just ended up being converted to blanks. I also googled this and found pxdom as another alternative, which I may try, but since it's pure Python I expect it to be far slower than I'd like.

Anything else out there?

+1  A: 

It seems that my use case is rather unusual; not expanding entities goes against the way parsers are generally supposed to work according to the XML spec.

So I think it's easiest to just kludge this. I've manually extracted the entity declarations via re.finditer and built a dictionary of the mappings. From here, it's just a matter of scanning the parsed output and doing the right thing for my app. Good enough for my use case, I think.
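For anyone wanting to try the same thing, here is a rough sketch of that kludge. It is not my exact code: the regex, the sample document, and the helper names (extract_entities, pos_codes) are made up for illustration, and it assumes every entity is declared with a double-quoted value in the internal DTD subset and that each expansion is unique.

```python
import re
import xml.etree.ElementTree as ET

# Matches internal entity declarations such as: <!ENTITY n "noun (common)">
# Assumption: values are double-quoted; parameter entities (<!ENTITY % ...>) are skipped.
ENTITY_DECL = re.compile(r'<!ENTITY\s+([^\s%]+)\s+"([^"]*)"\s*>')

# Hypothetical miniature document in the style of JMdict.
SAMPLE = '''<!DOCTYPE JMdict [
<!ENTITY n "noun (common) (futsuumeishi)">
<!ENTITY adj-i "adjective (keiyoushi)">
]>
<JMdict><entry><pos>&n;</pos></entry></JMdict>'''


def extract_entities(xml_text):
    """Return a dict mapping entity name -> expansion text."""
    return dict((m.group(1), m.group(2)) for m in ENTITY_DECL.finditer(xml_text))


def pos_codes(xml_text):
    """Parse normally (entities get expanded), then map each expanded
    <pos> text back to the entity name it came from."""
    by_expansion = dict((v, k) for k, v in extract_entities(xml_text).items())
    root = ET.fromstring(xml_text)
    # Fall back to the raw text when it does not match any known expansion.
    return [by_expansion.get(pos.text, pos.text) for pos in root.iter('pos')]


if __name__ == '__main__':
    print(extract_entities(SAMPLE))
    print(pos_codes(SAMPLE))
```

The parser still expands everything; the dictionary just lets the app recover the short entity name afterwards. If two entities ever shared the same expansion, the reverse lookup would be ambiguous.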

Vultaire
+1  A: 

For one, BeautifulStoneSoup from BeautifulSoup won't expand entities by default.

It probably won't be fast or efficient for your use case though, since it is geared towards a different kind of usage (handling all sorts of ill-formed and broken markup).

Jukka Matilainen
It's good to know though. Thanks.
Vultaire