I'm trying to read an XML file that looks like this:

<?xml version="1.0" encoding="ISO-8859-1"?>
<!DOCTYPE dblp SYSTEM "dblp.dtd">
<dblp>
<incollection>
<author>Jos&eacute; A. Blakeley</author>
</incollection>
</dblp>

The part that causes the problem is

Jos&eacute; A. Blakeley

The parser calls its character data handler twice, once with "Jos" and once with " A. Blakeley". I understand this may be the correct behaviour if it doesn't know the eacute entity. However, the entity is defined in dblp.dtd, which I have, and I can't seem to convince expat to use that file. So far I have

import xml.parsers.expat

p = xml.parsers.expat.ParserCreate()
# tried with and without following line
p.SetParamEntityParsing(xml.parsers.expat.XML_PARAM_ENTITY_PARSING_ALWAYS)
p.UseForeignDTD(True)
f = open(dblp_file, "r")
p.ParseFile(f)

but expat still doesn't recognize my entity. Is there really no way to tell expat which DTD to use? I've tried:

  • putting the file into the same directory as the XML
  • putting the file into the program's working directory
  • replacing the reference in the xml file by an absolute path

What am I missing? Thx.

+1  A: 

As I understand it, if you're using pyexpat directly, then you have to provide your own ExternalEntityRefHandler to fetch the external DTD and feed it to expat.

See e.g. xml.sax.expatreader for example code (the external_entity_ref method, line 374 in Python 2.6).

It would probably be better to use a higher-level interface such as SAX (via expatreader) if you can.

bobince
Thanks for your answer. However, I'm running into problems with both proposed approaches. From the example in external_entity_ref, I still don't see how to include a DTD. Using xml.sax.make_parser, I get: File "...expatreader.py", line 207, in feed self._parser.Parse(data, isFinal) ... UnicodeEncodeError: 'ascii' codec can't encode character u'\xe9' in position 0: ordinal not in range(128). Both problems are obviously related to my lack of knowledge concerning the inner workings of these libraries, but I'm a bit puzzled that such a trivial task requires so much knowledge...
Nicolas78
The `external_entity_ref` method receives the address of an external entity in `sysid` (here, the DTD, `dblp.dtd`). It fetches the data from that source (`prepare_input_source`), creates a new parsing context (`ExternalEntityParserCreate`) and calls back into the parse loop to continue, taking content from the external source instead of the original one. Something like this ugly process is what you'd have to reproduce to make DTDs work in your own pyexpat code.
bobince
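The process described in the comment above can be sketched in plain pyexpat roughly as follows. This is a minimal sketch under stated assumptions, not the code from `xml.sax.expatreader`: it assumes Python 3, `parse_with_dtd` and `load_entity` are hypothetical names, and the one-line `DTD` is only a stand-in for the real `dblp.dtd`.

```python
import xml.parsers.expat as expat

XML = b"""<?xml version="1.0" encoding="ISO-8859-1"?>
<!DOCTYPE dblp SYSTEM "dblp.dtd">
<dblp>
<incollection>
<author>Jos&eacute; A. Blakeley</author>
</incollection>
</dblp>
"""

# Stand-in for the real dblp.dtd (assumption: it declares eacute like this).
DTD = b'<!ENTITY eacute "&#233;">'


def parse_with_dtd(xml_bytes, load_entity):
    """Parse xml_bytes and return all character data as one string.

    load_entity is a hypothetical callback: it receives the system id of
    an external entity (here "dblp.dtd") and returns its content as bytes.
    """
    chunks = []
    parser = expat.ParserCreate()
    # Without this, expat does not ask for the external DTD subset at all.
    parser.SetParamEntityParsing(expat.XML_PARAM_ENTITY_PARSING_ALWAYS)

    def external_entity_ref(context, base, system_id, public_id):
        # Create a child parser for the external entity and feed it the
        # entity's bytes. Sketch only: this assumes the DTD does not pull
        # in further external entities of its own.
        sub = parser.ExternalEntityParserCreate(context)
        sub.Parse(load_entity(system_id), True)
        return 1  # non-zero tells expat the entity was handled

    parser.ExternalEntityRefHandler = external_entity_ref
    parser.CharacterDataHandler = chunks.append
    parser.Parse(xml_bytes, True)
    # The text may arrive in several pieces, so join them.
    return "".join(chunks)
```

With a real file, `load_entity` could be something like `lambda system_id: open(system_id, "rb").read()`, which resolves `dblp.dtd` relative to the working directory.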
As for the SAX approach, how exactly are you feeding the parser? It looks like for some reason it is being fed unicode characters instead of bytes as an XML parser should be. XML has its own bytes-to-characters decode mechanisms built-in so you don't want Python doing that for you. You're not using Python 3 are you? If so, you should make sure to open files as bytes using `rb` instead of `r`. (Actually you should do that on Python 2 as well, but it's not as critical there as it only affects newlines.)
bobince
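To illustrate the point about bytes, here is a small sketch that hands SAX a binary stream and lets the parser decode it according to the XML declaration. It assumes Python 3. `AuthorCollector`, `read_authors` and `SAMPLE` are hypothetical names, and the sample uses a raw ISO-8859-1 byte instead of the entity, so it does not touch the DTD question.

```python
import io
import xml.sax


class AuthorCollector(xml.sax.ContentHandler):
    """Collects the text of every <author> element."""

    def __init__(self):
        super().__init__()
        self.authors = []
        self._buf = None  # list of text chunks while inside <author>

    def startElement(self, name, attrs):
        if name == "author":
            self._buf = []

    def characters(self, content):
        # SAX may deliver one text node in several calls.
        if self._buf is not None:
            self._buf.append(content)

    def endElement(self, name):
        if name == "author":
            self.authors.append("".join(self._buf))
            self._buf = None


def read_authors(byte_stream):
    """Parse a binary stream; the parser does the decoding itself."""
    handler = AuthorCollector()
    xml.sax.parse(byte_stream, handler)
    return handler.authors


# \xe9 is e-acute in ISO-8859-1, the encoding named in the declaration.
SAMPLE = (b'<?xml version="1.0" encoding="ISO-8859-1"?>'
          b'<dblp><incollection><author>Jos\xe9 A. Blakeley</author>'
          b'</incollection></dblp>')
```

For a file on disk the call would look like `with open(dblp_file, "rb") as f: read_authors(f)`.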
wow thanks a lot for these extensive explanations. I'll check both asap and post my final results.
Nicolas78
A: 

btw I can temporarily help myself by copying the relevant parts of the .dtd into the XML file itself, as in

<!DOCTYPE dblp [
    <!ENTITY Agrave  "&#192;" >
]>

but that doesn't really solve the problem in a general way.

Nicolas78
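The workaround above can be tried out with a few lines of pyexpat. This is only a sketch, assuming Python 3: `collect_text` is a hypothetical helper, and the internal subset declares `eacute` (the entity from the question) with its character reference `&#233;`.

```python
import xml.parsers.expat

DOC = b"""<?xml version="1.0" encoding="ISO-8859-1"?>
<!DOCTYPE dblp [
    <!ENTITY eacute "&#233;">
]>
<dblp>
<incollection>
<author>Jos&eacute; A. Blakeley</author>
</incollection>
</dblp>
"""


def collect_text(xml_bytes):
    """Return all character data of the document as a single string."""
    chunks = []
    parser = xml.parsers.expat.ParserCreate()
    # The handler can be called several times per text node, so the
    # pieces are collected and joined at the end.
    parser.CharacterDataHandler = chunks.append
    parser.Parse(xml_bytes, True)
    return "".join(chunks)
```

Because the entity is declared inside the document itself, expat expands it without any external entity handling.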