I am parsing some XML with the elementtree.parse() function. It works, except for some UTF-8 characters (single-byte characters above 128). I see that the default parser is XMLTreeBuilder, which is based on expat.

Is there an alternative parser that I can use that may be less strict and allow utf-8 characters?

This is the error I'm getting with the default parser:

ExpatError: not well-formed (invalid token): line 311, column 190

The character causing this is a single byte x92 (in hex). I'm not certain this is even a valid utf-8 character. But it would be nice to handle it because most text editors display this as: í

EDIT: The context of the character is: canít , where I assume it is supposed to be a fancy apostraphe, but in the hex editor, that same sequence is: 63 61 6E 92 74

+1  A: 

Byte 0x92 is never valid as the first byte of a UTF-8 character. It can be valid as a subsequent byte, however. See this UTF-8 guide for a table of valid byte sequences.
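
This is easy to check in Python 3. The following is a minimal sketch (the variable names are only illustrative): a lone 0x92 is rejected by the UTF-8 decoder, while the same byte is accepted as a continuation byte after a suitable lead byte.

```python
# 0x92 has the bit pattern 10xxxxxx, which marks a UTF-8 continuation byte,
# so it can never start a character.
lone = b"can\x92t"

try:
    lone.decode("utf-8")
    lone_is_valid = True
except UnicodeDecodeError:
    lone_is_valid = False

# After the lead byte 0xC2, the byte 0x92 completes a valid two-byte
# sequence that decodes to the code point U+0092.
paired = b"\xc2\x92".decode("utf-8")

print(lone_is_valid)     # False
print(hex(ord(paired)))  # 0x92
```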

Could you give us an idea of what bytes are surrounding 0x92? Does the XML declaration include a character encoding?

Jon Skeet
+2  A: 

It looks like you have CP1252 text. If so, it should be specified at the top of the file, e.g.:

<?xml version="1.0" encoding="CP1252" ?>

This does work with ElementTree.

If you're creating these files yourself, don't write them in this encoding. Save them as UTF-8 and do your part to help kill obsolete text encodings.

If you're receiving CP1252 data with no encoding specification, and you know for sure that it's always going to be CP1252, you can just convert it to UTF-8 before sending it to the parser:

s.decode("CP1252").encode("UTF-8")
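
In Python 3 the same conversion operates on a bytes object. A minimal sketch, assuming the input really is CP1252 throughout and carries no conflicting encoding declaration; the helper name transcode_cp1252_to_utf8 is made up for illustration:

```python
def transcode_cp1252_to_utf8(data: bytes) -> bytes:
    """Re-encode CP1252 bytes as UTF-8 (assumes the input is CP1252)."""
    return data.decode("cp1252").encode("utf-8")


raw = b"<root>can\x92t</root>"
fixed = transcode_cp1252_to_utf8(raw)
print(fixed)  # b'<root>can\xe2\x80\x99t</root>'
```

Python's cp1252 codec raises UnicodeDecodeError on the few byte values that CP1252 leaves undefined, so a failure here is a hint that the data is not CP1252 after all.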
Glenn Maynard
Not European, we're definitely in the US. I'm not doing it, I promise :)
Kekoa
Your question is garbled: you said the text is "canít", which is a small letter I with an acute (u2019). I deal with enough unknown foreign languages on a regular basis that I interpret as written. Please fix the question. The answer is the same; just substitute CP852 for CP1252. By the way, 0x92 in CP1252 is not an apostrophe, it's a right single ‘quote’. I probably shouldn't be amazed that some software is broken enough to get *apostrophes* wrong. (Not your fault--the fault of whatever software outputted that string.)
Glenn Maynard
@Glenn Maynard: (1) Reproduction of non-ASCII text by an OP is often garbled. What you see is not always what they've got. repr(the_raw_bytes) is their friend and yours. His "apostraphe" was a vital clue. (2) "small letter I with an acute (u2019)": huh? According to the Unicode Standard, U+2019 is RIGHT SINGLE QUOTATION MARK, which when encoded in cp1252 is 0x92. (3) The makers of the allegedly broken software must have been reading the Unicode Standard about U+2019: "this is the preferred character to use for apostrophe". (4) cp852? Its 0x92 -> SMALL LETTER L (ell not I eye) WITH ACUTE
John Machin
I have to point out that if the Unicode Standard says that the preferred character for apostrophe is a close quote, the Unicode Standard is wrong. That violates common sense in many obvious ways, and I can guarantee that 0x27 apostrophe will continue to remain the correct representation of an apostrophe.
Glenn Maynard
Sorry for it being unclear, but the text is really: 63 61 6E 92 74 , regardless of what it looks like in a particular editor.
Kekoa
I got that, but what I interpreted was that that byte string appeared in editors to you as it did in the post, which is why I ended up at CP852. Anyway, your answer is there--just use s.decode("CP1252").encode("UTF-8"), or add <?xml version="1.0" encoding="CP1252" ?> to the top of the XML file if it makes sense to modify it directly. (You don't want to do that "transparently"--it'll mess up line numbers in errors, etc.)
Glenn Maynard
@Glenn Maynard: Why you ended up at cp852 is a mystery. The character in the post appears to be U+00ED LATIN SMALL LETTER I (eye) WITH ACUTE. 0x92 in cp852 is U+013A LATIN SMALL LETTER L (ell) WITH ACUTE. Look: ĺí. Other candidates: mac-roman etc U+00ED (eye), cp125X U+2019 RIGHT SINGLE QUOTATION MARK. Apart from the eye problem, there's an a priori probability problem: Prob(Eastern Europe using DOS encoding for XML) less than Prob(mac-xxxx encoding) much less than Prob(usual suspects (cp125X especially cp1252)). Then there's a context problem: canĺt canít can’t ... what language has an NLT consonant cluster??
John Machin
Are you just trolling, or do you really think you're saying anything relevant? I've given the correct answer to this person's question. Here, I'll even edit my answer with the trivial correction I pointed out twice already.
Glenn Maynard
+1  A: 

Ah. That is "can´t", obviously, and indeed, 0x92 is an apostrophe in many Windows code pages. Your editor assumes instead that it's a Mac file. ;)
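
To see how one byte turns into different characters, you can decode 0x92 with each candidate encoding. This is only a sketch; the list of codecs is my own pick of the encodings mentioned in this thread, using Python's codec names for them.

```python
import unicodedata

byte = b"\x92"

# Decode the same byte under several single-byte encodings.
candidates = {
    codec: byte.decode(codec)
    for codec in ("cp1252", "cp1250", "mac_roman", "cp852")
}

for codec, char in candidates.items():
    print(codec, f"U+{ord(char):04X}", unicodedata.name(char))
```

The Windows code pages give U+2019 (the curly apostrophe), Mac Roman gives U+00ED (the í that the editor shows), and cp852 gives U+013A.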

If it's a one-off, fixing the file is the right thing to do. But almost always when you need to import other people's XML, there are a lot of things that simply do not agree with the stated encoding. I've found that the best solution is to decode the data and re-encode it with the error setting 'xmlcharrefreplace', and in severe cases do your own custom character replacement that fixes the most common problems for that particular customer.
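
Here is a rough sketch of that approach in Python 3. Note that in Python the 'xmlcharrefreplace' handler applies when encoding, so the sketch decodes first and then encodes to ASCII. The function name to_ascii_xml, the cp1252 default and the CUSTOM_FIXES table are all made up for illustration; the real replacements depend on the data you get.

```python
import xml.etree.ElementTree as ET

# Hypothetical per-customer fix-ups: map typographic quotes to ASCII.
CUSTOM_FIXES = {0x2018: "'", 0x2019: "'"}


def to_ascii_xml(raw: bytes, source_encoding: str = "cp1252",
                 fix_quotes: bool = False) -> bytes:
    """Decode leniently, then emit ASCII with numeric character references."""
    # Undecodable bytes become U+FFFD instead of raising an exception.
    text = raw.decode(source_encoding, errors="replace")
    if fix_quotes:
        text = text.translate(CUSTOM_FIXES)
    # Every non-ASCII character is written as a reference such as &#8217;.
    return text.encode("ascii", errors="xmlcharrefreplace")


safe = to_ascii_xml(b"<root>can\x92t</root>")
print(safe)                      # b'<root>can&#8217;t</root>'
print(ET.fromstring(safe).text)  # can’t
```

The result is pure ASCII, so the parser no longer has to guess the encoding of the character data.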

I'll also recommend lxml as an XML library in Python, but that's not the problem here.

Lennart Regebro
+8  A: 

I'll start from the question: "Is there an alternative parser that I can use that may be less strict and allow utf-8 characters?"

All XML parsers will accept data encoded in UTF-8. In fact, UTF-8 is the default encoding.

An XML document may start with a declaration like this:

`<?xml version="1.0" encoding="UTF-8"?>`

or like this: <?xml version="1.0"?> or not have a declaration at all ... in each case the parser will decode the document using UTF-8.

However your data is NOT encoded in UTF-8 ... it's probably Windows-1252 aka cp1252.

If the encoding is not UTF-8, then either the creator should include a declaration (or the recipient can prepend one) or the recipient can transcode the data to UTF-8. The following showcases what works and what doesn't:

>>> import xml.etree.ElementTree as ET
>>> from StringIO import StringIO as sio

>>> raw_text = '<root>can\x92t</root>' # text encoded in cp1252, no XML declaration

>>> t = ET.parse(sio(raw_text))
[tracebacks omitted]
xml.parsers.expat.ExpatError: not well-formed (invalid token): line 1, column 9
# parser is expecting UTF-8

>>> t = ET.parse(sio('<?xml version="1.0" encoding="UTF-8"?>' + raw_text))
xml.parsers.expat.ExpatError: not well-formed (invalid token): line 1, column 47
# parser is expecting UTF-8 again

>>> t = ET.parse(sio('<?xml version="1.0" encoding="cp1252"?>' + raw_text))
>>> t.getroot().text
u'can\u2019t'
# parser was told to expect cp1252; it works

>>> import unicodedata
>>> unicodedata.name(u'\u2019')
'RIGHT SINGLE QUOTATION MARK'
# not quite an apostrophe, but better than an exception

>>> fixed_text = raw_text.decode('cp1252').encode('utf8')
# alternative: we transcode the data to UTF-8

>>> t = ET.parse(sio(fixed_text))
>>> t.getroot().text
u'can\u2019t'
# UTF-8 is the default; no declaration needed
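
The session above is Python 2 (StringIO, str.decode on a byte string). A rough Python 3 equivalent, offered as a sketch rather than a tested recipe for every version: it assumes io.BytesIO stands in for StringIO and that the failure surfaces as ET.ParseError. The result variable names are mine.

```python
import io
import xml.etree.ElementTree as ET

raw_text = b"<root>can\x92t</root>"  # cp1252 bytes, no XML declaration

# 1. No declaration: the parser assumes UTF-8 and rejects the lone 0x92.
try:
    ET.parse(io.BytesIO(raw_text))
    parsed_without_declaration = True
except ET.ParseError:
    parsed_without_declaration = False

# 2. Prepend a declaration that names the real encoding.
declared = b'<?xml version="1.0" encoding="cp1252"?>' + raw_text
text_from_declared = ET.parse(io.BytesIO(declared)).getroot().text

# 3. Alternative: transcode to UTF-8, which needs no declaration.
fixed_text = raw_text.decode("cp1252").encode("utf-8")
text_from_fixed = ET.parse(io.BytesIO(fixed_text)).getroot().text

print(parsed_without_declaration)  # False
print(ascii(text_from_declared))   # 'can\u2019t'
print(ascii(text_from_fixed))      # 'can\u2019t'
```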
John Machin