ansaurus

Question

Alternative XML parser for ElementTree to ease UTF-8 woes?

Answer 1

+1 A:

Byte 0x92 is never valid as the first byte of a UTF-8 character. It can be valid as a subsequent byte, however. See this UTF-8 guide for a table of valid byte sequences.
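As a rough illustration of the point above, the sketch below classifies a byte by the role it can play in UTF-8. It is written for Python 3, and the helper name `utf8_byte_role` is made up for this example, not something from a library.

```python
def utf8_byte_role(byte):
    """Classify a single byte value (0-255) by the role it can play in UTF-8."""
    if byte <= 0x7F:
        return "ascii"          # 0xxxxxxx: a complete one-byte character
    if byte <= 0xBF:
        return "continuation"   # 10xxxxxx: only valid after a lead byte
    if 0xC2 <= byte <= 0xF4:
        return "lead"           # starts a 2-, 3-, or 4-byte sequence
    return "invalid"            # 0xC0, 0xC1 and 0xF5-0xFF never occur in UTF-8

print(utf8_byte_role(0x92))

# 0x92 directly after ASCII text cannot start a character, so decoding fails.
try:
    b"can\x92t".decode("utf-8")
except UnicodeDecodeError as exc:
    print("rejected:", exc.reason)

# The same byte is fine as the last byte of a three-byte sequence (U+2012).
print(b"\xe2\x80\x92".decode("utf-8") == "\u2012")
```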

Could you give us an idea of what bytes are surrounding 0x92? Does the XML declaration include a character encoding?

Jon Skeet 2009-07-16 17:41:49

Answer 2

+2 A:

It looks like you have CP1252 text. If so, the encoding should be declared at the top of the file, e.g.:

<?xml version="1.0" encoding="CP1252" ?>

This does work with ElementTree.

If you're creating these files yourself, don't write them in this encoding. Save them as UTF-8 and do your part to help kill obsolete text encodings.

If you're receiving CP1252 data with no encoding specification, and you know for sure that it's always going to be CP1252, you can just convert it to UTF-8 before sending it to the parser:

s.decode("CP1252").encode("UTF-8")
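The one-liner above is Python 2, where `s` is a byte string. A minimal Python 3 sketch of the same idea follows; the helper name `cp1252_to_utf8` is hypothetical, and the input is assumed to be `bytes` that really are CP1252.

```python
def cp1252_to_utf8(data):
    """Transcode bytes assumed to be CP1252 into UTF-8 bytes."""
    return data.decode("cp1252").encode("utf-8")

raw = b"<root>can\x92t</root>"
print(cp1252_to_utf8(raw))

# Python's cp1252 codec leaves a few byte values undefined (0x81 is one),
# so a strict decode also works as a cheap "is this really CP1252?" check.
try:
    cp1252_to_utf8(b"\x81")
except UnicodeDecodeError:
    print("0x81 is not defined in cp1252")
```

Failing loudly on undefined bytes is deliberate here: if the data turns out not to be CP1252 after all, silent replacement would hide it.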

Glenn Maynard 2009-07-16 18:49:48

Not European, we're definitely in the US. I'm not doing it, I promise :)

Kekoa 2009-07-16 21:37:35

Your question is garbled: you said the text is "canít", which is a small letter I with an acute (u2019). I deal with enough unknown foreign languages on a regular basis that I interpret as written. Please fix the question. The answer is the same; just substitute CP852 for CP1252. By the way, 0x92 in CP1252 is not an apostrophe, it's a right single ‘quote’. I probably shouldn't be amazed that some software is broken enough to get *apostrophes* wrong. (Not your fault--the fault of whatever software outputted that string.)

Glenn Maynard 2009-07-16 23:57:31

@Glenn Maynard: (1) Reproduction of non-ASCII text by an OP is often garbled. What you see is not always what they've got. repr(the_raw_bytes) is their friend and yours. His "apostraphe" was a vital clue. (2) "small letter I with an acute (u2019)": huh? According to the Unicode Standard, U+2019 is RIGHT SINGLE QUOTATION MARK, which when encoded in cp1252 is 0x92. (3) The makers of the allegedly broken software must have been reading the Unicode Standard about U+2019: "this is the preferred character to use for apostrophe". (4) cp852? Its 0x92 -> SMALL LETTER L (ell not I eye) WITH ACUTE.

John Machin 2009-07-17 06:53:01

I have to point out that if the Unicode Standard says that the preferred character for apostrophe is a close quote, the Unicode Standard is wrong. That violates common sense in many obvious ways, and I can guarantee that 0x27 apostrophe will continue to remain the correct representation of an apostrophe.

Glenn Maynard 2009-07-17 08:16:18

Sorry for it being unclear, but the text is really: 63 61 6E 92 74 , regardless of what it looks like in a particular editor.

Kekoa 2009-07-17 15:41:43

I got that, but what I interpreted was that that byte string appeared in editors to you as it did in the post, which is why I ended up at CP852. Anyway, your answer is there--just use s.decode("CP1252").encode("UTF-8"), or add <?xml version="1.0" encoding="CP1252" ?> to the top of the XML file if it makes sense to modify it directly. (You don't want to do that "transparently"--it'll mess up line numbers in errors, etc.)

Glenn Maynard 2009-07-18 00:09:03

@Glenn Maynard: Why you ended up at cp852 is a mystery. The character in the post appears to be U+00ED LATIN SMALL LETTER I (eye) WITH ACUTE. 0x92 in cp852 is U+013A LATIN SMALL LETTER L (ell) WITH ACUTE. Look: ĺí. Other candidates: mac-roman etc. U+00ED (eye), cp125X U+2019 RIGHT SINGLE QUOTATION MARK. Apart from the eye problem, there's an a priori probability problem: Prob(Eastern Europe using DOS encoding for XML) is less than Prob(mac-xxxx encoding), which is much less than Prob(usual suspects (cp125X, especially cp1252)). Then there's a context problem: canĺt canít can’t ... what language has an NLT consonant cluster??
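The competing guesses in this thread can be checked mechanically. The sketch below (Python 3; the helper `describe` is invented for this example) takes the five bytes the asker reported and decodes them under each candidate encoding, naming every non-ASCII character that results.

```python
import unicodedata

# The bytes reported by the asker: 63 61 6E 92 74
raw = bytes.fromhex("63 61 6E 92 74")

def describe(raw_bytes, encoding):
    """Decode raw_bytes and name every non-ASCII character in the result."""
    text = raw_bytes.decode(encoding)
    names = [unicodedata.name(ch) for ch in text if ord(ch) > 0x7F]
    return text, names

for encoding in ("cp1252", "mac_roman", "cp852"):
    text, names = describe(raw, encoding)
    # ascii() shows escapes, so the output does not depend on the terminal.
    print(encoding, ascii(text), names)
```

Printing escapes rather than the characters themselves sidesteps the problem that started the confusion: what an editor or terminal displays is not reliable evidence of what the bytes are.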

John Machin 2009-07-21 03:37:55

Are you just trolling, or do you really think you're saying anything relevant? I've given the correct answer to this person's question. Here, I'll even edit my answer with the trivial correction I pointed out twice already.

Glenn Maynard 2009-07-21 04:37:03

Answer 3

+1 A:

Ah. That is "can´t", obviously, and indeed, 0x92 is an apostrophe in many Windows code pages. Your editor assumes instead that it's a Mac file. ;)

If it's a one-off, fixing the file is the right thing to do. But almost always, when you need to import other people's XML, there are a lot of things that simply do not agree with the stated encoding. I've found that the best solution is to re-encode with the error setting 'xmlcharrefreplace', and in severe cases do your own custom character replacement that fixes the most common problems for that particular customer.
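One possible reading of that approach, sketched in Python 3. In Python, 'xmlcharrefreplace' is an error handler for encoding rather than decoding, so this sketch decodes first and then re-encodes to ASCII. The function name `to_ascii_xml` and the CP1252 fallback are assumptions for illustration, and it presumes any XML declaration in the input is absent or ASCII-compatible.

```python
import xml.etree.ElementTree as ET

def to_ascii_xml(data, declared="utf-8", fallback="cp1252"):
    """Return ASCII-only bytes, with non-ASCII text as numeric references."""
    try:
        text = data.decode(declared)
    except UnicodeDecodeError:
        # Assumption: the fallback is a per-source guess, not a general rule.
        text = data.decode(fallback, errors="replace")
    # Anything outside ASCII becomes a reference such as &#8217;
    return text.encode("ascii", errors="xmlcharrefreplace")

cleaned = to_ascii_xml(b"<root>can\x92t</root>")
print(cleaned)
print(ascii(ET.fromstring(cleaned).text))
```

The result is pure ASCII, so it parses the same way whichever ASCII-compatible encoding the parser assumes.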

I'll also recommend lxml as an XML library for Python, but that's not the problem here.

Lennart Regebro 2009-07-16 18:53:36

Answer 4

+8 A:

I'll start from the question: "Is there an alternative parser that I can use that may be less strict and allow utf-8 characters?"

All XML parsers will accept data encoded in UTF-8. In fact, UTF-8 is the default encoding.

An XML document may start with a declaration like this:

`<?xml version="1.0" encoding="UTF-8"?>`

or like this: <?xml version="1.0"?> or not have a declaration at all ... in each case the parser will decode the document using UTF-8.

However your data is NOT encoded in UTF-8 ... it's probably Windows-1252 aka cp1252.

If the encoding is not UTF-8, then either the creator should include a declaration (or the recipient can prepend one) or the recipient can transcode the data to UTF-8. The following Python 2 session shows what works and what doesn't:

>>> import xml.etree.ElementTree as ET
>>> from StringIO import StringIO as sio

>>> raw_text = '<root>can\x92t</root>' # text encoded in cp1252, no XML declaration

>>> t = ET.parse(sio(raw_text))
[tracebacks omitted]
xml.parsers.expat.ExpatError: not well-formed (invalid token): line 1, column 9
# parser is expecting UTF-8

>>> t = ET.parse(sio('<?xml version="1.0" encoding="UTF-8"?>' + raw_text))
xml.parsers.expat.ExpatError: not well-formed (invalid token): line 1, column 47
# parser is expecting UTF-8 again

>>> t = ET.parse(sio('<?xml version="1.0" encoding="cp1252"?>' + raw_text))
>>> t.getroot().text
u'can\u2019t'
# parser was told to expect cp1252; it works

>>> import unicodedata
>>> unicodedata.name(u'\u2019')
'RIGHT SINGLE QUOTATION MARK'
# not quite an apostrophe, but better than an exception

>>> fixed_text = raw_text.decode('cp1252').encode('utf8')
# alternative: we transcode the data to UTF-8

>>> t = ET.parse(sio(fixed_text))
>>> t.getroot().text
u'can\u2019t'
# UTF-8 is the default; no declaration needed
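The session above is Python 2. A rough Python 3 equivalent might look like the sketch below, where the document is handled as `bytes` and the helper `try_parse` is invented for this example. It assumes the same four cases: no declaration, a UTF-8 declaration, a cp1252 declaration, and transcoding to UTF-8 first.

```python
import io
import xml.etree.ElementTree as ET

raw = b"<root>can\x92t</root>"   # cp1252 bytes, no XML declaration

def try_parse(data):
    """Return the root element's text, or a short note if parsing fails."""
    try:
        return ET.parse(io.BytesIO(data)).getroot().text
    except ET.ParseError as exc:
        return "ParseError: %s" % exc

# No declaration, or a UTF-8 declaration: the parser expects UTF-8 and fails.
print(try_parse(raw))
print(try_parse(b'<?xml version="1.0" encoding="UTF-8"?>' + raw))

# Declaring cp1252 tells the parser how to decode the bytes.
print(ascii(try_parse(b'<?xml version="1.0" encoding="cp1252"?>' + raw)))

# Alternatively, transcode to UTF-8 and parse without any declaration.
print(ascii(try_parse(raw.decode("cp1252").encode("utf-8"))))
```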

John Machin 2009-07-17 04:43:54
