Decoding an invalidly encoded UTF-8 HTML page gives different results in Python, Firefox and Chrome.

The invalidly encoded fragment from the test page looks like 'PREFIX\xe3\xabSUFFIX':

>>> fragment = 'PREFIX\xe3\xabSUFFIX'
>>> fragment.decode('utf-8', 'strict')
...
UnicodeDecodeError: 'utf8' codec can't decode bytes in position 6-8: invalid data

UPDATE: This question resulted in a bug report against Python's Unicode component.


What follows are the replacement policies used to handle decoding errors in Python, Firefox and Chrome. Note how they differ, and especially how the Python builtin removes the valid S along with the invalid sequence of bytes.

Python

The builtin replace error handler replaces the invalid \xe3\xab plus the S from SUFFIX with a single U+FFFD:

>>> fragment.decode('utf-8', 'replace')
u'PREFIX\ufffdUFFIX'
>>> print _
PREFIX�UFFIX

Browsers

To test how browsers decode the invalid sequence of bytes, I will use a CGI script:

#!/usr/bin/env python
print """\
Content-Type: text/plain; charset=utf-8

PREFIX\xe3\xabSUFFIX"""

Firefox and Chrome browsers rendered:

PREFIX�SUFFIX

Why is the builtin replace error handler for str.decode removing the S from SUFFIX?

(Was UPDATE 1)

According to the Wikipedia article on UTF-8 (thanks mjv), the following byte ranges indicate the start of a multi-byte sequence:

  • 0xC2-0xDF : Start of 2-byte sequence
  • 0xE0-0xEF : Start of 3-byte sequence
  • 0xF0-0xF4 : Start of 4-byte sequence
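As a rough illustration of the ranges listed above, a lead byte can be mapped to the sequence length it announces. This is only a Python 3 sketch: `expected_length` is a hypothetical helper name, not part of any library, and returning 0 for bytes that cannot start a sequence is a convention chosen here for illustration.

```python
def expected_length(lead_byte):
    """Return the sequence length announced by a UTF-8 lead byte.

    Returns 0 when the byte cannot start a sequence (an assumption of
    this sketch, chosen to keep the example small).
    """
    if lead_byte <= 0x7F:
        return 1  # plain ASCII, a single byte
    if 0xC2 <= lead_byte <= 0xDF:
        return 2  # start of a 2-byte sequence
    if 0xE0 <= lead_byte <= 0xEF:
        return 3  # start of a 3-byte sequence
    if 0xF0 <= lead_byte <= 0xF4:
        return 4  # start of a 4-byte sequence
    return 0      # continuation byte (0x80-0xBF) or a byte never valid in UTF-8


print(expected_length(0xE3))  # 3
print(expected_length(0xF0))  # 4
print(expected_length(0xAB))  # 0
```

So the 0xE3 in the test fragment announces three bytes, while the 0xAB that follows could only ever be a continuation byte.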

The 'PREFIX\xe3\xabSUFFIX' test fragment contains 0xE3, which tells the Python decoder that a 3-byte sequence follows. The sequence is found to be invalid, and the Python decoder discards the whole sequence, including '\xabS', then continues after it, ignoring any possibly correct sequence that starts in the middle.

This means that for an invalidly encoded sequence like '\xF0SUFFIX', it will decode to u'\ufffdFIX' instead of u'\ufffdSUFFIX'.
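The behaviour described above can be imitated with a small Python 3 sketch. It is not the actual CPython code: `announced_length` and `greedy_replace_decode` are hypothetical names, and the handling of truncated input (one replacement for the leftover bytes) is an assumption of this sketch. It only illustrates the "trust the lead byte, then skip that many bytes" strategy.

```python
REPLACEMENT = '\ufffd'


def announced_length(lead):
    """Length announced by a lead byte, 0 if it cannot start a sequence."""
    if lead <= 0x7F:
        return 1
    if 0xC2 <= lead <= 0xDF:
        return 2
    if 0xE0 <= lead <= 0xEF:
        return 3
    if 0xF0 <= lead <= 0xF4:
        return 4
    return 0


def greedy_replace_decode(data):
    """Imitate the skipping behaviour described above (NOT conformant)."""
    out = []
    i = 0
    while i < len(data):
        n = announced_length(data[i])
        # Take as many bytes as the lead byte announces (at least one).
        chunk = data[i:i + max(n, 1)]
        try:
            if n == 0 or len(chunk) < n:
                raise ValueError('bad lead byte or truncated sequence')
            out.append(chunk.decode('utf-8'))  # strict decode of the chunk
        except ValueError:  # UnicodeDecodeError is a ValueError subclass
            out.append(REPLACEMENT)
        # Skip the whole chunk, valid or not: this is what eats the "S".
        i += len(chunk)
    return ''.join(out)


print(ascii(greedy_replace_decode(b'PREFIX\xe3\xabSUFFIX')))  # 'PREFIX\ufffdUFFIX'
print(ascii(greedy_replace_decode(b'\xf0SUFFIX')))            # '\ufffdFIX'
```

The key line is the unconditional `i += len(chunk)`: bytes that were never validated as continuation bytes are consumed together with the bad lead byte.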

Example 1: Introducing DOM parsing bugs

>>> '<div>\xf0<div>Price: $20</div>...</div>'.decode('utf-8', 'replace')
u'<div>\ufffdv>Price: $20</div>...</div>'
>>> print _
<div>�v>Price: $20</div>...</div>

Example 2: Security issues (Also see Unicode security considerations):

>>> '\xf0<!-- <script>alert("hi!");</script> -->'.decode('utf-8', 'replace')
u'\ufffd- <script>alert("hi!");</script> -->'
>>> print _
�- <script>alert("hi!");</script> -->

Example 3: Removing valid information in a scraping application

>>> '\xf0' + u'it\u2019s'.encode('utf-8') # "it’s"
'\xf0it\xe2\x80\x99s'
>>> _.decode('utf-8', 'replace')
u'\ufffd\ufffd\ufffds'
>>> print _
���s

Using a cgi script to render this in browsers:

#!/usr/bin/env python
print """\
Content-Type: text/plain; charset=utf-8

\xf0it\xe2\x80\x99s"""

Rendered:

�it’s

Is there any officially recommended way of handling decoding replacements?

(Was UPDATE 2)

In a public review, the Unicode Technical Committee has opted for option 2 of the following candidates:

  1. Replace the entire ill-formed subsequence by a single U+FFFD.
  2. Replace each maximal subpart of the ill-formed subsequence by a single U+FFFD.
  3. Replace each code unit of the ill-formed subsequence by a single U+FFFD.

The UTC resolution was made on 2008-08-29; source: http://www.unicode.org/review/resolved-pri-100.html

UTC Public Review 121 also includes an invalid bytestream as an example, '\x61\xF1\x80\x80\xE1\x80\xC2\x62', and shows the decoding result for each option:

            61      F1      80      80      E1      80      C2      62
      1   U+0061  U+FFFD                                          U+0062
      2   U+0061  U+FFFD                  U+FFFD          U+FFFD  U+0062
      3   U+0061  U+FFFD  U+FFFD  U+FFFD  U+FFFD  U+FFFD  U+FFFD  U+0062

In plain Python the three results are:

  1. u'a\ufffdb' shows as a�b
  2. u'a\ufffd\ufffd\ufffdb' shows as a���b
  3. u'a\ufffd\ufffd\ufffd\ufffd\ufffd\ufffdb' shows as a������b

And here is what Python does with the invalid example bytestream:

>>> '\x61\xF1\x80\x80\xE1\x80\xC2\x62'.decode('utf-8', 'replace')
u'a\ufffd\ufffd\ufffd'
>>> print _
a���

Again, using a cgi script to test how browsers render the buggy encoded bytes:

#!/usr/bin/env python
print """\
Content-Type: text/plain; charset=utf-8

\x61\xF1\x80\x80\xE1\x80\xC2\x62"""

Both Chrome and Firefox rendered:

a���b

Note that the result rendered by the browsers matches option 2 of the PR121 recommendation.

While option 3 looks easy to implement in Python, options 2 and 1 are a challenge.

>>> import codecs
>>> replace_option3 = lambda exc: (u'\ufffd', exc.start+1)
>>> codecs.register_error('replace_option3', replace_option3)
>>> '\x61\xF1\x80\x80\xE1\x80\xC2\x62'.decode('utf-8', 'replace_option3')
u'a\ufffd\ufffd\ufffd\ufffd\ufffd\ufffdb'
>>> print _
a������b
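For readers on Python 3, the same per-byte policy (option 3) might look like the sketch below. The handler name `replace_each_byte` is made up for this example; `codecs.register_error` and the `start` attribute of the exception are the standard mechanism. The expected output assumes that the decoder resumes exactly at the position returned by the handler.

```python
import codecs


def replace_each_byte(exc):
    """Error handler: emit one U+FFFD per offending byte (option 3)."""
    if not isinstance(exc, UnicodeDecodeError):
        raise exc  # this sketch only deals with decoding errors
    # Resume one byte after the start of the error, regardless of how
    # many bytes the decoder reported as ill-formed.
    return ('\ufffd', exc.start + 1)


codecs.register_error('replace_each_byte', replace_each_byte)

data = b'\x61\xF1\x80\x80\xE1\x80\xC2\x62'
result = data.decode('utf-8', 'replace_each_byte')
print(ascii(result))  # 'a\ufffd\ufffd\ufffd\ufffd\ufffd\ufffdb'
```

Because the handler always steps forward by a single byte, every byte of the ill-formed run is reported separately, which gives the six replacement characters of option 3.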
+7  A: 

The 0xE3 byte is one of the possible first bytes indicating a 3-byte character.

Apparently Python's decode logic takes these three bytes and tries to decode them. They turn out not to match an actual code point ("character"), and that is why Python produces a UnicodeDecodeError or, with the replace handler, emits a substitution character.
It appears, however, that in doing so Python's decode logic doesn't adhere to the recommendation of the Unicode Consortium with regard to substitution characters for "ill-formed" UTF-8 sequences.

See UTF-8 article on Wikipedia for background info about UTF-8 encoding.

New (final?) Edit: regarding the Unicode Consortium's recommended practice for replacement characters (PR121)
(BTW, congrats to dangra for digging and digging and hence making the question better.)
Both dangra and I were partially incorrect, each in our own way, regarding the interpretation of this recommendation; my latest insight is that the recommendation does indeed also speak to trying to "re-synchronize".
The key concept is that of the maximal subpart [of an ill-formed sequence].
In view of the (lone) example supplied in the PR121 document, the "maximal subpart" implies not reading in bytes which could not possibly be part of the sequence. For example, the 5th byte in the sequence, 0xE1, could NOT possibly be a "second, third or fourth byte of a sequence" since it isn't in the 0x80-0xBF range, and hence it terminates the ill-formed sequence which started with 0xF1. Then one must try to start a new sequence with the 0xE1, and so on. Similarly, upon hitting the 0x62, which also cannot be interpreted as a second/third/fourth byte, the bad sequence is ended, and the "b" (0x62) is "saved"...
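The re-synchronization idea can be sketched in Python 3 as follows. This is an illustrative reading of the "maximal subpart" rule, not the code of any real decoder: the function names are hypothetical, and the narrower second-byte ranges for the lead bytes 0xE0, 0xED, 0xF0 and 0xF4 are taken from the well-formedness table of the Unicode standard (Table 3-7).

```python
def _sequence_length(lead):
    """Length announced by a lead byte, 0 if it cannot start a sequence."""
    if lead <= 0x7F:
        return 1
    if 0xC2 <= lead <= 0xDF:
        return 2
    if 0xE0 <= lead <= 0xEF:
        return 3
    if 0xF0 <= lead <= 0xF4:
        return 4
    return 0


def _second_byte_range(lead):
    """Allowed range for the byte right after the lead byte."""
    special = {0xE0: (0xA0, 0xBF), 0xED: (0x80, 0x9F),
               0xF0: (0x90, 0xBF), 0xF4: (0x80, 0x8F)}
    return special.get(lead, (0x80, 0xBF))


def decode_maximal_subpart(data):
    """Decode bytes, replacing each maximal ill-formed subpart by U+FFFD."""
    out = []
    i = 0
    while i < len(data):
        lead = data[i]
        need = _sequence_length(lead)
        if need == 0:
            out.append('\ufffd')  # stray continuation or invalid byte
            i += 1
            continue
        j = i + 1
        lo, hi = _second_byte_range(lead)
        # Only consume bytes that can legally continue this sequence.
        while j < i + need and j < len(data) and lo <= data[j] <= hi:
            j += 1
            lo, hi = 0x80, 0xBF  # later bytes use the generic range
        if j == i + need:
            out.append(data[i:j].decode('utf-8'))  # complete, well-formed
        else:
            out.append('\ufffd')  # ill-formed: replace what was consumed
        i = j  # resume at the first byte that did not fit
    return ''.join(out)


print(ascii(decode_maximal_subpart(b'\x61\xF1\x80\x80\xE1\x80\xC2\x62')))
# 'a\ufffd\ufffd\ufffdb'
```

The difference from a "skip the announced length" decoder is the final `i = j`: the byte that broke the sequence is never consumed, so it gets its own chance to start a new sequence or to be decoded as ASCII.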

In this light (and until corrected ;-) ) the Python decoding logic appears to be faulty.

Also see John Machin's answer in this post for more specific quotes of the underlying Unicode standard/recommendations.

mjv
Thanks! Don't you think the example bytestream in PR121 can be used as a test case for the expected result of the best replacement practice? I updated UPDATE-2 with some thoughts on it.
dangra
@dangra: thanks again. See my new edits!
mjv
@mjv: See my update 2 (about half-an-hour earlier) for a quote from the actual part of the standard where it expressly forbids what Python is doing.
John Machin
+4  A: 

In 'PREFIX\xe3\xabSUFFIX', the \xe3 indicates that it and the next two bytes form one Unicode code point. (\xEy does for all y.) However, \xe3\xabS obviously does not refer to a valid code point. Since Python knows it's supposed to take three bytes, it sucks up all three anyhow, since it doesn't know your S is an S and not just some byte representing 0x53 for some other reason.

Mike Graham
A: 

Also, is there any unicode's official recommended way for handling decoding replacements?

No. Unicode considers them an error condition and doesn't consider any fallback options. So none of the behaviours above are ‘right’.

bobince
It seems they are at least considering it: http://unicode.org/review/pr-121.html. Also, the U+FFFD character is named REPLACEMENT CHARACTER in http://www.unicode.org/charts/PDF/UFFF0.pdf, and its description reads _used to replace an incoming character whose value is unknown or unrepresentable in Unicode_
dangra
According to http://www.unicode.org/review/resolved-pri-100.html, the UTC opted for option 2 of http://unicode.org/review/pr-121.html on 2008-08-29
dangra
Thanks, that's an interesting development. Makes my answer wrong now, but I can't delete it without losing these comments! :-) Might be worth filing a bug against Python to get it to match. FWIW, every browser I tried agreed with this interpretation except for IE which went for one-�-per-byte. I couldn't get that to happen with Firefox.
bobince
+6  A: 

You know that your S is valid, with the benefit of both look-ahead and hindsight :-) Suppose there was originally a legal 3-byte UTF-8 sequence there, and the 3rd byte was corrupted in transmission ... with the change that you mention, you'd be complaining that a spurious S had not been replaced. There is no "right" way of doing it without the benefit of error-correcting codes, or a crystal ball, or a tambourine.

Update

As @mjv remarked, the UTC issue is all about how many U+FFFD should be included.

In fact, Python is not using ANY of the UTC's 3 options.

Here is the UTC's sole example:

      61      F1      80      80      E1      80      C2      62
1   U+0061  U+FFFD                                          U+0062
2   U+0061  U+FFFD                  U+FFFD          U+FFFD  U+0062
3   U+0061  U+FFFD  U+FFFD  U+FFFD  U+FFFD  U+FFFD  U+FFFD  U+0062

Here is what Python does:

>>> bad = '\x61\xf1\x80\x80\xe1\x80\xc2\x62cdef'
>>> bad.decode('utf8', 'replace')
u'a\ufffd\ufffd\ufffdcdef'
>>>

Why?

F1 should start a 4-byte sequence, but the E1 is not valid. One bad sequence, one replacement.
Start again at the next byte, the 3rd 80. Bang, another FFFD.
Start again at the C2, which introduces a 2-byte sequence, but C2 62 is invalid, so bang again.

It's interesting that the UTC didn't mention what Python is doing (restarting after the number of bytes indicated by the lead character). Perhaps this is actually forbidden or deprecated somewhere in the Unicode standard. More reading required. Watch this space.

Update 2 Houston, we have a problem.

=== Quoted from Chapter 3 of Unicode 5.2 ===

Constraints on Conversion Processes

The requirement not to interpret any ill-formed code unit subsequences in a string as characters (see conformance clause C10) has important consequences for conversion processes.

Such processes may, for example, interpret UTF-8 code unit sequences as Unicode character sequences. If the converter encounters an ill-formed UTF-8 code unit sequence which starts with a valid first byte, but which does not continue with valid successor bytes (see Table 3-7), it must not consume the successor bytes as part of the ill-formed subsequence whenever those successor bytes themselves constitute part of a well-formed UTF-8 code unit subsequence.

If an implementation of a UTF-8 conversion process stops at the first error encountered, without reporting the end of any ill-formed UTF-8 code unit subsequence, then the requirement makes little practical difference. However, the requirement does introduce a significant constraint if the UTF-8 converter continues past the point of a detected error, perhaps by substituting one or more U+FFFD replacement characters for the uninterpretable, ill-formed UTF-8 code unit subsequence. For example, with the input UTF-8 code unit sequence <C2 41 42>, such a UTF-8 conversion process must not return <U+FFFD> or <U+FFFD, U+0042>, because either of those outputs would be the result of misinterpreting a well-formed subsequence as being part of the ill-formed subsequence. The expected return value for such a process would instead be <U+FFFD, U+0041, U+0042>.

For a UTF-8 conversion process to consume valid successor bytes is not only non-conformant, but also leaves the converter open to security exploits. See Unicode Technical Report #36, “Unicode Security Considerations.”

=== End of quote ===

It then goes on to discuss at length, with examples, the "how many FFFD to emit" issue.

Using their example in the 2nd last quoted paragraph:

>>> bad2 = "\xc2\x41\x42"
>>> bad2.decode('utf8', 'replace')
u'\ufffdB'
# FAIL

Note that this is a problem with both the 'replace' and 'ignore' options of str.decode('utf_8') -- it's all about omitting data, not about how many U+FFFD are emitted; get the data-emitting part right and the U+FFFD issue falls out naturally, as explained in the part that I didn't quote.
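For comparison, the standard's example can be turned into a small self-check. The sketch below is written for Python 3 and assumes a recent CPython 3 interpreter, where the builtin decoder is expected to leave the valid successor bytes alone; older interpreters, such as the ones discussed in this thread, may give different results. The variable names are made up for the example.

```python
# The sequence <C2 41 42> quoted from the Unicode standard.
bad2 = b'\xc2\x41\x42'

replaced = bad2.decode('utf-8', 'replace')
ignored = bad2.decode('utf-8', 'ignore')

# Expected by the standard: <U+FFFD, U+0041, U+0042>, no data omitted.
print(ascii(replaced))  # '\ufffdAB'
print(ascii(ignored))   # 'AB'

# The PR121 example, which should match option 2 (maximal subparts).
pr121 = b'\x61\xF1\x80\x80\xE1\x80\xC2\x62'.decode('utf-8', 'replace')
print(ascii(pr121))     # 'a\ufffd\ufffd\ufffdb'
```

Checking both 'replace' and 'ignore' matters because the data loss does not depend on how many U+FFFD are emitted: with 'ignore', a non-conformant decoder would silently drop the A.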

Update 3 Current versions of Python (including 2.7) have unicodedata.unidata_version as '5.1.0' which may or may not indicate that the Unicode-related code is intended to conform to Unicode 5.1.0. In any case, the wordy prohibition of what Python is doing didn't appear in the Unicode standard until 5.2.0. I'll raise an issue on the Python tracker without mentioning the word 'oht'.encode('rot13').

Reported here

John Machin
+1 I did the same research a few minutes ago, but I am still convinced that the example the UTC gives in PR121 is valid for replacement practice and that the Python decoder is not following it (it skips the last 0x62). We know _why_, but is it correct?
dangra
@dangra: It seems Python's action is incorrect; see my update 2.
John Machin
@John Machin: +1 for digging up the quotes out of `Unicode 5.2`
mjv
@John Machin: thanks! I accepted your answer because it has gone to the bone of the problem.
dangra
I mistakenly flagged this question as "community wiki", can it be unflagged?
dangra
@dangra: I believe that happens automatically if you edit your question more than about 8 times.
John Machin
@John Machin: makes sense now
dangra