Decoding an invalidly encoded UTF-8 HTML page gives different results in Python, Firefox and Chrome.
The invalidly encoded fragment from the test page looks like 'PREFIX\xe3\xabSUFFIX':
>>> fragment = 'PREFIX\xe3\xabSUFFIX'
>>> fragment.decode('utf-8', 'strict')
...
UnicodeDecodeError: 'utf8' codec can't decode bytes in position 6-8: invalid data
UPDATE: This question resulted in a bug report against Python's Unicode component.
What follows are the replacement policies used to handle decoding errors in Python, Firefox and Chrome. Note how they differ, and especially how the Python builtin removes the valid S (plus the invalid sequence of bytes).
Python
The builtin replace error handler replaces the invalid \xe3\xab plus the S from SUFFIX by U+FFFD:
>>> fragment.decode('utf-8', 'replace')
u'PREFIX\ufffdUFFIX'
>>> print _
PREFIX�UFFIX
Browsers
To test how browsers decode the invalid sequence of bytes, I will use a CGI script:
#!/usr/bin/env python
print """\
Content-Type: text/plain; charset=utf-8

PREFIX\xe3\xabSUFFIX"""
Firefox and Chrome browsers rendered:
PREFIX�SUFFIX
Why is the builtin replace error handler for str.decode removing the S from SUFFIX? (Was UPDATE 1)
According to the Wikipedia article on UTF-8 (thanks mjv), the following ranges of bytes are used to indicate the start of a sequence of bytes:
- 0xC2-0xDF : Start of 2-byte sequence
- 0xE0-0xEF : Start of 3-byte sequence
- 0xF0-0xF4 : Start of 4-byte sequence
The test fragment 'PREFIX\xe3\xabSUFFIX' contains 0xE3, which tells the Python decoder that a 3-byte sequence follows. The sequence is found to be invalid, and the Python decoder ignores the whole sequence, including '\xabS', and continues after it, ignoring any possibly correct sequence starting in the middle.
This means that for an invalidly encoded sequence like '\xF0SUFFIX', it will decode u'\ufffdFIX' instead of u'\ufffdSUFFIX'.
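To make the described behavior concrete, here is a minimal model of it. This is only a sketch of the policy as I understand it, not the actual CPython implementation: the names expected_length and greedy_replace_decode are made up for illustration, and the code is written for Python 3 (an assumption; the sessions in this question are Python 2), where byte strings are written b'...' and indexing them yields integers.

```python
def expected_length(lead):
    """Sequence length announced by a lead byte (ranges from the list above)."""
    if 0xC2 <= lead <= 0xDF:
        return 2
    if 0xE0 <= lead <= 0xEF:
        return 3
    if 0xF0 <= lead <= 0xF4:
        return 4
    return 1  # ASCII, stray continuation byte, or invalid lead byte


def greedy_replace_decode(data):
    """Hypothetical model: on error, skip as many bytes as the lead byte announced."""
    out = []
    i = 0
    while i < len(data):
        n = expected_length(data[i])
        chunk = data[i:i + n]
        try:
            out.append(chunk.decode('utf-8', 'strict'))
        except UnicodeDecodeError:
            # The whole announced sequence is dropped, valid bytes included.
            out.append('\ufffd')
        i += n
    return ''.join(out)


print(ascii(greedy_replace_decode(b'PREFIX\xe3\xabSUFFIX')))
print(ascii(greedy_replace_decode(b'\xf0SUFFIX')))
```

The model swallows the S after \xe3\xab and the SUF after \xf0, which is the behavior reported in this question.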
Example 1: Introducing DOM parsing bugs
>>> '<div>\xf0<div>Price: $20</div>...</div>'.decode('utf-8', 'replace')
u'<div>\ufffdv>Price: $20</div>...</div>'
>>> print _
<div>�v>Price: $20</div>...</div>
Example 2: Security issues (Also see Unicode security considerations):
>>> '\xf0<!-- <script>alert("hi!");</script> -->'.decode('utf-8', 'replace')
u'\ufffd- <script>alert("hi!");</script> -->'
>>> print _
�- <script>alert("hi!");</script> -->
Example 3: Removing valid information in a scraping application
>>> '\xf0' + u'it\u2019s'.encode('utf-8') # "it’s"
'\xf0it\xe2\x80\x99s'
>>> _.decode('utf-8', 'replace')
u'\ufffd\ufffd\ufffds'
>>> print _
���s
Using a cgi script to render this in browsers:
#!/usr/bin/env python
print """\
Content-Type: text/plain; charset=utf-8

\xf0it\xe2\x80\x99s"""
Rendered:
�it’s
Is there an officially recommended way of handling decoding replacements? (Was UPDATE 2)
In a public review, the Unicode Technical Committee has opted for option 2 of the following candidates:
1. Replace the entire ill-formed subsequence by a single U+FFFD.
2. Replace each maximal subpart of the ill-formed subsequence by a single U+FFFD.
3. Replace each code unit of the ill-formed subsequence by a single U+FFFD.
The UTC resolution is dated 2008-08-29, source: http://www.unicode.org/review/resolved-pri-100.html
UTC Public Review 121 also includes an invalid bytestream as an example, '\x61\xF1\x80\x80\xE1\x80\xC2\x62', and shows the decoding results for each option.
Option  61      F1 80 80              E1 80          C2      62
1       U+0061  U+FFFD                                       U+0062
2       U+0061  U+FFFD                U+FFFD         U+FFFD  U+0062
3       U+0061  U+FFFD U+FFFD U+FFFD  U+FFFD U+FFFD  U+FFFD  U+0062
In plain Python the three results are:
1. u'a\ufffdb', which shows as a�b
2. u'a\ufffd\ufffd\ufffdb', which shows as a���b
3. u'a\ufffd\ufffd\ufffd\ufffd\ufffd\ufffdb', which shows as a������b
And here is what python does for the invalid example bytestream:
>>> '\x61\xF1\x80\x80\xE1\x80\xC2\x62'.decode('utf-8', 'replace')
u'a\ufffd\ufffd\ufffd'
>>> print _
a���
Again, using a CGI script to test how browsers render the badly encoded bytes:
#!/usr/bin/env python
print """\
Content-Type: text/plain; charset=utf-8

\x61\xF1\x80\x80\xE1\x80\xC2\x62"""
Both Chrome and Firefox rendered:
a���b
Note that the result rendered by the browsers matches option 2 of the PR121 recommendation.
While option 3 looks easily implementable in Python, options 2 and 1 are a challenge.
>>> import codecs
>>> replace_option3 = lambda exc: (u'\ufffd', exc.start+1)
>>> codecs.register_error('replace_option3', replace_option3)
>>> '\x61\xF1\x80\x80\xE1\x80\xC2\x62'.decode('utf-8', 'replace_option3')
u'a\ufffd\ufffd\ufffd\ufffd\ufffd\ufffdb'
>>> print _
a������b
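For comparison, options 2 and 1 can be approximated by stepping outside the codecs error-handler machinery and walking the bytes by hand. The following is a rough sketch under several assumptions: it is written for Python 3 (not the Python 2 used above), the names trail_ranges and decode_pri121 are hypothetical, and the allowed byte ranges are taken from the well-formed UTF-8 byte sequence table in the Unicode Standard. A maximal subpart is treated as the longest prefix of a well-formed sequence starting at the current position, or a single byte if no such prefix exists.

```python
CONT = (0x80, 0xBF)  # generic continuation byte range


def trail_ranges(lead):
    """Allowed ranges for the bytes following `lead`, or None if it cannot start a sequence."""
    if 0xC2 <= lead <= 0xDF:
        return [CONT]
    if lead == 0xE0:
        return [(0xA0, 0xBF), CONT]        # excludes overlong forms
    if lead == 0xED:
        return [(0x80, 0x9F), CONT]        # excludes surrogates
    if 0xE1 <= lead <= 0xEF:
        return [CONT, CONT]
    if lead == 0xF0:
        return [(0x90, 0xBF), CONT, CONT]  # excludes overlong forms
    if lead == 0xF4:
        return [(0x80, 0x8F), CONT, CONT]  # stays at or below U+10FFFF
    if 0xF1 <= lead <= 0xF3:
        return [CONT, CONT, CONT]
    return None


def decode_pri121(data, collapse_runs=False):
    """Option 2 by default; option 1 when adjacent ill-formed subparts are collapsed."""
    out = []
    i = 0
    in_error_run = False
    while i < len(data):
        lead = data[i]
        ranges = [] if lead < 0x80 else trail_ranges(lead)
        complete = ranges is not None
        j = i + 1
        for lo, hi in ranges or []:
            if j < len(data) and lo <= data[j] <= hi:
                j += 1
            else:
                complete = False
                break
        if complete:
            out.append(data[i:j].decode('utf-8'))
            in_error_run = False
        else:
            # data[i:j] is one maximal subpart; emit U+FFFD unless collapsing a run.
            if not (collapse_runs and in_error_run):
                out.append('\ufffd')
            in_error_run = True
        i = j  # resume right after the subpart, so following valid bytes survive
    return ''.join(out)


sample = b'\x61\xF1\x80\x80\xE1\x80\xC2\x62'
print(ascii(decode_pri121(sample)))                      # option 2
print(ascii(decode_pri121(sample, collapse_runs=True)))  # option 1
```

With the PR121 sample this gives three U+FFFD for option 2 and one for option 1, and it keeps the S in 'PREFIX\xe3\xabSUFFIX', matching what the browsers rendered. It is much slower than a builtin codec and is only meant to show the policy.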