ansaurus

Question

How to filter (or replace) unicode characters that would take more than 3 bytes in UTF-8?

Answer 1

+1 A:

Unicode characters in the ranges \u0000-\uD7FF and \uE000-\uFFFF have encodings of 3 bytes or fewer in UTF-8. The \uD800-\uDFFF range is reserved for UTF-16 surrogates, which are used in pairs to represent characters above \uFFFF. I do not know Python, but you should be able to set up a regular expression that matches anything outside those ranges.

pattern = re.compile("[\uD800-\uDFFF].", re.UNICODE)
pattern = re.compile("[^\u0000-\uFFFF]", re.UNICODE)

Edit adding Python from Denilson Sá:

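import re

# Match anything that is neither in U+0000-U+D7FF nor in U+E000-U+FFFF
pattern = re.compile(u'[^\u0000-\uD7FF\uE000-\uFFFF]', re.UNICODE)
# Replace every match with U+FFFD REPLACEMENT CHARACTER
filtered_string = pattern.sub(u'\uFFFD', unicode_string)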
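
For readers on Python 3, where every `str` is a sequence of code points regardless of build, the same idea might look like the sketch below. The names `_NON_BMP` and `replace_non_bmp` are made up for illustration and are not part of the original answer.

```python
import re

# Matches anything that is not a BMP scalar value: lone surrogates
# (U+D800-U+DFFF) and every character at U+10000 or above.
_NON_BMP = re.compile('[^\u0000-\uD7FF\uE000-\uFFFF]')

def replace_non_bmp(text, replacement='\uFFFD'):
    """Return text with each non-BMP character swapped for `replacement`."""
    return _NON_BMP.sub(replacement, text)

if __name__ == '__main__':
    sample = 'a\U0001d41fb'
    cleaned = replace_non_bmp(sample)
    # Every remaining character now needs at most 3 bytes in UTF-8.
    print(repr(cleaned), max(len(c.encode('utf-8')) for c in cleaned))
```

U+FFFD itself encodes to 3 bytes in UTF-8, so the result stays within the limit.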

drawnonward 2010-07-10 17:37:48

Answer 2

+1 A:

Encode as UTF-16, then reencode as UTF-8.

>>> import struct
>>> t = u'\U0001d41f\U0001d428\U0001d428'
>>> e = t.encode('utf-16le')
>>> ''.join(unichr(x).encode('utf-8') for x in struct.unpack('<' + 'H' * (len(e) // 2), e))
'\xed\xa0\xb5\xed\xb0\x9f\xed\xa0\xb5\xed\xb0\xa8\xed\xa0\xb5\xed\xb0\xa8'

Note that each code unit must be encoded individually. If you join first and encode afterwards, the codec may combine each surrogate pair back into a single code point and emit a 4-byte sequence.
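
On Python 3 the stock UTF-8 codec refuses surrogates by default, so the trick above needs the `surrogatepass` error handler. The following is only a sketch of that variant; the function name `encode_surrogates_separately` is hypothetical. The output resembles CESU-8 and is not well-formed UTF-8.

```python
import struct

def encode_surrogates_separately(text):
    """Encode text so that no character takes more than 3 bytes.

    Supplementary characters become two 3-byte sequences (one per UTF-16
    surrogate), which strict UTF-8 decoders will reject.
    """
    utf16 = text.encode('utf-16-le')
    # One unsigned 16-bit integer per UTF-16 code unit.
    units = struct.unpack('<%dH' % (len(utf16) // 2), utf16)
    # Encode each code unit on its own; 'surrogatepass' allows surrogates.
    return b''.join(chr(u).encode('utf-8', 'surrogatepass') for u in units)

if __name__ == '__main__':
    print(encode_surrogates_separately('\U0001d41f\U0001d428\U0001d428'))
```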

EDIT:

MySQL (at least 5.1.47) has no problem dealing with surrogate pairs:

mysql> create table utf8test (t character(128)) collate utf8_general_ci;
Query OK, 0 rows affected (0.12 sec)

  ...

>>> cxn = MySQLdb.connect(..., charset='utf8')
>>> csr = cxn.cursor()
>>> t = u'\U0001d41f\U0001d428\U0001d428'
>>> e = t.encode('utf-16le')
>>> v = ''.join(unichr(x).encode('utf-8') for x in struct.unpack('<' + 'H' * (len(e) // 2), e))
>>> v
'\xed\xa0\xb5\xed\xb0\x9f\xed\xa0\xb5\xed\xb0\xa8\xed\xa0\xb5\xed\xb0\xa8'
>>> csr.execute('insert into utf8test (t) values (%s)', (v,))
1L
>>> csr.execute('select * from utf8test')
1L
>>> r = csr.fetchone()
>>> r
(u'\ud835\udc1f\ud835\udc28\ud835\udc28',)
>>> print r[0]
𝐟𝐨𝐨

Ignacio Vazquez-Abrams 2010-07-10 18:09:10

Perhaps `struct.unpack('<%dH' % (len(e)//2), e)`?

ΤΖΩΤΖΙΟΥ 2010-07-11 08:45:01

(1) The MySQL docs that I referred to declare the charset as part of the column definition: `t character(128) character set utf8` ... are you sure that what you have is equivalent? (2) Try your UTF-16 stunt with Python 3.1 :-)

John Machin 2010-07-11 12:14:26

@John: (1) Retested with `character set utf8` on 2.6. Results were the same. (2) That's just a limitation of the stock UTF-8 codec. It can be worked around with a custom codec. Or with MySQL doing the right thing in the first place.

Ignacio Vazquez-Abrams 2010-07-11 12:27:31

Answer 3

A:

I'm guessing it's not the fastest, but quite straightforward (“pythonic” :) :

def max3bytes(unicode_string):
    return u''.join(uc if uc <= u'\uffff' else u'\ufffd' for uc in unicode_string)

NB: this code does not take into account the fact that Unicode has surrogate characters in the ranges U+D800-U+DFFF.

ΤΖΩΤΖΙΟΥ 2010-07-11 08:23:03

Perhaps it should exclude surrogates. Also: `uc <= u'\uffff'` might be better than `ord(uc) < 65536`

John Machin 2010-07-11 08:28:21

@John: You are correct on both issues.

ΤΖΩΤΖΙΟΥ 2010-07-11 08:40:30

Answer 4

+1 A:

And just for the fun of it, an itertools monstrosity :)

import functools as ft, itertools as it, operator as op

def max3bytes(unicode_string):

    # sequence of pairs of (char_in_string, u'\N{REPLACEMENT CHARACTER}')
    pairs= it.izip(unicode_string, it.repeat(u'\ufffd'))

    # is 65535 strictly less than the argument, i.e. is it outside the BMP?
    selector= ft.partial(op.lt, 65535)

    # using the character ordinals, return 0 or 1 based on `selector`
    indexer= it.imap(selector, it.imap(ord, unicode_string))

    # now pick the correct item for all pairs
    return u''.join(it.imap(tuple.__getitem__, pairs, indexer))
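
The argument order of a partially applied comparison is easy to get backwards, so here is a small Python 3 sketch of just that step. `functools.partial` binds the first positional argument, so `partial(operator.le, 65535)(x)` evaluates `65535 <= x`, which is also true for U+FFFF itself; `operator.lt` gives the strict test. The names `is_above_bmp`, `pick` and `max3bytes_sketch` are illustrative only.

```python
import functools
import operator

# partial() binds the FIRST positional argument: this computes 0xFFFF < x.
is_above_bmp = functools.partial(operator.lt, 0xFFFF)

def pick(ch, replacement='\ufffd'):
    # A bool is a valid tuple index: False -> 0 (keep ch), True -> 1 (replace).
    return (ch, replacement)[is_above_bmp(ord(ch))]

def max3bytes_sketch(text):
    return ''.join(pick(ch) for ch in text)

if __name__ == '__main__':
    print(repr(max3bytes_sketch('a\uffff\U0001d41f')))
```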

ΤΖΩΤΖΙΟΥ 2010-07-11 08:35:51

Answer 5

+1 A:

According to the MySQL 5.1 documentation: "The ucs2 and utf8 character sets do not support supplementary characters that lie outside the BMP." This indicates that there might be a problem with surrogate pairs.

Note that the Unicode standard 5.2 chapter 3 actually forbids encoding a surrogate pair as two 3-byte UTF-8 sequences instead of one 4-byte UTF-8 sequence ... see for example page 93: "Because surrogate code points are not Unicode scalar values, any UTF-8 byte sequence that would otherwise map to code points D800..DFFF is ill-formed." However this proscription is as far as I know largely unknown or ignored.
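
As a rough illustration of that rule, the strict UTF-8 codec in Python 3 rejects such sequences. The helper name `is_well_formed_utf8` below is an assumption made for this sketch, not a standard library function.

```python
def is_well_formed_utf8(data):
    """Return True if the bytes decode under Python 3's strict UTF-8 codec."""
    try:
        data.decode('utf-8')
    except UnicodeDecodeError:
        return False
    return True

if __name__ == '__main__':
    # U+10000 as one 4-byte sequence: accepted.
    print(is_well_formed_utf8(b'\xf0\x90\x80\x80'))
    # The surrogate pair D800 DC00 as two 3-byte sequences: rejected.
    print(is_well_formed_utf8(b'\xed\xa0\x80\xed\xb0\x80'))
```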

It may well be a good idea to check what MySQL does with surrogate pairs. If they are not to be retained, this code will provide a simple-enough check:

all(uc < u'\ud800' or u'\ue000' <= uc <= u'\uffff' for uc in unicode_string)

and this code will replace any "nasties" with U+FFFD:

u''.join(
    uc if uc < u'\ud800' or u'\ue000' <= uc <= u'\uffff' else u'\ufffd'
    for uc in unicode_string
    )
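
To double-check the result of a filter like this, one could measure how many bytes each character actually occupies. The following Python 3 sketch does that; `utf8_width` and `fits_in_three_bytes` are hypothetical helper names. Unlike the check above, it counts a lone surrogate as a 3-byte character rather than rejecting it.

```python
def utf8_width(ch):
    """Number of bytes a single code point occupies in UTF-8."""
    # 'surrogatepass' lets lone surrogates be measured instead of raising.
    return len(ch.encode('utf-8', 'surrogatepass'))

def fits_in_three_bytes(text):
    """True if no character in text needs a 4-byte UTF-8 sequence."""
    return all(utf8_width(ch) <= 3 for ch in text)

if __name__ == '__main__':
    # Widths at the boundaries between 1-, 2-, 3- and 4-byte encodings.
    for cp in (0x7F, 0x80, 0x7FF, 0x800, 0xFFFF, 0x10000):
        print('U+%04X' % cp, utf8_width(chr(cp)))
```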

John Machin 2010-07-11 09:39:34

“However this proscription is as far as I know largely unknown or ignored.”—Hopefully not! At least Python 3 refuses to encode surrogate code points (try `chr(55349).encode("utf-8")`).

Philipp 2010-07-11 10:22:33

@Philipp: Python 3 does seem to do the "right thing" -- however your example is a LONE surrogate which is a different problem; Python 2 passes that test but not this one: `"\xed\xa0\x80\xed\xb0\x80".decode('utf8')` produces `u'\U00010000'` instead of an exception.

John Machin 2010-07-11 11:51:55

Hmmm... You forgot to add the `u` prefix to all strings! It should have been `u'\ufffd'`. ;)

Denilson Sá 2010-07-16 22:05:46
