ansaurus

Question

Replace "abc123def" with "abc 123 def" in multibyte string

Answer 1

+2 A:

Implement a storage wrapper with mb_convert_encoding so internally you only have to manipulate UTF-8.

(I still think you should require UTF-8 and save everyone a lot of trouble.)
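
A minimal sketch of the wrapper idea, written in Python rather than PHP so it can be run as-is. Assumptions: the helper name `via_utf8`, the Shift_JIS sample data, and the trivial `replace` transform are all hypothetical illustrations, not part of the original script. In PHP, mb_convert_encoding would play the role of the decode/encode pair.

```python
from typing import Callable


def via_utf8(raw: bytes, storage_encoding: str,
             transform: Callable[[bytes], bytes]) -> bytes:
    """Convert stored bytes to UTF-8, apply transform, convert back.

    `transform` only ever sees UTF-8 bytes, so it never needs to know
    which encoding the data is stored in.
    """
    # Storage encoding -> UTF-8 on the way in.
    utf8 = raw.decode(storage_encoding).encode("utf-8")
    result = transform(utf8)
    # UTF-8 -> storage encoding on the way out.
    return result.decode("utf-8").encode(storage_encoding)


# Hypothetical stored value in a non-UTF-8 multibyte encoding.
stored = "日本語123テスト".encode("shift_jis")

# The transform is written against UTF-8 only.
spaced = via_utf8(stored, "shift_jis",
                  lambda b: b.replace(b"123", b" 123 "))
```

Here `spaced` is still Shift_JIS on the outside, while the manipulation itself ran on UTF-8.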

mrclay 2009-10-09 04:18:19

I think what I'll ultimately end up doing here is carry on with the script so that all of the base functionality keeps using the mb_ functions, which leaves the encoding changeable, and flag a few advanced features so that they're only available when the active encoding is UTF-8.

joebert 2009-10-24 11:47:04

Answer 2

+1 A:

I think UTF-8 is designed so that any byte in the encoded output with a value of 127 or less is always the ASCII character matching that byte value and never part of a multibyte sequence. So you can continue to pretend the encoding is ASCII in this situation without causing problems (as spaces and digits are all ASCII).

See the description at http://en.wikipedia.org/wiki/UTF-8, which shows that all the bytes in a multibyte sequence have the most significant bit set (i.e. are all > 127).
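
As a rough illustration of that property (in Python rather than PHP, and with hypothetical helper names), the sketch below checks that every byte of a non-ASCII character is above 127 in UTF-8, then inserts spaces around ASCII digit runs using a plain byte-level regex that knows nothing about the encoding.

```python
import re


def multibyte_bytes_have_high_bit(text: str) -> bool:
    """True if every UTF-8 byte of every non-ASCII character is >= 0x80."""
    return all(b >= 0x80
               for ch in text if ord(ch) > 0x7F
               for b in ch.encode("utf-8"))


def space_around_digits(data: bytes) -> bytes:
    """Insert a space at each boundary between an ASCII digit run and
    an adjacent non-digit, non-whitespace byte.

    Works directly on UTF-8 bytes: digits and spaces are ASCII, so a
    match can never land inside a multibyte sequence.
    """
    # Boundary where a non-digit is followed by a digit.
    data = re.sub(rb"(?<=[^\s0-9])(?=[0-9])", b" ", data)
    # Boundary where a digit is followed by a non-digit.
    data = re.sub(rb"(?<=[0-9])(?=[^\s0-9])", b" ", data)
    return data


ascii_result = space_around_digits(b"abc123def")
utf8_result = space_around_digits("日本語123テスト".encode("utf-8"))
```

Both results remain valid UTF-8. The same byte-level trick is not safe for encodings such as UTF-16, where bytes in the ASCII range can occur inside other characters.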

William Rose 2009-10-09 04:35:45

This is definitely true for UTF-8 and ISO-8859-n, but I think he was specifically worrying about wider encodings that would store e.g. Asian text more compactly. (I don't think it's worth worrying about; require UTF-8, live happily.)

mrclay 2009-10-19 17:54:12
