views: 91 · answers: 2

Normally I would just do this.

$str = preg_replace('#(\d+)#', ' $1 ', $str);

If I knew the input was going to be UTF-8, I would add the lowercase "u" modifier to the pattern and I think I would be good. But because of reports of UTF-8 taking 2x, and in some cases 3x, the storage space that the native character set would take, I'm trying not to restrict the application to UTF-8.
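For reference, a minimal sketch of that UTF-8-only approach (the sample string is just an illustration):

```php
<?php
// With the "u" modifier, PCRE treats the pattern and subject as UTF-8,
// so matching operates on code points rather than raw bytes.
$str = "abc123déf456";
$padded = preg_replace('#(\d+)#u', ' $1 ', $str);
echo $padded; // "abc 123 déf 456 "
```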

Thus, I'm trying to stay away from my favorite preg_ functions.

Most things have been fairly simple so far, but I'm a little stuck on replacements where I'd normally use character classes in preg_ functions, such as \d.

+2  A: 

Implement a storage wrapper with mb_convert_encoding so internally you only have to manipulate UTF-8.
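A rough sketch of that idea, with illustrative function names and ISO-8859-1 standing in for whatever encoding the store actually uses:

```php
<?php
// Hypothetical boundary wrappers: convert at the storage edge so all
// internal string manipulation happens on UTF-8.
function to_internal(string $raw, string $storageEncoding): string {
    return mb_convert_encoding($raw, 'UTF-8', $storageEncoding);
}

function to_storage(string $utf8, string $storageEncoding): string {
    return mb_convert_encoding($utf8, $storageEncoding, 'UTF-8');
}

// Example round trip, assuming the store holds ISO-8859-1:
$fromStore = "caf\xE9 1";                                 // "café 1" in ISO-8859-1
$internal  = to_internal($fromStore, 'ISO-8859-1');       // now UTF-8
$internal  = preg_replace('#(\d+)#u', ' $1 ', $internal); // /u is safe here
$backOut   = to_storage($internal, 'ISO-8859-1');         // "caf\xE9  1 "
```

The point is that only the two wrapper functions ever see the native encoding; everything in between can rely on the u modifier and the mb_ functions with a single fixed encoding.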

(I still think you should require UTF-8 and save everyone a lot of trouble.)

mrclay
I think what I'm ultimately going to end up doing here is continuing with the script so that all of the base functionality uses the mb_ functions in a way that the encoding can be changed, and flagging a few advanced features so that they're only available when the active encoding is UTF-8.
joebert
+1  A: 

UTF-8 encoding is designed so that any byte in the encoded output with a value of 127 or less is always the ASCII character matching that byte value, and never part of a multibyte sequence. So you can continue to pretend the encoding is ASCII in this situation and not cause problems (as spaces and digits are all ASCII).

See the description at http://en.wikipedia.org/wiki/UTF-8, which shows that every byte of a multibyte sequence has the most significant bit set (i.e., they are all > 127).
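A quick illustration of that property (the sample string is just an example): a byte-oriented pattern that only matches ASCII digits cannot split a UTF-8 multibyte character, because none of those continuation bytes fall in the 0x30–0x39 range.

```php
<?php
// UTF-8 string containing multibyte characters around an ASCII digit run.
$str = "価格は1500円です";

// No /u modifier: PCRE matches byte-by-byte, but \d only matches the
// bytes 0x30-0x39, which never occur inside a multibyte sequence.
$out = preg_replace('#(\d+)#', ' $1 ', $str);
echo $out; // "価格は 1500 円です"
```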

William Rose
This is definitely true for UTF-8 and ISO-8859-n, but I think he was specifically worrying about wider encodings that would store e.g. Asian text more compactly. (I don't think it's worth worrying about; require UTF-8, live happily.)
mrclay