I have some XML files with figure spaces in them, and I need to remove those with PHP. The UTF-8 byte sequence for these is e2 80 a9. If I'm not mistaken, PHP does not seem to like 3-byte UTF-8 characters; so far, at least, I've been unable to find a way to delete the figure spaces with functions like preg_replace.

Does anybody have any tips or, even better, a solution to this problem?

+2  A: 

Have you tried preg_replace('/\x{2007}/u', '', $stringWithFigureSpaces);?

U+2007 is the Unicode code point for FIGURE SPACE.

Please see my answer on a similar unicode-regex topic with PHP which includes information about the \x{FFFF}-syntax.

Regarding your comment that it does not work: the following works perfectly on my machine:

$ php -a
Interactive shell

php > $str = "a\xe2\x80\x87b";  // \xe2\x80\x87 is the FIGURE SPACE
php > echo preg_replace('/\x{2007}/u', '_', $str); // \x{2007} is the PCRE unicode codepoint notation for the U+2007 codepoint
a_b

What's your PHP version? Are you sure the character is a FIGURE SPACE at all? Can you run the following snippet on your string?

for ($i = 0; $i < strlen($str); $i++) {
    printf('%x ', ord($str[$i]));
}

On my test string this outputs

61 e2 80 87 62
a  |U+2007|  b
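
If reading the code point off the raw bytes by hand is error-prone, the bytes from the dump can be decoded directly. This is only a sketch: it assumes the mbstring extension is loaded and PHP 7.2 or later (for mb_ord), and the variable names are illustrative.

```php
<?php
// Assumption: mbstring is available; mb_ord() was added in PHP 7.2.
$char = "\xe2\x80\x87";               // the three middle bytes from the dump above
$codepoint = mb_ord($char, 'UTF-8');  // decode the UTF-8 sequence to an integer
$label = sprintf('U+%04X', $codepoint);
echo $label, "\n";                    // prints U+2007
```

Pasting the bytes from your own dump into $char shows which code point you are actually dealing with.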

EDIT after OP comment:

\xe2\x80\xa9 is a PARAGRAPH SEPARATOR, which is Unicode code point U+2029, not a FIGURE SPACE (U+2007); that is why the pattern above matched nothing. Your code should be preg_replace('/\x{2029}/u', '', $stringWithUglyCharacter);
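
As a quick check, here is a minimal sketch of the removal on a made-up test string (the string and variable names are assumptions for illustration, not taken from your file). It also shows a plain str_replace on the raw bytes, which should work as an alternative when the input is known to be UTF-8.

```php
<?php
$str = "first\xe2\x80\xa9second";  // \xe2\x80\xa9 is U+2029 PARAGRAPH SEPARATOR

// Regex variant: the /u modifier makes PCRE treat pattern and subject as UTF-8.
$viaRegex = preg_replace('/\x{2029}/u', '', $str);

// Byte-level variant: no regex needed, because the UTF-8 byte sequence is fixed.
$viaBytes = str_replace("\xe2\x80\xa9", '', $str);

var_dump($viaRegex === 'firstsecond');  // bool(true)
var_dump($viaBytes === 'firstsecond');  // bool(true)
```

Note that with the /u modifier preg_replace returns NULL if the subject is not valid UTF-8, so a NULL result is worth checking for on real files.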

Stefan Gehrig
Yeah I did, it did nothing :(
Jeroen Beerstra
I guess it's something else then. It took me some time to get the right characters pasted (the original file is rather large), but it clearly prints e2 80 a9 and not e2 80 87.
Jeroen Beerstra
Somehow I got stuff confused; I've been struggling with this for a while now. Sorry about that, and thank you very much for your assistance!!
Jeroen Beerstra
A: 

Maybe the mb_convert_encoding function can help.

turbod