Hello guys,

I'm having a problem with UTF-8 string comparison that I can't figure out, and it's starting to give me a headache. Please help me out.
Basically, I have this string from an XML document encoded in UTF-8: 'Mina Tidigare anställningar'
When I compare it with exactly the same string typed by hand, 'Mina Tidigare anställningar' (also in UTF-8), the result is FALSE!
I have no idea why. It is so strange. Can someone help me out?

+1  A: 

A blind guess: maybe the two UTF-8 strings do not have the same underlying representation (an accented character can be stored either as a base letter followed by a combining mark or as a single precomposed character). You should give us a hex dump of both UTF-8 strings; then someone may be able to help.
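For what it's worth, here is a minimal sketch of how such a hex dump could be produced in PHP with bin2hex(). The two sample strings are illustrative values written as byte escapes (they are not taken from the asker's files), so nothing gets silently changed by copy and paste.

```php
<?php
// Two strings that render identically but are encoded differently.
// Illustrative values only, written as byte escapes.
$fromXml = "anst\xC3\xA4llningar";   // precomposed ä (U+00E4), 2 bytes
$typed   = "ansta\xCC\x88llningar";  // a + combining diaeresis (U+0308), 3 bytes

// PHP compares strings byte by byte, so these are not equal.
var_dump($fromXml === $typed);  // bool(false)

// bin2hex() exposes the raw bytes behind each string.
echo bin2hex($fromXml), "\n";
echo bin2hex($typed), "\n";
```

Putting the two dumps side by side makes the differing bytes easy to spot.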

kriss
Hi kriss, thank you. This is the hex dump of the string from the XML file: '4d696e6120546964696761726520616e7374c3a46c6c6e696e676172'. And this is the hex dump of the string I typed myself: '4d696e61205469646967617265c2a0616e737461cc886c6c6e696e676172'.
James
Obviously they are different... the problem seems to be in the string you typed yourself. In the XML string you get 20 (a space), but in your typed string there is c2a0 (whatever that is; I would have to decode it). Either way, it's not the same.
kriss
+12  A: 

This seems somewhat relevant. To simplify, there are several ways to get the same text in Unicode (and therefore UTF8): for example, this: ř can be written as one character ř or as two characters: r and the combining ˇ.

Your best bet would be the normalizer class - normalize both strings to the same normalization form and compare the results.
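As a rough sketch of that idea, assuming the intl extension is installed (it provides the Normalizer class) and picking NFC as the common form, which is my choice here rather than a requirement:

```php
<?php
// Requires the intl extension, which provides the Normalizer class.
$precomposed = "\xC5\x99";   // ř as a single code point (U+0159)
$decomposed  = "r\xCC\x8C";  // r followed by the combining caron (U+030C)

// Byte-wise comparison sees two different strings.
var_dump($precomposed === $decomposed);  // bool(false)

// Bring both strings to the same normalization form (NFC) before comparing.
$a = Normalizer::normalize($precomposed, Normalizer::FORM_C);
$b = Normalizer::normalize($decomposed, Normalizer::FORM_C);
var_dump($a === $b);  // bool(true)
```

Any of the normalization forms should work for an equality check, as long as both sides use the same one.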

In one of the comments, you show these hex representations of the strings:

4d696e61205469646967617265 20   616e7374 c3a4   6c6c6e696e676172  // from XML
4d696e61205469646967617265 c2a0 616e7374 61cc88 6c6c6e696e676172 // typed
        ^^-----------------^^^^1         ^^^^^^2

Note the parts I marked; apparently there are two parts to this problem.

  • For the first, observe this question on the meaning of byte sequence "c2a0": for some reason, what you typed was turned into a non-breaking space (U+00A0), where the XML file has a normal space (20). Note that there's a normal space in both cases after "Mina". Not sure what to do about that in PHP, except to replace all whitespace with a normal space.

  • As to the second, that is the case I outlined above: c3a4 is ä (one character, two bytes), whereas 61 is a and cc88 is the combining diaeresis ¨ (two characters, three bytes). Here, the normalization library should be useful.
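Both parts can be handled together. The following is only a sketch under a few assumptions: the intl extension is available, the helper name canonicalize() is made up for illustration, and treating a non-breaking space as equal to a normal space is acceptable for your comparison. The two inputs are the hex dumps quoted above.

```php
<?php
// Hypothetical helper (name invented for this example): fold non-breaking
// spaces into ordinary spaces, then normalize to NFC. Requires ext-intl.
function canonicalize(string $s): string
{
    $s = str_replace("\xC2\xA0", ' ', $s);  // U+00A0 -> U+0020
    return Normalizer::normalize($s, Normalizer::FORM_C);
}

// The two byte sequences from the hex dumps in the question's comments.
$fromXml = hex2bin('4d696e6120546964696761726520616e7374c3a46c6c6e696e676172');
$typed   = hex2bin('4d696e61205469646967617265c2a0616e737461cc886c6c6e696e676172');

var_dump($fromXml === $typed);                              // bool(false)
var_dump(canonicalize($fromXml) === canonicalize($typed));  // bool(true)
```

If the non-breaking space carries meaning in your data, drop the str_replace() step and fix the input at the source instead.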

Piskvor
In this case, a Unicode-aware string comparison library should be able to understand that c3a4 == 61cc88. However I doubt it would consider your non-breaking space to be equal to a normal space. Unless you told it to ignore differences between whitespace. You would need to ask your text editor, browser, or wherever you typed the space, why it translated it to nbsp.
LarsH
@LarsH: With emphasis on the *should* - PHP internally works with bytes, not characters, so I assume you'd have to do `Normalizer::normalize($string1) == Normalizer::normalize($string2)`, or normalize the strings when you load them.
Piskvor
@Piskvor: Right... I wasn't trying to imply that PHP's internal string-comparison routines are Unicode-aware.
LarsH
@LarsH: Even worse - most of PHP's internal functions operate on bytes (I could live with that), but some operate on characters, where the charset is apparently influenced by the phase of the moon (it's somewhere deep in php.ini, and I suspect slight bugginess in some cases). If you can help it, don't do anything with strings in PHP beyond concatenation, and even then be careful.
Piskvor
@Piskvor That's not accurate. There are some functions which depend on the locale. Unfortunately, the manual sometimes omits this information...
Artefacto
Thanks, Piskvor. I have installed the intl extension and used the Normalizer class to solve the problem. :D
James
@James: You're welcome.
Piskvor
A: 

// If $s is not valid UTF-8, assume it is ISO-8859-1 and convert it.
if (mb_detect_encoding($s, "UTF-8", true) !== "UTF-8") { $s = utf8_encode($s); }

DmitryK
Both returned UTF-8...
James