ansaurus

Question

Answer 1

+1 A:

A blind guess: maybe the two UTF-8 strings do not have the same underlying representation (an accented character can be encoded either as a single precomposed character or as a base letter followed by a combining mark). You should give us a hex dump of both UTF-8 strings, and someone may be able to help.
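A minimal sketch of how such a hex dump can be produced in PHP with the built-in `bin2hex()`. The variable names and the sample strings are placeholders chosen for illustration, not the asker's actual data:

```php
<?php
// Hypothetical sample strings: the same visible text, "ä", encoded two ways.
$precomposed = "\xC3\xA4";   // U+00E4: one code point, two bytes
$decomposed  = "a\xCC\x88";  // U+0061 + U+0308: two code points, three bytes

// bin2hex() prints the raw bytes, which makes the difference visible.
echo bin2hex($precomposed), "\n"; // c3a4
echo bin2hex($decomposed), "\n";  // 61cc88

// A plain comparison works on bytes, so the two strings are not equal.
var_dump($precomposed === $decomposed); // bool(false)
```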

kriss 2010-09-03 14:15:28

Hi kriss, thank you. This is the hex dump of the string from the XML file: '4d696e6120546964696761726520616e7374c3a46c6c6e696e676172'. And this is the hex dump of the string I typed myself: '4d696e61205469646967617265c2a0616e737461cc886c6c6e696e676172'.

James 2010-09-03 14:19:52

Obviously they are different... The problem seems to be in the string you typed yourself. In the XML string you get 20 (a space), but in the string you typed you get c2a0 (I would have to decode it to say what it is). Either way, it is clearly not the same.

kriss 2010-09-03 15:23:41

Answer 2

+12 A:

This seems somewhat relevant. To simplify, there are several ways to get the same text in Unicode (and therefore in UTF-8): for example, ř can be written as one character (ř, U+0159) or as two characters: r followed by the combining caron ˇ (U+030C).

Your best bet would be the Normalizer class (part of the intl extension): normalize both strings to the same normalization form and compare the results.
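As a rough sketch of that approach, assuming the intl extension is loaded: the helper name `sameAfterNormalization` is made up for this example, and NFC is just one reasonable choice of normalization form.

```php
<?php
// Assumes the intl extension is loaded; it provides the Normalizer class.
$oneCharacter  = "\xC5\x99";   // U+0159, precomposed r with caron
$twoCharacters = "r\xCC\x8C";  // U+0072 followed by U+030C (combining caron)

// Hypothetical helper: compare two UTF-8 strings after normalizing both to NFC.
// Normalizer::normalize() returns false for invalid UTF-8; that case is not
// handled in this sketch.
function sameAfterNormalization($a, $b)
{
    return Normalizer::normalize($a, Normalizer::FORM_C)
        === Normalizer::normalize($b, Normalizer::FORM_C);
}

var_dump($oneCharacter === $twoCharacters);                      // bool(false)
var_dump(sameAfterNormalization($oneCharacter, $twoCharacters)); // bool(true)
```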

In one of the comments, you show these hex representations of the strings:

4d696e61205469646967617265 20   616e7374 c3a4   6c6c6e696e676172  // from XML
4d696e61205469646967617265 c2a0 616e7374 61cc88 6c6c6e696e676172 // typed
        ^^-----------------^^^^1         ^^^^^^2

Note the parts I marked: apparently there are two parts to this problem.

For the first, see this question on the meaning of the byte sequence "c2a0": for some reason, what you typed was translated to a non-breaking space (U+00A0), whereas the XML file has a normal space. Note that there is a normal space in both cases after "Mina". I'm not sure what to do about that in PHP, except to replace all whitespace with a normal space.

As to the second, that is the case I outlined above: c3a4 is ä (one character, two bytes), whereas 61 is a and cc88 is the combining diaeresis ¨ (U+0308), which makes two characters and three bytes. Here, the normalization library should be useful.
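Both parts could be handled together along these lines. This is only a sketch: it assumes the intl extension and PCRE with Unicode property support, the helper name `canonicalize` is invented for the example, and the two byte strings are the hex dumps quoted above.

```php
<?php
// The two byte strings are rebuilt from the hex dumps quoted in this thread.
$fromXml = hex2bin('4d696e6120546964696761726520616e7374c3a46c6c6e696e676172');
$typed   = hex2bin('4d696e61205469646967617265c2a0616e737461cc886c6c6e696e676172');

// Hypothetical helper: map every Unicode space separator (category Zs, which
// includes U+00A0) to a plain space, then normalize the result to NFC.
// Invalid UTF-8 input is not handled in this sketch.
function canonicalize($s)
{
    $s = preg_replace('/\p{Zs}/u', ' ', $s);
    return Normalizer::normalize($s, Normalizer::FORM_C);
}

var_dump($fromXml === $typed);                             // bool(false)
var_dump(canonicalize($fromXml) === canonicalize($typed)); // bool(true)
```

Whether treating a non-breaking space as a normal space is acceptable depends on the application; it is applied here only because the two strings are meant to be the same text.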

Piskvor 2010-09-03 14:17:40

In this case, a Unicode-aware string comparison library should be able to understand that c3a4 == 61cc88. However, I doubt it would consider your non-breaking space equal to a normal space unless you told it to ignore differences between whitespace characters. You would need to ask your text editor, browser, or wherever you typed the space, why it translated it to nbsp.

LarsH 2010-09-03 14:50:04

@LarsH: With emphasis on the *should* - PHP internally works with bytes, not characters, so I assume you'd have to do `Normalizer::normalize($string1) == Normalizer::normalize($string2)`, or normalize the strings when you load them.

Piskvor 2010-09-03 14:57:40

@Piskvor: Right... I wasn't trying to imply that PHP's internal string-comparison routines are Unicode-aware.

LarsH 2010-09-03 15:38:59

@LarsH: Even worse - most of PHP's internal functions operate on bytes (I could live with that), but some operate on characters, where the charset is apparently influenced by the phase of the moon (it's somewhere deep in php.ini, and I suspect slight bugginess in some cases). If you can help it, don't do anything with strings in PHP beyond concatenation, and even then be careful.

Piskvor 2010-09-03 15:54:36

@Piskvor That's not accurate. There are some functions which depend on the locale. Unfortunately, the manual sometimes omits this information...

Artefacto 2010-09-03 20:58:11

Thanks, Piskvor. I have installed the intl extension and used the Normalizer class to solve the problem. :D

James 2010-09-09 13:18:19

@James: You're welcome.

Piskvor 2010-09-09 13:22:02

Answer 3

A:

// If strict detection says $s is not UTF-8, treat it as ISO-8859-1 and convert it.
if (mb_detect_encoding($s, 'UTF-8', true) !== 'UTF-8') { $s = utf8_encode($s); }

DmitryK 2010-09-03 14:18:15

Both returned UTF-8...

James 2010-09-03 14:22:15

Strange UTF8 string comparison
