I have a sorted list of strings that I move between PHP and Java. To be able to binary-search this data on both sides, I need the same comparison function in both languages.

Any idea which string comparison functions I can use that will always give the same result in both? E.g. PHP's strcmp() vs Java's String.compareTo().

Yes, I know I could write my own string compare that carefully goes char by char, but I was hoping there's a simple answer.

PS: I don't care whether it is case-sensitive or not, as long as it is consistent.

+1  A: 

Since the PHP code in this case is allowed to be slow, I ended up rolling my own:

function unicodeStrCmp($s1,$s2)
{
// designed to give the same result as Java's String.compareTo
// not extensively tested, and doesn't deal with surrogate pairs (characters outside the BMP)
$l1 = mb_strlen($s1);
$l2 = mb_strlen($s2);
$i = 0;
while ($i<$l1 && $i<$l2)
{
    $c1 = mb_convert_encoding(mb_substr($s1,$i,1),'utf-16le');
    $c1 = ord($c1[0])+(ord($c1[1])<<8);
    $c2 = mb_convert_encoding(mb_substr($s2,$i,1),'utf-16le');
    $c2 = ord($c2[0])+(ord($c2[1])<<8);
    $res = $c1-$c2;
    if ($res!=0)
        return $res;
    $i++;
}
return $l1-$l2;
}
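
For reference, here is a minimal Java sketch of the behaviour the PHP function above is trying to mirror. It follows the documented contract of String.compareTo (difference of the first mismatching UTF-16 code units, otherwise the difference in length). The class and method names are made up for illustration.

```java
// Hypothetical illustration class, not part of any library.
final class CompareToSketch {

    // Compares two strings by UTF-16 code unit, the way String.compareTo is documented to.
    static int compareCodeUnits(String a, String b) {
        int n = Math.min(a.length(), b.length());
        for (int i = 0; i < n; i++) {
            char ca = a.charAt(i);
            char cb = b.charAt(i);
            if (ca != cb) {
                return ca - cb; // difference of the first mismatching code units
            }
        }
        return a.length() - b.length(); // one string is a prefix of the other
    }

    public static void main(String[] args) {
        // Both lines should print the same number.
        System.out.println(compareCodeUnits("apple", "apricot"));
        System.out.println("apple".compareTo("apricot"));
    }
}
```

Note that length() counts UTF-16 code units while mb_strlen() counts characters, so the two sides can disagree once a string contains characters outside the BMP.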
steelbytes
Note: this relies on mb_internal_encoding() having been called beforehand, so that the mb_* functions interpret the input strings in the right encoding.
steelbytes
Seems odd that there isn't an existing solution (where's "mb_strcmp" when you need it?), but I couldn't find one either. Looks like this should accurately emulate String.compareTo. +1
David Gelhar
A: 

The other way to do this would be to implement your own 'byte string' class in Java, complete with a compareTo method. The idea would be to avoid converting the byte representations (in UTF-8 encoding, or whatever) into Unicode characters, and thereby avoid the possibility of using the wrong character encoding.

But this would be exceedingly awkward, because all of Java's text handling APIs are based on the String type and are therefore Unicode based (more or less). Besides, if you weren't making any assumptions about character sets or encodings, you wouldn't be able to interpret the bytes in any way; e.g. you couldn't parse out words, etc.
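
A lighter-weight variant of the same idea, as a rough sketch: keep using String in Java, but compare the UTF-8 bytes as unsigned values. That should order strings the same way PHP's byte-wise strcmp() does, assuming both sides hold valid UTF-8 and the Java strings contain no unpaired surrogates. Only the sign of the result is meaningful. The class and method names are hypothetical.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical illustration class, not part of any library.
final class Utf8ByteOrder {

    // Compares two strings by their UTF-8 bytes, treating each byte as unsigned (0..255).
    static int compareUtf8(String a, String b) {
        byte[] x = a.getBytes(StandardCharsets.UTF_8);
        byte[] y = b.getBytes(StandardCharsets.UTF_8);
        int n = Math.min(x.length, y.length);
        for (int i = 0; i < n; i++) {
            int d = (x[i] & 0xFF) - (y[i] & 0xFF); // mask to undo Java's signed bytes
            if (d != 0) {
                return d;
            }
        }
        return x.length - y.length; // shorter byte sequence sorts first
    }

    public static void main(String[] args) {
        // Negative, zero, or positive; only the sign matters.
        System.out.println(compareUtf8("apple", "apricot"));
    }
}
```

Be aware that this order is not identical to String.compareTo: UTF-8 byte order follows code point order, so a character outside the BMP sorts after U+FFFF here, whereas compareTo puts it before. Pick one order and sort the list with it on both sides.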

Stephen C