I am having trouble with a simple search for a two-character Unicode string (the needle) inside another string (the haystack) that may or may not be UTF-8.

Part of the problem is I don't know how to specify the code for use in strpos, and I don't know if PHP has to be compiled with any special support for the code, or if I have to use mb_strpos which I am trying to avoid since it also might not be available.

For example, the needle is U+56DE U+590D (without the space).

With preg_match it might be preg_match("@\x{56DE}\x{590D}@",$haystack), but that actually requires the u modifier (@...@u), which might not be available, and I get "Compilation failed: character value in \x{...} sequence is too large" anyway.

I don't want to use preg_match anyway as it might be significantly slower than strpos (there are other sequences that have to be searched).

Can I convert U+56DE U+590D into its single byte sequence (possibly 5-6 characters) and then search for it via strpos? I can't figure out how to convert it to bytes if so.

How do you specify Unicode inline in PHP anyway? I mean outside of PCRE?

$blah="\u56DE\u590D"; doesn't work?

Thanks for any ideas!

A: 

You wrote 'might not be available'. I suggest you try mb_strpos.

fabrik
+1  A: 

First, your question is poorly structured. It asks several questions at several points. You would probably get more answers if you used a clearer structure: 1) describe the task you're trying to accomplish, 2) the limitations/requirements, 3) the strategy you considered, 4) the difficulties you found with that strategy, and whether there is a better one.

That said, I'll start by the end:

$blah="\u56DE\u590D"; doesn't work?

No. The language doesn't know anything about Unicode. In PHP, strings are byte arrays. Therefore, how you express Unicode code points in a PHP script depends on the encoding you want to use. For UTF-8, it would be "\xE5\x9B\x9E\xE5\xA4\x8D"; for UTF-16 big endian, it would be "\x56\xDE\x59\x0D"; and so on.
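
A minimal sketch of the point above, assuming both the script and the haystack are UTF-8: the needle is written as raw bytes with `\x` escapes and located with plain `strpos`, with no mbstring or PCRE involved. The variable names and the haystack are illustrative. (PHP 7.0 and later also accept `"\u{56DE}\u{590D}"` in double-quoted strings, which expands to the same UTF-8 bytes; note the braces, which the form in the question lacks.)

```php
<?php
// U+56DE U+590D written as its six UTF-8 bytes.
$needle = "\xE5\x9B\x9E\xE5\xA4\x8D";

// Hypothetical haystack, assumed to be UTF-8 encoded.
$haystack = "RE: \xE5\x9B\x9E\xE5\xA4\x8D: hello";

// strpos compares bytes. Compare against false with !==,
// because a match at offset 0 would otherwise look falsy.
$found = strpos($haystack, $needle) !== false;

var_dump($found); // bool(true)
```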

Can I convert U+56DE U+590D into its single byte sequence (possibly 5-6 characters) and then search for it via strpos? I can't figure out how to convert it to bytes if so.

As for the first part (converting U+56DE U+590D into bytes): yes, but a clarification is needed. Are these UTF-16 code units or Unicode code points? For instance, how would a character outside the Basic Multilingual Plane be written: as U+D869 U+DED6 or as U+2A6D6? If they are UTF-16 code units, it's trivial to encode them in UTF-16. For UTF-16 big endian, it's just "\x56\xDE\x59\x0D". Otherwise, it's still trivial to encode them in UTF-32, but it takes a little more work to do the same in UTF-16 (or UTF-8).
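
The "little more work" can be sketched as follows, assuming the inputs are Unicode code points given as integers. The function names `codepoint_to_utf8` and `codepoint_to_utf16be` are made up for this illustration, and the sketch does not reject surrogate values or code points above U+10FFFF.

```php
<?php
// Encode one Unicode code point (an integer) as UTF-8 bytes.
function codepoint_to_utf8($cp)
{
    if ($cp < 0x80) {
        return chr($cp);
    }
    if ($cp < 0x800) {
        return chr(0xC0 | ($cp >> 6)) . chr(0x80 | ($cp & 0x3F));
    }
    if ($cp < 0x10000) {
        return chr(0xE0 | ($cp >> 12))
             . chr(0x80 | (($cp >> 6) & 0x3F))
             . chr(0x80 | ($cp & 0x3F));
    }
    return chr(0xF0 | ($cp >> 18))
         . chr(0x80 | (($cp >> 12) & 0x3F))
         . chr(0x80 | (($cp >> 6) & 0x3F))
         . chr(0x80 | ($cp & 0x3F));
}

// Encode one code point as UTF-16 big endian,
// using a surrogate pair for anything above U+FFFF.
function codepoint_to_utf16be($cp)
{
    if ($cp < 0x10000) {
        return pack('n', $cp);
    }
    $cp -= 0x10000;
    return pack('nn', 0xD800 | ($cp >> 10), 0xDC00 | ($cp & 0x3FF));
}

echo bin2hex(codepoint_to_utf8(0x56DE) . codepoint_to_utf8(0x590D)), "\n"; // e59b9ee5a48d
echo bin2hex(codepoint_to_utf16be(0x2A6D6)), "\n";                        // d869ded6
```

For code points up to U+FFFF the UTF-16 code unit equals the code point, which is why the distinction only matters above that range.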

For the second part, keep reading.

Part of the problem is I don't know how to specify the code for use in strpos, and I don't know if PHP has to be compiled with any special support for the code, or if I have to use mb_strpos which I am trying to avoid since it also might not be available.

What are you trying to do? Why do you need to find a position in a string? strpos will give you a byte offset for a given string (again, interpreted in binary form). Are you trying to clip a string? strpos (or even mb_strpos) means trouble with Unicode: a glyph can consist of several code units, so you risk clipping part of a glyph. I can't advise you further unless you say what you're trying to do.
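
A small sketch of the byte-offset point, assuming a UTF-8 haystack made up for this example: the needle is preceded by a single character, yet `strpos` reports 2 because that character occupies two bytes.

```php
<?php
// Hypothetical UTF-8 haystack: U+00E9 (two bytes) followed by the needle.
$haystack = "\xC3\xA9\xE5\x9B\x9E\xE5\xA4\x8D";
$needle   = "\xE5\x9B\x9E\xE5\xA4\x8D";

$offset = strpos($haystack, $needle);

var_dump($offset);           // int(2): a byte offset, not a character index
var_dump(strlen($haystack)); // int(8): bytes, although the string has three characters
```

Cutting the string at an arbitrary byte offset (for example with `substr`) can therefore split a multi-byte character in half.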

Artefacto
Many thanks for the detailed reply. Sorry about the unfocused question, I should not have tried to ask it at 4am! All I want to do is detect whether the needle exists in the haystack; I don't even need the position. I'll try to post the original Unicode here at the end (but really it's just an example of a general question I have about the rare but necessary need to search for Unicode inside unknown strings). Please let me know how you do the conversion to bytes if it's not obvious. String follows: 回复 In HTML entities it's `&#22238;&#22797;` (Google helpfully converts it)
_ck_
@__ck__ You can convert HTML entities to your encoding of choice (e.g. UTF-8) with `mb_convert_encoding($string, 'UTF-8', 'HTML-ENTITIES')`. If you want to see a byte representation, you can apply `unpack("H*", $convString)` to the result of that.
Artefacto
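
A sketch of the same conversion for setups without mbstring. It swaps in `html_entity_decode`, which is not what the comment above uses, and assumes the input consists of numeric entities such as `&#22238;&#22797;`. `bin2hex` then shows the resulting bytes, much like the `unpack("H*", ...)` call does.

```php
<?php
// Numeric HTML entities for U+56DE U+590D (22238 and 22797 in decimal).
$entities = '&#22238;&#22797;';

// Decode to UTF-8 using only core string functions.
$utf8 = html_entity_decode($entities, ENT_QUOTES, 'UTF-8');

echo bin2hex($utf8), "\n"; // e59b9ee5a48d
```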
@__ck__ As to your other question, as long as the needle and the haystack are in UTF-8, it's safe to search for it in a binary manner with `strpos`. This is **not true** for UTF-16 or UTF-32. You may want, however, to normalize both strings first. See http://www.php.net/Normalizer and the links under "See Also" in that page.
Artefacto
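
Why a binary search is unsafe in UTF-16 can be seen in a small sketch. The strings below are made up for the illustration: in UTF-16 big endian, the bytes of U+590D also appear across the boundary between two unrelated characters, while the UTF-8 form of the same text produces no such match.

```php
<?php
// UTF-16BE: "Y" (U+0059) followed by U+0D05. Neither character is U+590D.
$haystack16 = "\x00\x59\x0D\x05";
$needle16   = "\x59\x0D"; // U+590D in UTF-16BE

// Matches at odd offset 1, straddling two characters: a false positive.
var_dump(strpos($haystack16, $needle16)); // int(1)

// The same text in UTF-8. Lead bytes and continuation bytes use
// disjoint ranges, so a valid needle cannot match across a boundary.
$haystack8 = "Y\xE0\xB4\x85"; // U+0059 U+0D05
$needle8   = "\xE5\xA4\x8D";  // U+590D

var_dump(strpos($haystack8, $needle8)); // bool(false)
```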
Thanks to your persistence and patience, I figured out exactly what I really wanted: how to convert the Unicode directly into its single-byte representation so I could do a binary match instead of relying on mb functions in PHP. I made a little tool with an input form and converted with this: `$unpack=unpack("H*",$_GET['unicode']); echo'\x'.implode('\x',str_split($unpack[1],2));` Which shows me your original six bytes: `\xe5\x9b\x9e\xe5\xa4\x8d` but now I really understand how to get there with any Unicode (actually UTF-8, I think). Thanks again!
_ck_
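
The little tool from the last comment could be wrapped up roughly like this. The function name `to_hex_escapes` is made up for the sketch, and it uses `bin2hex` in place of `unpack("H*", ...)`; both yield the same hex digits. The separator is passed to `implode` first, since PHP 8 no longer accepts the array-first order.

```php
<?php
// Hypothetical helper: render a byte string as a PHP "\x.." escape sequence.
function to_hex_escapes($bytes)
{
    if ($bytes === '') {
        return '';
    }
    // bin2hex gives two hex digits per byte; prefix each pair with \x.
    return '\x' . implode('\x', str_split(bin2hex($bytes), 2));
}

echo to_hex_escapes("\xE5\x9B\x9E\xE5\xA4\x8D"), "\n"; // \xe5\x9b\x9e\xe5\xa4\x8d
```

The output can be pasted into a double-quoted PHP string and used directly as a `strpos` needle, provided the haystack is UTF-8 as well.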