How can I encode strings in UTF-16BE format in PHP? For "Demo Message!!!" the encoded string should be '00440065006D006F0020004D00650073007300610067006'. I also need to encode Arabic characters in this format.

A: 

This isn't UTF-8 encoding. In UTF-8, D is encoded as 44, not 0044, and so on.

+5  A: 

First of all, this is absolutely not UTF-8, which is just a character encoding (i.e. a way to store the characters of a string as bytes).

What you have here looks like a hex dump of the bytes used to encode each character.

If so, you could get those bytes this way:

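// utf8_encode() converts ISO-8859-1 to UTF-8 (deprecated since PHP 8.2);
// for this ASCII-only string the bytes come out unchanged.
$str = utf8_encode("Demo Message!!!");

// strlen() counts bytes, so this loop visits every byte of the string.
for ($i = 0; $i < strlen($str); $i++) {
    $byte = $str[$i];
    $char = ord($byte);      // numeric value of the byte (0-255)
    printf('%02x ', $char);  // two lowercase hex digits, then a space
}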

And you'd get the following output :

44 65 6d 6f 20 4d 65 73 73 61 67 65 21 21 21 


But, once again, this is not UTF-8: in UTF-8, as you can see in the example I've given, D is stored in only one byte: 0x44.

In what you posted, it's stored using two bytes: 0x00 0x44.

Maybe you're using some kind of UTF-16?



EDIT after a bit more testing and @aSeptik's comment: this is indeed UTF-16.

To get the kind of dump you posted, you first have to make sure your string is encoded in UTF-16, for example with the mb_convert_encoding function:

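// Arguments: string, target encoding, source encoding.
// 'UTF-16BE' makes the big-endian byte order explicit.
$str = mb_convert_encoding("Demo Message!!!", 'UTF-16BE', 'UTF-8');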

Then, it's just a matter of iterating over the bytes that make up this string and dumping their values, like I did before:

for ($i=0 ; $i<strlen($str) ; $i++) {
    $byte = $str[$i];
    $char = ord($byte);
    printf('%02x ', $char);
}

And you'll get the following output :

00 44 00 65 00 6d 00 6f 00 20 00 4d 00 65 00 73 00 73 00 61 00 67 00 65 00 21 00 21 00 21 

Which kind of looks like what you posted :-)

(To match it exactly, remove the space in the call to printf and use %02X for uppercase digits; I left the space in to make the output easier to read.)
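
To get the compact uppercase string from the question in one go, the conversion can be combined with bin2hex(). This is only a minimal sketch: the helper name utf16be_hex is made up for illustration, and it assumes PHP 7 or later, UTF-8 input and the mbstring extension.

```php
<?php
// Hypothetical helper (the name is illustrative, not from the answer above).
// Assumes $text is UTF-8 and that the mbstring extension is loaded.
function utf16be_hex(string $text): string
{
    // Convert to UTF-16BE: big-endian, no byte order mark.
    $utf16 = mb_convert_encoding($text, 'UTF-16BE', 'UTF-8');

    // bin2hex() returns two lowercase hex digits per byte.
    return strtoupper(bin2hex($utf16));
}

echo utf16be_hex("Demo Message!!!"), "\n";
// prints 00440065006D006F0020004D006500730073006100670065002100210021

// Arabic letters seen, lam, alef, meem (U+0633 U+0644 U+0627 U+0645)
echo utf16be_hex("\u{0633}\u{0644}\u{0627}\u{0645}"), "\n";
// prints 0633064406270645
```

Arabic letters sit in the Basic Multilingual Plane, so each one comes out as exactly four hex digits, the same as the ASCII characters.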

Pascal MARTIN
he is using UTF-16BE
aSeptik
@aSeptik: Thanks :-) I've edited my answer to add some information about that :-)
Pascal MARTIN
he can also check with `mb_detect_encoding('00440065006D006F0020004D00650073007300610067006');` -> ASCII
aSeptik
Ah, and of course +1 for your answer! ;-)
aSeptik
@Pascal Martin: I got the dump by using printf("%04x", $char); instead of printf("%02x ", $char); in your first snippet. Now I'm confused. What's the difference?
shyam
With %04x, you display 4 hex digits per byte; with %02x, you display 2 hex digits per byte. After that, it's a matter of encoding: with UTF-8, which is what my first piece of code uses, some characters are stored in one byte, others in two or three bytes, and some in four bytes.
Pascal MARTIN
All the characters used in your example string are "simple" (ASCII) characters, stored in 1 byte when using UTF-8, which explains why the first piece of code doesn't output any `00`. With more complex characters, you'll see that you need to iterate byte by byte, and use %02x, to display the value of each byte.
Pascal MARTIN
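A small sketch of that difference (not from the thread), using "é" (U+00E9) as the sample character. It assumes PHP 7 or later for the `\u{...}` escape and the mbstring extension for the real conversion. Padding each UTF-8 byte with %04x only happens to match UTF-16BE for ASCII text:

```php
<?php
// "é" is U+00E9; in UTF-8 it takes two bytes: C3 A9.
$utf8 = "\u{00E9}";

$padded = '';
for ($i = 0; $i < strlen($utf8); $i++) {
    // %04x pads each *byte* to four digits; it does not re-encode anything.
    $padded .= sprintf('%04x', ord($utf8[$i]));
}

// Real UTF-16BE conversion (assumes the mbstring extension is loaded).
$utf16be = bin2hex(mb_convert_encoding($utf8, 'UTF-16BE', 'UTF-8'));

echo $padded, "\n";   // prints 00c300a9
echo $utf16be, "\n";  // prints 00e9
```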
hmm... ok. Thanks! Was really helpful :)
shyam
You're welcome :-) Have fun !
Pascal MARTIN
@Pascal Martin: I need help with one more thing. Instead of printing the encoded string, I want to store the whole string in a variable. How can I do that?
shyam
@shyam: you can use the `sprintf` function *( http://fr2.php.net/sprintf )* instead of `printf`: it returns the result instead of printing it. Just concatenate the result each time `sprintf` is called: `$string .= sprintf('...', ...)`
Pascal MARTIN
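
Spelled out, the sprintf approach from the last comment could look like the sketch below. The variable names are illustrative, and it assumes UTF-8 input and the mbstring extension.

```php
<?php
// Assumes the source string is UTF-8 and mbstring is loaded.
$utf16 = mb_convert_encoding("Demo Message!!!", 'UTF-16BE', 'UTF-8');

$hex = '';
$length = strlen($utf16); // strlen() counts bytes, not characters
for ($i = 0; $i < $length; $i++) {
    // sprintf() returns the formatted text instead of printing it.
    $hex .= sprintf('%02X', ord($utf16[$i]));
}

echo $hex, "\n";
// prints 00440065006D006F0020004D006500730073006100670065002100210021
```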
A: 

E.g. by using the mbstring extension and its mb_convert_encoding() function.

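$in = 'Demo Message!!!';
// Name the source encoding explicitly (assumed to be UTF-8 here) rather
// than relying on mbstring's internal encoding setting.
$out = mb_convert_encoding($in, 'UTF-16BE', 'UTF-8');

// Dump each byte as two uppercase hex digits.
for($i=0; $i<strlen($out); $i++) {
  printf("%02X ", ord($out[$i]));
}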

prints

00 44 00 65 00 6D 00 6F 00 20 00 4D 00 65 00 73 00 73 00 61 00 67 00 65 00 21 00 21 00 21 

Or by using iconv()

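$in = 'Demo Message!!!';
// The first argument must match the actual encoding of $in;
// for UTF-8 input (e.g. Arabic text) use 'UTF-8' instead of 'iso-8859-1'.
$out = iconv('iso-8859-1', 'UTF-16BE', $in);

for($i=0; $i<strlen($out); $i++) {
  printf("%02X ", ord($out[$i]));
}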
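
For the Arabic part of the question, a hedged sketch of the iconv() variant with UTF-8 as the source encoding. It assumes PHP 7 or later, the iconv extension, and that the input really is UTF-8. The sample word is chosen only for illustration.

```php
<?php
// Arabic "marhaba": meem, reh, hah, beh, alef
// (U+0645 U+0631 U+062D U+0628 U+0627), written as UTF-8 escapes.
$in = "\u{0645}\u{0631}\u{062D}\u{0628}\u{0627}";
$out = iconv('UTF-8', 'UTF-16BE', $in);

if ($out === false) {
    // iconv() returns false when the conversion fails.
    exit("Conversion failed\n");
}

echo strtoupper(bin2hex($out)), "\n";
// prints 06450631062D06280627
```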
VolkerK