Hello.

I have some Hebrew websites that contain character references like: &#1504;&#1493;&#1507; (which renders as נוף)

I can only view these letters if I save the file as .html and view in UTF-8 encoding.

If I try to open it as a regular text file then UTF-8 encoding does not show the proper output.

I noticed that if I open a text editor and write Hebrew in UTF-8, each character takes two bytes, not 4 bytes like in this example (&#1493;, which renders as ו)

Any ideas if this is UTF-16 or any other kind of UTF representation of letters?

How can I convert it to normal letters if possible?

Using latest PHP version.

+3  A: 

Those are XML Character References. You want to decode them using html_entity_decode():

$string = html_entity_decode($string, ENT_QUOTES, 'UTF-8');
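
As a minimal sketch of what that call does, the snippet below decodes three decimal references for the Hebrew letters nun, vav and final pe, then compares byte lengths. The variable names are illustrative only, and the sample string is an assumption about what the page source looks like.

```php
<?php
// Three decimal character references (code points 1504, 1493, 1507).
$encoded = '&#1504;&#1493;&#1507;';

// Decode the references into real characters, encoded as UTF-8.
$decoded = html_entity_decode($encoded, ENT_QUOTES, 'UTF-8');

echo $decoded, "\n";          // the three Hebrew letters
echo strlen($encoded), "\n";  // 21: each reference is 7 ASCII bytes
echo strlen($decoded), "\n";  // 6: each Hebrew letter is 2 bytes in UTF-8
```

This also accounts for the size difference noticed in the question: the references are plain ASCII text that names a code point, not a different UTF encoding of the letter.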

For more information, you can search Google for the entity in question. See these few examples:

  1. Hebrew Characters
  2. HTML Entities for Hebrew Characters
  3. UTF-8 Encoding Table with HTML entities
ircmaxell
Those are *not* entities, not even entity references. Those are just character references.
Gumbo
@Gumbo: Fair enough. They are not using the named entity... But the concept is nearly identical (except that no map is needed). I'll edit the answer to reflect that...
ircmaxell
+2  A: 

Those are character references that refer to characters in ISO 10646 by specifying the code point of that character in decimal (&#n;) or hexadecimal (&#xn;) notation.

You can use html_entity_decode, which decodes such character references as well as the entity references for entities defined for HTML 4, so other references like &lt;, &gt;, and &amp; will also get decoded:

$str = html_entity_decode($str, ENT_NOQUOTES, 'UTF-8');
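
To illustrate the effect of the quote flag, here is a small sketch comparing ENT_NOQUOTES with ENT_QUOTES on the same input. The sample string and variable names are made up for the example.

```php
<?php
// Mix of named references, a quote reference, and a numeric reference.
$input = '&lt;b&gt; &quot;&#1493;&quot; &amp;';

// ENT_NOQUOTES leaves &quot; (and single-quote references) untouched.
$noQuotes = html_entity_decode($input, ENT_NOQUOTES, 'UTF-8');

// ENT_QUOTES decodes both double- and single-quote references.
$quotes = html_entity_decode($input, ENT_QUOTES, 'UTF-8');

echo $noQuotes, "\n";  // <b> &quot;ו&quot; &
echo $quotes, "\n";    // <b> "ו" &
```

In both cases the named references &lt;, &gt; and &amp; are decoded along with the numeric one, which is why a separate approach is needed if only the numeric references should change.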

If you just want to decode the numeric character references, you can use this:

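// Callback: convert one matched numeric character reference to UTF-8.
function decode_numeric_reference($match) {
    if (strtolower($match[1][0]) === 'x') {
        // Hexadecimal form (&#xn;): drop the leading "x" before parsing.
        $codepoint = hexdec(ltrim(substr($match[1], 1), '0'));
    } else {
        // Decimal form (&#n;).
        $codepoint = (int)ltrim($match[1], '0');
    }
    // pack('N', ...) yields a 32-bit big-endian value, so the source encoding is UTF-32BE.
    return mb_convert_encoding(pack('N', $codepoint), 'UTF-8', 'UTF-32BE');
}
$str = preg_replace_callback('/&#(x[0-9a-f]+|[0-9]+);/i', 'decode_numeric_reference', $str);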
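
A self-contained variant of the same idea might look like the sketch below. It assumes PHP 7.2 or later with the mbstring extension, so that mb_chr() is available, and the helper name decode_numeric_refs is hypothetical rather than part of any library.

```php
<?php
// Hypothetical helper: decodes only numeric character references and
// leaves named references such as &lt; untouched.
function decode_numeric_refs(string $text): string
{
    return preg_replace_callback(
        '/&#(x[0-9a-f]+|[0-9]+);/i',
        function (array $m): string {
            $isHex = strtolower($m[1][0]) === 'x';
            $codepoint = $isHex ? (int)hexdec(substr($m[1], 1)) : (int)$m[1];
            $char = mb_chr($codepoint, 'UTF-8');
            // mb_chr() returns false for an invalid code point;
            // in that case keep the original reference text.
            return $char === false ? $m[0] : $char;
        },
        $text
    );
}

// Decimal and hexadecimal references are both handled; &lt; and &gt; stay as-is.
$sample = '&lt;p&gt;&#1504;&#x5D5;&#1507;&lt;/p&gt;';
echo decode_numeric_refs($sample), "\n";  // &lt;p&gt;נוף&lt;/p&gt;
```

Keeping the named references encoded can be useful when the result is written back into HTML, since decoding &lt; or &amp; there would change the markup.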
Gumbo
Any particular reason for using `mb_convert_encoding` instead of `iconv`?
ircmaxell
@ircmaxell: No.
Gumbo
You can turn a string represented by a local character set into the one represented by another character set, which may be the Unicode character set. Supported character sets depend on the iconv implementation of your system.
RobertPitt
@Gumbo, fair enough. It's not bad, I was just more curious as to the choice... +1
ircmaxell
+1  A: 

How do I achieve this with libcurl + libiconv?

Do I first need to decode the string using curl_easy_unescape

and then use iconv to convert from ISO 10646 to UTF-8?

If so, an iconv sample that does this would be useful.

Thanks

embedded