To avoid "monster characters", I chose to store non-English characters in the database (MySQL) in Unicode NCR (numeric character reference) form. However, the PDF library I use (FPDF) does not accept the NCR form as a valid format; it prints the data literally, like this:

&#36889;&#20491;&#19968;&#20491;&#20363;&#23376;

but I want it to display like:

這個一個例子

Is there a way to convert the Unicode NCR form back to the original characters?

P.S. The sentence means "this is an example" in Traditional Chinese.

P.S. I know the NCR form wastes storage space, but it is the safest method to store non-English characters. Correct me if I am wrong. Thanks.

A: 

Take a look at html_entity_decode.
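
As a minimal sketch, assuming the stored string contains only standard HTML entities and that UTF-8 output is wanted, the call could look like this (the variable names are only for illustration):

```php
<?php
// The NCR string from the question, as it would come out of the database.
$ncr = '&#36889;&#20491;&#19968;&#20491;&#20363;&#23376;';

// Decode numeric (and named) entities into UTF-8 encoded text.
$text = html_entity_decode($ncr, ENT_QUOTES, 'UTF-8');

echo $text, PHP_EOL; // 這個一個例子
```

The result is UTF-8, so whatever consumes it still has to understand UTF-8 or be given a converted copy.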

PS: The better way would be to use UTF-8 all the way through. Search on SO for questions regarding PHP, MySQL and UTF-8; there are a few that list the possible gotchas.

deceze
Under FPDF, I am afraid the solution is not that easy. I am getting close to a solution and will post it here.
Shivan Raptor
Hum, FPDF doesn't seem to support anything besides ISO-8859-1 (and thereby no Asian characters?). Steven Wittens wrote an experimental extension to add UTF-8 support: http://acko.net/node/56
deceze
A: 

The solution is very complicated.

There are 3 parts to the solution:
Part 1: Install the FPDF Chinese Plug-in
Part 2: Convert NCR format to UTF-8
Part 3: Convert UTF-8 format to BIG5 (or any other target encoding)

Part 1

I fetched the FPDF Chinese Plug-in from here: http://dev.xoofoo.org/modules/content/d1/d6e/a00073.html It is used to display Chinese characters in FPDF, and it brings in all the Chinese fonts needed. To install the plug-in, just include it in your PHP script. (In my case, I also use another plug-in named CellPDF, which clashes with the Chinese Plug-in, so I had to merge the code and resolve the conflicts.)

Part 2

To convert NCR format to UTF-8, I use the following code:

function html_entity_decode_utf8($string)
{
    static $trans_tbl;

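    // replace numeric entities (hexadecimal first, then decimal);
    // callbacks are used because the /e modifier no longer exists in PHP 7+
    $string = preg_replace_callback('~&#x([0-9a-f]+);~i', function ($m) {
        return code2utf(hexdec($m[1]));
    }, $string);
    $string = preg_replace_callback('~&#([0-9]+);~', function ($m) {
        return code2utf((int) $m[1]);
    }, $string);

    // replace literal (named) entities, building the lookup table only once
    if (!isset($trans_tbl))
    {
        $trans_tbl = array();

        // ask for the table in UTF-8 explicitly, so no utf8_encode() is needed
        foreach (get_html_translation_table(HTML_ENTITIES, ENT_COMPAT, 'UTF-8') as $val=>$key)
            $trans_tbl[$key] = $val;
    }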

    return strtr($string, $trans_tbl);
}
function code2utf($num)
{
    if ($num < 128) return chr($num);
    if ($num < 2048) return chr(($num >> 6) + 192) . chr(($num & 63) + 128);
    if ($num < 65536) return chr(($num >> 12) + 224) . chr((($num >> 6) & 63) + 128) . chr(($num & 63) + 128);
    if ($num < 2097152) return chr(($num >> 18) + 240) . chr((($num >> 12) & 63) + 128) . chr((($num >> 6) & 63) + 128) . chr(($num & 63) + 128);
    return '';
}

This code was written by laurynas butkus at php.net (link: http://www.php.net/manual/en/function.html-entity-decode.php). On its own it still gives "monster characters" in the PDF, since its output is UTF-8 and FPDF does not handle that, but I knew it was a good start.
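
To see what code2utf does for a single character, here is the three-byte branch worked by hand for 這 (code point 36889, i.e. U+9019). This is only an illustration of the arithmetic, not extra code the solution needs:

```php
<?php
// Worked example: encode code point 36889 (U+9019) as three UTF-8 bytes.
$num = 36889;

$bytes = chr(($num >> 12) + 224)         // 1110xxxx: top 4 bits   -> 0xE9
       . chr((($num >> 6) & 63) + 128)   // 10xxxxxx: middle 6 bits -> 0x80
       . chr(($num & 63) + 128);         // 10xxxxxx: low 6 bits    -> 0x99

echo bin2hex($bytes), PHP_EOL; // e98099
```

The other branches follow the same pattern with one, two or four bytes, depending on how large the code point is.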

Part 3

After digging deep into php.net, I found a nice function, iconv, for converting between encodings. So I wrapped the above code in the following function:

function ncr_decode($string, $target_encoding='BIG5') {
    // use the $target_encoding argument instead of a hard-coded 'BIG5'
    return iconv('UTF-8', $target_encoding, html_entity_decode_utf8($string));
}

Therefore, if I want to convert the NCR string shown in the question, I only need to call this function:

ncr_decode("&#36889;&#20491;&#19968;&#20491;&#20363;&#23376;");

p.s. by default, I set the target encoding to BIG5.
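
For comparison, a shorter variant can be sketched with the built-in html_entity_decode doing the NCR-to-UTF-8 step in place of html_entity_decode_utf8. The helper name ncr_to_encoding is made up for this sketch, and it assumes the iconv build on the server supports BIG5:

```php
<?php
// Hypothetical helper (not part of the original solution):
// decode NCRs to UTF-8 with the built-in function, then convert with iconv.
function ncr_to_encoding($string, $target_encoding = 'BIG5')
{
    $utf8 = html_entity_decode($string, ENT_QUOTES, 'UTF-8');
    return iconv('UTF-8', $target_encoding, $utf8);
}

$big5 = ncr_to_encoding('&#36889;&#20491;&#19968;&#20491;&#20363;&#23376;');

// Each of these characters takes two bytes in BIG5.
echo strlen($big5), PHP_EOL;

// Converting back to UTF-8 is a quick way to check that nothing was lost.
echo iconv('BIG5', 'UTF-8', $big5), PHP_EOL;
```

Note that iconv returns false when a character cannot be represented in the target encoding, so real code should check the return value before passing it to FPDF.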

That's it!

Shivan Raptor
If you're using PHP version >= 4.3 you should be able to replace the bulk of your code with `html_entity_decode`, which will even output in BIG5 if you tell it to. Or am I missing something?
deceze