I have a MySQL database with book titles in both English and Arabic and I'm using a PHP class that can automatically transliterate Arabic text into Latin script.

I'd like my output HTML to look something like this:

<h3>A book</h3>
<h3>كتاب <em>(kitaab)</em></h3>
<h3>Another book</h3>

Is there a way for PHP to determine the language of a string based on the Unicode characters and glyphs used in it? I'm trying to get something like this:

$Ar = new Arabic('EnTransliteration');
while ($item = mysql_fetch_array($results)) {
    ...
    if (some test to see if $item['item_title'] has Arabic glyphs in it) {
      echo "<h3>$item[item_title] <em>(" . $Ar->ar2en($item['item_title']) . ")</em></h3>";
    } else {
      echo "<h3>$item[item_title]</h3>";
    }
    ...
}

Fortunately the class doesn't choke when fed Latin characters, so in theory I could send every result through the transformation, but that seems like a waste of processing.

Thanks!

Edit: I still haven't found a way to check for glyphs or characters. I suppose I could put all the Arabic characters in an array and check if anything in the array matches a part of the string...

I did, however, figure out an interim solution that might work fine in the end. It puts every title through the transformation regardless of language, but only outputs the parenthetical transliteration if the string was changed:

while ($item = mysql_fetch_array($mysql_results)) {
    $transliterate = trim(strtolower($Ar->ar2en($item['item_title'])));
    $item_title = (strtolower($item['item_title']) == $transliterate) ? $item['item_title'] : $item['item_title'] . " <em>($transliterate)</em>";

    echo "<h3>$item_title</h3>";
}
A: 

Here's an open-source PHP class for automatic detection of Arabic character sets:

http://www.ar-php.com/php/arabic/index.html#ArCharsetD

karim79
The database fields are all set with `utf8_unicode_ci` collation. Does that mean that they are all utf-8 encoded?
Andrew
I just realised that my answer won't work. I'll edit it now.
karim79
That's actually the same class I'm using for the transliteration. Sadly, though, ArCharsetD chokes on any English strings I feed it...
Andrew
+3  A: 

This should do it:

preg_match("/\p{Arabic}/u", $item['item_title'])

You could make that regular expression a bit more sophisticated if you want to, but I don't think you really need to.

The \p escape sequence lets you select characters based on their Unicode properties (when the u pattern modifier is used).
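If it helps, here is a minimal sketch of that test wrapped in a helper. The name containsArabic() is made up for illustration and is not part of the Arabic class; it assumes the titles are UTF-8 strings and that the installed PCRE supports script names.

```php
<?php
// Hypothetical helper (not part of the Arabic class): returns true when
// $text contains at least one character from the Unicode Arabic script.
function containsArabic(string $text): bool
{
    // preg_match() returns 1 on a match, 0 on no match, and false on error
    // (for example invalid UTF-8 input), so compare strictly against 1.
    return preg_match('/\p{Arabic}/u', $text) === 1;
}

// Sample titles taken from the question.
$titles = ['A book', 'كتاب', 'Another book'];
foreach ($titles as $title) {
    echo $title, ': ', containsArabic($title) ? 'Arabic' : 'no Arabic', "\n";
}
```

In the loop from the question, the placeholder condition would then become something like containsArabic($item['item_title']).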

The PHP manual mentions: "Extended properties such as 'Greek' or 'InMusicalSymbols' are not supported by PCRE." But that's not entirely true anymore. PCRE release 6.5 added support for script names.
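If you're unsure what a particular server has, a rough way to check is to look at PHP's built-in PCRE_VERSION constant and, more reliably, to probe the pattern itself. The helper name supportsScriptNames() below is invented for this sketch.

```php
<?php
// PCRE_VERSION is a built-in constant holding the version number followed
// by a release date; keep only the version number before the first space.
$pcreVersion = strtok(PCRE_VERSION, ' ');

// Script names were added in PCRE 6.5, per the note above.
$newEnough = version_compare($pcreVersion, '6.5', '>=');

// Hypothetical helper: probe for script-name support directly instead of
// trusting the version number alone.
function supportsScriptNames(): bool
{
    // @ suppresses the compile warning that a PCRE build without
    // script-name support would raise for this pattern.
    return @preg_match('/\p{Arabic}/u', 'ك') === 1;
}

echo 'PCRE ', $pcreVersion, ': ',
    supportsScriptNames() ? 'script names supported' : 'script names not supported', "\n";
```

If the probe fails, falling back to running every title through the transliteration, as in the question's interim solution, would still work.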

mercator
Wow! What is the \p modifier? I've never seen that! It works perfectly though! I've noticed that in some server configurations it won't work right because of the PCRE configuration. Is this true?
Andrew
I've clarified my answer. I presume some servers have an older PCRE version?
mercator
Yeah, I think that was the main issue I found in my Google research: some PHP configurations use Apache's PCRE rather than PHP's newer, fancier one, so preg_match() calls with \p (or a whole host of other newer features) will fail. I think it's pretty rare, though; all my servers use 7.0 (most 7.8 even).
Andrew