Hi all, thanks for the answers to "regular expression to detect numbers written as words":

http://stackoverflow.com/questions/3608159/regular-expression-to-detect-numbers-written-as-words

I now have this working for English. However, I have the same requirement for number words written in Arabic (or any other language stored as UTF-8), so I tried:

if (preg_match("/\p{L}\b(?:(?:واحد|اثنان|ثلاثة|أربعة|خمسة|ستة|سبعة|ثمانية|تسعة|صفر|عشرة)\b\s*?){4}/", $str, $matches) > 0) 
   return true;

This does not work. I've googled, and there seem to be quite a few issues with preg_match and UTF-8 strings, but I couldn't get any of the suggestions I found to work. Any help is much appreciated.

A: 

Convert both the pattern and $str to windows-1256, do the matching, then convert the $matches items back to UTF-8 if needed. This is the solution I arrived at after struggling with the problem for some time.

$pattern="/\p{L}\b(?:(?:واحد|اثنان|ثلاثة|أربعة|خمسة|ستة|سبعة|ثمانية|تسعة|صفر|عشرة)\b\s*?){4}/";
// Convert the pattern and the subject from UTF-8 to the single-byte windows-1256 encoding
$pattern_windows1256 = iconv('utf-8', 'windows-1256', $pattern);
$str_windows1256 = iconv('utf-8', 'windows-1256', $str);
if (preg_match($pattern_windows1256, $str_windows1256, $matches) > 0) 
   return true;

Here's a test example to check whether the windows-1256 conversion lets preg_match match Arabic letters:

<?php
$pattern="/(واحد|اثنان|ثلاثة|أربعة|خمسة|ستة|سبعة|ثمانية|تسعة|صفر|عشرة)/";
// Convert the pattern once, outside the loop
$pattern_windows1256 = iconv('utf-8', 'windows-1256', $pattern);

$test_cases=array(
    'لدي أربعة أولاد',
    'قفز الثعلب فوق الشجرة',
    'عندي خمسة أرانب',
);
foreach ($test_cases as $str) {
    // Convert each subject to the same encoding as the pattern
    $str_windows1256 = iconv('utf-8', 'windows-1256', $str);

    // Print the original UTF-8 string when it contains a number word
    if (preg_match($pattern_windows1256, $str_windows1256, $matches) > 0) {
        echo $str, '<br />';
    }
}

When executed, it outputs:

لدي أربعة أولاد
عندي خمسة أرانب

I removed part of the pattern to check whether a plain match against Arabic words works, and it does.

aularon
Thanks, what function did you use to convert?
Sherif Buzz
I remember running into problems with `mb_convert_encoding`, so I switched to `iconv` instead. I will update the answer with an example.
aularon
Edited, try it now.
aularon
Unfortunately, it didn't work; it always reports that nothing was found :( Is there some PHP setting I have to change?
Sherif Buzz
I added some test code, try it.
aularon
thank you brother, now it works. Ramadan kareem and thanks a million.
Sherif Buzz
You are welcome : ) Don't forget to mark the answer as **accepted**
aularon
A: 

You can use the pattern modifier u, which makes PCRE treat both the pattern and the subject string as UTF-8, so it works for any language encoded as UTF-8.

if (preg_match("/\p{L}\b(?:(?:واحد|اثنان|ثلاثة|أربعة|خمسة|ستة|سبعة|ثمانية|تسعة|صفر|عشرة)\b\s*?){4}/u", $str, $matches) > 0) 
   return true;
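
As a rough illustration of the u modifier, here is a minimal sketch. The function name hasFourNumberWords is hypothetical, and the sketch assumes the goal is four number words in a row separated by optional whitespace. It also omits the leading \p{L}\b from the question's pattern, which would require a letter immediately before the first number word.

```php
<?php
// Hypothetical helper (not from the thread): true if $str contains four
// Arabic number words in a row, separated by optional whitespace.
function hasFourNumberWords(string $str): bool
{
    $words = 'واحد|اثنان|ثلاثة|أربعة|خمسة|ستة|سبعة|ثمانية|تسعة|صفر|عشرة';
    // The trailing "u" makes PCRE treat pattern and subject as UTF-8.
    $pattern = '/(?:(?:' . $words . ')\s*){4}/u';
    // preg_match returns 1 on a match, 0 on no match, false on error.
    return preg_match($pattern, $str) === 1;
}

var_dump(hasFourNumberWords('واحد اثنان ثلاثة أربعة')); // four number words
var_dump(hasFourNumberWords('لدي أربعة أولاد'));        // only one number word
```

The first call should report true and the second false. Note that preg_match returns false when the subject is not valid UTF-8 under the u modifier, which this sketch treats the same as "no match".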

Colin Hebert
A: 

Note that \b may not work as you expect. \b specifies a word boundary, but what PCRE considers a word character depends on the locale the script is running in (see the bottom of the PCRE escape sequences manual page):

A "word" character is any letter or digit or the underscore character, that is, any character which can be part of a Perl "word". The definition of letters and digits is controlled by PCRE's character tables, and may vary if locale-specific matching is taking place. For example, in the "fr" (French) locale, some character codes greater than 128 are used for accented letters, and these are matched by \w.

You might also want to read Handling UTF-8 with PHP (the section on PCRE in particular).

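Instead, you could use a lookaround in conjunction with a Unicode character property to emulate a word boundary: (?<=\P{L}). This asserts that the previous character is not a Unicode "letter". It needs a previous character to exist, so it cannot match at the very start of the string, and it needs the u modifier so that \P{L} is applied to UTF-8 characters rather than to single bytes.

So all together it would look like:

/(?<=\P{L})(?:(?:واحد|اثنان|ثلاثة|أربعة|خمسة|ستة|سبعة|ثمانية|تسعة|صفر|عشرة)\s*?){4}/u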
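
The lookaround idea can be sketched as follows. This is a variation rather than the exact pattern above: it swaps in negative lookarounds, (?<!\p{L}) and (?!\p{L}), so that a number word at the very start or end of the string still counts, and it checks for a single number word instead of four. The function name containsNumberWord is hypothetical.

```php
<?php
// Hypothetical helper (not from the thread): true if $str contains an
// Arabic number word that is not glued to other letters.
function containsNumberWord(string $str): bool
{
    $words = 'واحد|اثنان|ثلاثة|أربعة|خمسة|ستة|سبعة|ثمانية|تسعة|صفر|عشرة';
    // (?<!\p{L}) : not preceded by a letter (also holds at the string start)
    // (?!\p{L})  : not followed by a letter (also holds at the string end)
    // u          : treat pattern and subject as UTF-8
    $pattern = '/(?<!\p{L})(?:' . $words . ')(?!\p{L})/u';
    return preg_match($pattern, $str) === 1;
}

var_dump(containsNumberWord('عندي خمسة أرانب')); // number word between spaces
var_dump(containsNumberWord('خمسة أرانب'));      // number word at the very start
var_dump(containsNumberWord('الخمسة'));          // number word glued to a prefix
```

The first two calls should report true and the third false, because in the last string the number word is directly preceded by a letter. Whether a prefixed form such as the third one should count as a number word is a design choice this sketch does not try to settle.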
Daniel Vandersluis