ansaurus

Question

regular expression to detect numbers written as words - UTF-8 input

Answer 1

A:

Convert both the pattern and $str to windows-1256, do the matching, then convert the $matches items back to UTF-8 if needed. This is the solution I came to after struggling with this for some time.

$pattern="/\p{L}\b(?:(?:واحد|اثنان|ثلاثة|أربعة|خمسة|ستة|سبعة|ثمانية|تسعة|صفر|عشرة)\b\s*?){4}/";
// Convert both the pattern and the subject from UTF-8 to the single-byte windows-1256 encoding.
$pattern_windows1256 = iconv('utf-8', 'windows-1256', $pattern);
$str_windows1256 = iconv('utf-8', 'windows-1256', $str);
if (preg_match($pattern_windows1256, $str_windows1256, $matches) > 0) {
    return true;
}

Here's a test script to check whether converting the encoding lets preg_match match Arabic words:

<?php
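// Simplified pattern: just the number words, without \b or the {4} repetition.
$pattern="/(واحد|اثنان|ثلاثة|أربعة|خمسة|ستة|سبعة|ثمانية|تسعة|صفر|عشرة)/";
$pattern_windows1256 = iconv('utf-8', 'windows-1256', $pattern);

$test_cases=array(
    'لدي أربعة أولاد',
    'قفز الثعلب فوق الشجرة',
    'عندي خمسة أرانب',
);
foreach ($test_cases as $str) {
    // Convert each subject to the same single-byte encoding as the pattern.
    $str_windows1256 = iconv('utf-8', 'windows-1256', $str);

    if (preg_match($pattern_windows1256, $str_windows1256, $matches) > 0) {
        // Print the original UTF-8 string for every sentence that matched.
        echo $str, '<br />';
    }
}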

When executed, it outputs:

لدي أربعة أولاد
عندي خمسة أرانب

I removed part of the pattern to check whether a plain match against Arabic text works, and it does.

aularon 2010-09-02 18:41:05

Thanks, what function did you use to convert?

Sherif Buzz 2010-09-02 18:58:39

I remember running into problems with `mb_convert_encoding`, so I switched to `iconv` instead. I will update the answer with an example.

aularon 2010-09-02 19:01:26

edited, try now with that.

aularon 2010-09-02 19:09:20

Unfortunately it didn't work, it always returns nothing found :( Is there some PHP setting I have to change?

Sherif Buzz 2010-09-02 20:48:48

I added a test code, try it.

aularon 2010-09-02 21:09:45

thank you brother, now it works. Ramadan kareem and thanks a million.

Sherif Buzz 2010-09-02 21:26:38

You are welcome : ) Don't forget to mark the answer as **accepted**

aularon 2010-09-02 21:36:44

Answer 2

A:

You can use the pattern modifier u, which makes PCRE treat both the pattern and the subject string as UTF-8, so no conversion to a single-byte encoding is needed.

if (preg_match("/\p{L}\b(?:(?:واحد|اثنان|ثلاثة|أربعة|خمسة|ستة|سبعة|ثمانية|تسعة|صفر|عشرة)\b\s*?){4}/u", $str, $matches) > 0)
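
As a rough check of the u modifier, the sketch below runs a simplified version of the pattern (the \b and {4} parts are dropped) directly against UTF-8 strings, with no iconv step. The sample sentences are borrowed from the test in the first answer, the variable names are illustrative, and the script file itself is assumed to be saved as UTF-8.

```php
<?php
// Simplified pattern: only the number words, with the u modifier so that
// pattern and subject are both treated as UTF-8.
$pattern = '/(?:واحد|اثنان|ثلاثة|أربعة|خمسة|ستة|سبعة|ثمانية|تسعة|صفر|عشرة)/u';

// Sample sentences (illustrative); the second one contains no number word.
$testCases = array(
    'لدي أربعة أولاد',
    'قفز الثعلب فوق الشجرة',
    'عندي خمسة أرانب',
);

$matched = array();
foreach ($testCases as $str) {
    $result = preg_match($pattern, $str, $matches);
    if ($result === false) {
        // With the u modifier, preg_match() fails on subjects that are not valid UTF-8.
        echo "Invalid UTF-8 or regex error\n";
        continue;
    }
    if ($result === 1) {
        // Collect the number word that was found in this sentence.
        $matched[] = $matches[0];
    }
}

print_r($matched);
```

If the input might arrive in another encoding, it would need converting to UTF-8 first, since the u modifier rejects invalid UTF-8 rather than guessing.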

Resources :

Pattern modifiers

Colin Hebert 2010-09-02 21:16:25

Answer 3

A:

Note that \b may not work as you expect. \b specifies a word boundary, but what PCRE considers a word character depends on the locale the script is running in (see the bottom of the PCRE escape sequences manual page):

A "word" character is any letter or digit or the underscore character, that is, any character which can be part of a Perl "word". The definition of letters and digits is controlled by PCRE's character tables, and may vary if locale-specific matching is taking place. For example, in the "fr" (French) locale, some character codes greater than 128 are used for accented letters, and these are matched by \w.

You might also want to read Handling UTF-8 with PHP (the section on PCRE in particular).

Instead, you could use a lookaround in conjunction with a Unicode character property to emulate a word boundary: (?<=\P{L}). This asserts that the previous character is not a Unicode "letter".

So all together (with the u modifier, so that \P{L} is applied to UTF-8 characters rather than single bytes) it would look like:

/(?<=\P{L})(?:(?:واحد|اثنان|ثلاثة|أربعة|خمسة|ستة|سبعة|ثمانية|تسعة|صفر|عشرة)\s*?){4}/u
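
To see how the lookbehind behaves, here is a minimal sketch that matches a single number word instead of four. It compares the (?<=\P{L}) form from this answer with a variation that is my own assumption, not part of the answer: a negative lookbehind (?<!\p{L}) plus a negative lookahead (?!\p{L}). The positive lookbehind needs a preceding character, so it cannot match at the very start of the string, while the negative form can. The sample strings and variable names are illustrative.

```php
<?php
// The number words from the pattern above, shared by both variants.
$words = 'واحد|اثنان|ثلاثة|أربعة|خمسة|ستة|سبعة|ثمانية|تسعة|صفر|عشرة';

// Form from the answer: the previous character must exist and be a non-letter.
$positive = '/(?<=\P{L})(?:' . $words . ')/u';

// Assumed variation: no letter directly before or after the word.
$negative = '/(?<!\p{L})(?:' . $words . ')(?!\p{L})/u';

$samples = array(
    // Number word preceded by a space.
    'عندي خمسة أرانب',
    // Number word at the very start of the string.
    'خمسة أرانب',
    // Number word attached to the preceding letter waw.
    'وخمسة',
);

$results = array();
foreach ($samples as $str) {
    // Each entry is array(result with $positive, result with $negative).
    $results[] = array(preg_match($positive, $str), preg_match($negative, $str));
}

var_export($results);
```

Under these assumptions the first sample matches with both patterns, the second only with the negative lookbehind, and the third with neither.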

Daniel Vandersluis 2010-09-02 21:24:58
