I've been scouring Google and can't seem to find an answer. I'm running Oracle 10g Enterprise with the following character set: AR8MSWIN1256.

The database holds English and Arabic values, and I need to differentiate between the two from time to time. I wrote this script, which matches on English words but not on Arabic words:

create or replace function GET_LANGUAGE (v_value in varchar2)
return varchar2 as
begin
    if (REGEXP_LIKE(v_value, '[\x00-\x7F]+')) then
        return 'ENGLISH';
    -- Arabic
    elsif (REGEXP_LIKE(v_value, '[\xA0-\xF2]+')) then
        return 'ARABIC';
    else
        return 'UNKNOWN';
    end if;
end;

Any suggestions? Thanks!

+1  A: 

I know nothing about Arabic, but as far as I can see from the code page on Wikipedia, your script is mostly correct, except that some Arabic characters also seem to fall in the [\x80-\x9F] range, so you might want to use a broader range. Also, I don't know whether Arabic text uses the same ASCII space character as English does, but if it does, you'd want to include that as well, and probably the other control chars in the [\x00-\x1F] range too.

Personally, I would simply drop the "Arabic elsif" and return 'ARABIC' in the else branch, since the Arabic pattern would match most of the charset anyway, and failing to match it doesn't seem a strong way to tell that the text is for sure neither English nor Arabic.

Something like this, which also includes the few English chars at the end of the charset (totally untested):

create or replace function GET_LANGUAGE (v_value in varchar2)
return varchar2 as
begin
    if (REGEXP_LIKE(v_value, '[\x00-\x80\x82-\x89\x8B\x8C\x91-\x97\x99\x9B\x9C\xF3-\xFC]+')) then
        return 'ENGLISH';
    else
        return 'ARABIC';
    end if;
end;
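
As an alternative to the regex above, here is a minimal sketch that skips regular expressions and inspects character codes directly. It is untested against a real AR8MSWIN1256 database and rests on a few assumptions: the name GET_LANGUAGE_BY_CODE is made up for illustration; the value is a VARCHAR2 in a single-byte database character set, so ASCII() returns the Windows-1256 byte value of each character; and any code above 127 is treated as Arabic, which will misclassify the handful of Latin letters and punctuation marks that the code page also places above 0x7F. It also avoids depending on `\xNN` escapes, which, as far as I can tell, Oracle's POSIX-style regular expressions do not interpret as hex codes.

```sql
create or replace function GET_LANGUAGE_BY_CODE (v_value in varchar2)
return varchar2 as
    v_code pls_integer;
begin
    -- Nothing to inspect: an empty string is NULL in Oracle.
    if v_value is null then
        return 'UNKNOWN';
    end if;

    for i in 1 .. length(v_value) loop
        -- ASCII() returns the decimal code of the character
        -- in the database character set.
        v_code := ascii(substr(v_value, i, 1));

        -- Assumption: anything outside 7-bit ASCII counts as Arabic.
        if v_code > 127 then
            return 'ARABIC';
        end if;
    end loop;

    -- Every character was in the 0-127 range.
    return 'ENGLISH';
end;
/
```

It could be tried with something like `select GET_LANGUAGE_BY_CODE(some_column) from some_table;`, where the column and table names are placeholders. If the Latin characters above 0x7F matter for your data, the single `> 127` test would need to be replaced with explicit checks against the code page table.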
lapo