views:

570

answers:

2

In Oracle 10g, I'd like to create a regular expression to list the characters that are different between two strings.

Here is the reason : I have a table with a field that contains sometimes Unicode characters that are not in the french language.

I am able to list the rows containing these non standards characters to make a future cleanup with this query :

SELECT DataID, Name, CONVERT(NAME, 'WE8ISO8859P1', 'WE8DEC')  
  FROM table
  WHERE NAME <> CONVERT(NAME, 'WE8ISO8859P1', 'WE8DEC' )

where WE8ISO8859P1 - West European (that I accept)

and WE8DEC - 8-bit character sets from Digital Equipment Corporation (that i know that the application support)

I imagine that with a Oracle regular expression I would be able to extract the list of all these non standards characters. But I'm not familiar with regexp in Oracle so any help would be appreciated.

Here is my (not working) idea :

select regexp_replace("éaé", '[a-z][A-Z]', '' ) from dual;

would give "é" as a character to cleanup.

+1  A: 

Perhaps something like this might get you going in the right direction:

SQL> select regexp_replace('éaéABcdEF', '([a-zA-Z])', '' ) problems from dual;

PROBLEMS
--------
éé

It gives you each occurrence of the characters you want to identify, but perhaps that's not a problem or you can refine it...

DCookie
A: 

I think your only mistake was using double quotes for the first parameter in your question. Typically double quotes are only used for column/table names.

The parantheses that DCookie added make the regex match multiple times. Depending on what is in the name field, you may need to add more characters to the replace clause, including puncuation (use two single quotes to match a quote). Here's a start for you:

SELECT regexp_replace(name, '([a-zA-Z ,.;''"])\-?','') problem_characters , count(*) 
FROM table
  WHERE NAME <> CONVERT(NAME, 'WE8ISO8859P1', 'WE8DEC' )
group by regexp_replace(name, '([a-zA-Z ,.;''"])\-?','');
Plasmer