tags:

views:

809

answers:

2

I'd like to use Oracle's utl_match.edit_distance function. It supposed to compare two strings and return the Levenshtein distance.

select utl_match.edit_distance('a','b') from dual

returns 1 as expected, but

select utl_match.edit_distance('á','b') from dual

returns 2. Obviously I'd like to get 1.

It seems to be, it does not work correctly for special characters. I'm using Oracle 10.2.0.4 and AL32UTF8 character set.

+1  A: 

I agree, it appears to be wrong. However, this package is undocumented by Oracle, so is perhaps unsupported at present.

Tony Andrews
+2  A: 

This seems to be related to the character set. If I run the same test in a 10.2.0.3 and 11.1.0.7 database using ISO8859P15 as the character set, I get a distance of 1 as well. I'm guessing that Oracle is computing the distance in terms of bytes rather than characters in variable-width character sets.

You can work around the problem using the CONVERT function to convert to a fixed-width character set (AL16UTF16 or a local character set)

SQL> ed
Wrote file afiedt.buf

  1  declare
  2    l_char1 varchar2(1 char) := 'á';
  3    l_char2 varchar2(1 char) := 'b';
  4  begin
  5    dbms_output.put_line(
  6      'In AL32UTF8: ' ||
  7        utl_match.edit_distance( l_char1, l_char2 ) );
  8    dbms_output.put_line(
  9      'In WE8ISO8859P15: ' ||
 10        utl_match.edit_distance(
 11            CONVERT( l_char1, 'WE8ISO8859P15', 'AL32UTF8' ),
 12            CONVERT( l_char2, 'WE8ISO8859P15', 'AL32UTF8' ) ) );
 13    dbms_output.put_line(
 14      'In AL16UTF16: ' ||
 15        utl_match.edit_distance(
 16            CONVERT( l_char1, 'AL16UTF16', 'AL32UTF8' ),
 17            CONVERT( l_char2, 'AL16UTF16', 'AL32UTF8' ) ) );
 18* end;
SQL> /
In AL32UTF8: 2
In WE8ISO8859P15: 1
In AL16UTF16: 1

PL/SQL procedure successfully completed.
Justin Cave
Thanks for the workaround. Quite strange, but comparing 'a' with 'bb' gives 2 in AL32UTF8, 2 in WE8ISO8859P15, but 3 in AL16UTF16.
asalamon74