I have an implementation of the Jaro-Winkler algorithm in my database. I did not write this function. It compares two values and returns the probability that they match.

So jaro(string1, string2, matchnoofchars) will return a result.

Instead of comparing two strings, I want to pass in one string together with matchnoofchars and get back a result set of all values that match with a probability higher than 95%.

For example, the current function returns 97.62% for jaro("Philadelphia","Philadelphlaa",9).

I wish to tweak this function so that I am able to find "Philadelphia" for an input of "Philadelphlaa". What kind of changes do I need to make for this to happen?

I am using Oracle 9i.

+2  A: 

Do you have a list of words that contain words like "Philadelphia"?

And who wrote that function?

Oracle has the utl_match package for fuzzy text comparison: http://download.oracle.com/docs/cd/E14072_01/appdev.112/e10577/u_match.htm

Can't you do

select w1.word from words w1 where jaro(w1.word,'Philadelphlaa', 9) >= 0.95

?

This will select 'Philadelphia' if that word is present in table words.
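
If you want to pass the input in as a parameter and also see the score, the same idea can be sketched like this. It is untested and assumes that jaro can be called from SQL, that it returns a fraction between 0 and 1, and that the reference words live in a table called words, as above. The bind variable names and the match_score alias are made up for illustration.

```sql
-- Sketch only: :input_word and :match_chars are hypothetical bind variables.
-- The inline view computes the score once per row so that it can be both
-- filtered and sorted on.
SELECT word,
       match_score
FROM  (SELECT w1.word,
              jaro(w1.word, :input_word, :match_chars) AS match_score
       FROM   words w1)
WHERE  match_score >= 0.95
ORDER  BY match_score DESC;
```

With :input_word set to 'Philadelphlaa' and :match_chars set to 9, this should return 'Philadelphia' at the top if that word is in the table.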

TTT
The function was written by someone else. I have been able to use it in a PL/SQL block.
abhi
Ok, do you have a list of words? It is hard to understand the question. Where is the word 'Philadelphia' supposed to come from?
TTT
@TTT, the list of words is going to come from two different tables: a master table that has the words with the correct spellings, and a transaction table that could have misspelled words. I have posted my answer. It is not efficient, as it takes a long time to process.
abhi
We are using Oracle 9i. Is the UTL_MATCH package available for 9i?
abhi
I can't find a reference to utl_match in the Oracle 9.2 docs. Oracle 9.2 is ten years old.
TTT
A: 
DECLARE
  -- Correctly spelled city names from the master table
  CURSOR citynames IS
    SELECT city FROM table_loc_master WHERE statecode = 'PQ';
  -- Transaction cities that have no exact match in the master table
  CURSOR leasecity IS
    SELECT city FROM table_loc WHERE State = 'PQ'
    MINUS
    SELECT to_char(city) city FROM table_loc_master WHERE statecode = 'PQ';
  xProb NUMBER(10,8);
BEGIN
  -- Compare every unmatched city against every master city
  FOR x_rec IN leasecity
  LOOP
      FOR y_rec IN citynames
      LOOP
            xProb := jwrun(x_rec.city, y_rec.city, length(y_rec.city));
            -- Report only near-certain matches
            IF xProb > 0.97 THEN
               DBMS_OUTPUT.PUT_LINE('Source : ' || x_rec.city || ' Target: ' || y_rec.city);
            END IF;
      END LOOP;
  END LOOP;
END;
abhi
You can speed it up by assuming that the first 3, 4, or 5 characters are the same. You could also assume both words have roughly equal length. That limits the search.
TTT
+1  A: 

A little dirty but faster (untested!).

Let's assume the first three characters are the same and the lengths are approximately the same (within two characters).

DECLARE
  CURSOR citynames(cp_start in varchar2, cp_length in number) IS
    SELECT city FROM table_loc_master where statecode = 'PQ'
    and   city like cp_start||'%'
    and   length(city) between cp_length -2 and cp_length +2;
  CURSOR leasecity IS
    SELECT city FROM table_loc where State = 'PQ'
    MINUS
    SELECT to_char(city) city FROM table_loc_master where statecode = 'PQ';
  xProb NUMBER(10,8);
BEGIN
  FOR x_rec IN leasecity
  LOOP
      FOR y_rec IN citynames(substr(x_rec.city,1,3), length(x_rec.city))
      LOOP
            xProb := jwrun(x_rec.city,y_rec.city,length(y_rec.city));
            If xProb > 0.97 Then
               DBMS_OUTPUT.PUT_LINE('Source : ' || x_rec.city || ' Target: ' || y_rec.city );
            End if;
      END LOOP;
  END LOOP;
END;
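
The same filter could also be written as a single query instead of nested cursor loops. This is only a sketch, also untested, and it assumes that jwrun can be called from a SQL statement. The aliases t and m and the output column names are made up for illustration. Oracle is free to evaluate the predicates in any order, so there is no guarantee that the cheap prefix and length checks run before jwrun.

```sql
-- Sketch only: set-based version of the block above.
-- t = transaction cities with no exact match in the master table
-- m = candidate master cities sharing the first three characters
--     and having a length within two characters of the source
SELECT t.city AS source_city,
       m.city AS target_city
FROM  (SELECT city FROM table_loc WHERE State = 'PQ'
       MINUS
       SELECT to_char(city) city FROM table_loc_master WHERE statecode = 'PQ') t,
       table_loc_master m
WHERE  m.statecode = 'PQ'
AND    m.city LIKE substr(t.city, 1, 3) || '%'
AND    length(m.city) BETWEEN length(t.city) - 2 AND length(t.city) + 2
AND    jwrun(t.city, m.city, length(m.city)) > 0.97;
```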
TTT
That depends on where in the city name the misspelling occurs. The user input is not a selection but a keyed-in word, which causes a lot of city names to go haywire. Thank you for responding to my question.
abhi