views:

82

answers:

1

Hi,

I'm looking for a full text search algorithm that will allow to find similar program names, for example "Mozilla Firefox" and "Firefox 3.5, or "Adobe Reader" and "Adobe Acrobat Reader v10". The Levenshtein distance is too inefficient in this case, since spelling doesn't change.

It has to use serial scanning (not indexing).

I need maximum precision and minimum errors. What would you recommend?

Thanks!

+2  A: 

Pattern comparison

I used the following to auto-correct some domain names.

The idea is to look at small patterns, for instance 2 chars sequences. Each time that such a sequence is found, the "score" is incremented for the compared sequence. The highest scores are likely to look similar.

Ex: Mozilla Firefox => ['mo', 'oz', 'zi', 'il', 'll', 'la', 'a ', ' f', 'fi', 'ir', 're', 'fo', 'ox']

Results:

  • 'Firefox 3.5' => 5,
  • 'Adobe Reader' => 0,
  • 'Adobe Acrobat Reader v10' => 1

Automatic classification using compression

This one is not full-text based.

The idea here, expressed in this document, is to compare the compression of the concatenation of two items with the concatenation of the compressed items.

Let c be the function that returns the size of the compressed item:

d = c(A) + c(B) - c(A+B)

The smaller d is, the closer A and B are.
An interesting property is that the principle is type-independent and can be used with binaries like music, pictures, videos etc.

Another link, easier to read but in French.

Use SGDB full text functionalities

I am a bit rusty on SQL Server but SQLite or MySQL offer a full text search.
The results include a "rank" value, which can be considered as a similarity score.

In MySQL:

SELECT
  t.*,
  MATCH(my_field) AGAINST 'Mozilla Firefox' as relevance
FROM
  table t
WHERE
  MATCH(my_field) AGAINST 'Mozilla Firefox'
ORDER BY relevance DESC
LIMIT 100
Benoit Vidis
+1 for the link to the compression papers, very cool!
GalacticJello
Thanks for the answers, guys!
Sphynx