ansaurus

Question

Full Text Search: find similar software names

Answer 1

+2 A:

Pattern comparison

I used the following to auto-correct some domain names.

The idea is to look at small patterns, for instance 2 chars sequences. Each time that such a sequence is found, the "score" is incremented for the compared sequence. The highest scores are likely to look similar.

Ex: Mozilla Firefox => ['mo', 'oz', 'zi', 'il', 'll', 'la', 'a ', ' f', 'fi', 'ir', 're', 'fo', 'ox']

Results:

'Firefox 3.5' => 5,
'Adobe Reader' => 0,
'Adobe Acrobat Reader v10' => 1

Automatic classification using compression

This one is not full-text based.

The idea here, expressed in this document, is to compare the compression of the concatenation of two items with the concatenation of the compressed items.

Let c be the function that returns the size of the compressed item:

d = c(A) + c(B) - c(A+B)

The smaller d is, the closer A and B are.
An interesting property is that the principle is type-independent and can be used with binaries like music, pictures, videos etc.

Another link, easier to read but in French.

Use SGDB full text functionalities

I am a bit rusty on SQL Server but SQLite or MySQL offer a full text search.
The results include a "rank" value, which can be considered as a similarity score.

In MySQL:

SELECT
  t.*,
  MATCH(my_field) AGAINST 'Mozilla Firefox' as relevance
FROM
  table t
WHERE
  MATCH(my_field) AGAINST 'Mozilla Firefox'
ORDER BY relevance DESC
LIMIT 100

Benoit Vidis 2010-02-03 16:18:40

+1 for the link to the compression papers, very cool!

GalacticJello 2010-02-03 22:26:07

Thanks for the answers, guys!

Sphynx 2010-02-09 19:29:59

ansaurus

tags:

views:

answers:

Full Text Search: find similar software names

Pattern comparison

Automatic classification using compression

Use SGDB full text functionalities

related questions