Pattern comparison
I used the following to auto-correct some domain names.
The idea is to look at small patterns, for instance 2 chars sequences. Each time that such a sequence is found, the "score" is incremented for the compared sequence.
The highest scores are likely to look similar.
Ex:
Mozilla Firefox
=> ['mo', 'oz', 'zi', 'il', 'll', 'la', 'a ', ' f', 'fi', 'ir', 're', 'fo', 'ox']
Results:
- 'Firefox 3.5' => 5,
- 'Adobe Reader' => 0,
- 'Adobe Acrobat Reader v10' => 1
Automatic classification using compression
This one is not full-text based.
The idea here, expressed in this document, is to compare the compression of the concatenation of two items with the concatenation of the compressed items.
Let c be the function that returns the size of the compressed item:
d = c(A) + c(B) - c(A+B)
The smaller d is, the closer A and B are.
An interesting property is that the principle is type-independent and can be used with binaries like music, pictures, videos etc.
Another link, easier to read but in French.
Use SGDB full text functionalities
I am a bit rusty on SQL Server but SQLite or MySQL offer a full text search.
The results include a "rank" value, which can be considered as a similarity score.
In MySQL:
SELECT
t.*,
MATCH(my_field) AGAINST 'Mozilla Firefox' as relevance
FROM
table t
WHERE
MATCH(my_field) AGAINST 'Mozilla Firefox'
ORDER BY relevance DESC
LIMIT 100