Hi all,

I'm wondering whether the major SQL engines out there (MS SQL, Oracle, MySQL) can understand that two words are related because they share the same root.

We know it's easy to match "networking" when searching for "network" because the latter is a substring of the former.

But do SQL engines have functions that can match "network" when searching for "networking"?

Thanks a lot.

A: 

I think the topic is 'Semantic Similarity'. There are several efforts trying to find optimal solutions to this problem.

Randy
Are you aware of any SQL implementation of this?
No. I think this is current research and, unfortunately, not generally available in a product.
Randy
Actually, this is called lemmatizing and is considered near-solved (though it takes some heavy-duty machine-learned NLP to do it right). Stemming is the lightweight, heuristic version of lemmatizing. Semantic similarity is an even broader topic that is unsolved (and may be AI-complete). http://stackoverflow.com/questions/1787110/what-is-the-true-difference-between-lemmatization-vs-stemming
larsmans
+1  A: 

You can try using SOUNDEX, though it might not be exactly what you want: it matches words that sound alike rather than words that share a root. See http://www.codeproject.com/KB/database/Phonetic_Search_MSSQL.aspx.
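To show what a phonetic match does and does not give you, here is a minimal sketch of the classic American Soundex rules in Python. It is my own illustration, not the code from the linked article, and a database's built-in SOUNDEX may differ in edge cases. The function name `soundex` is just a label I picked.

```python
def soundex(word: str) -> str:
    """Return a 4-character American Soundex code (illustrative sketch)."""
    # Map each consonant group to its Soundex digit.
    codes = {}
    for letters, digit in (
        ("BFPV", "1"), ("CGJKQSXZ", "2"), ("DT", "3"),
        ("L", "4"), ("MN", "5"), ("R", "6"),
    ):
        for ch in letters:
            codes[ch] = digit

    chars = [c for c in word.upper() if c.isalpha()]
    if not chars:
        return ""

    result = chars[0]                  # the first letter is kept as-is
    prev = codes.get(chars[0], "")     # its code still suppresses a repeat
    for ch in chars[1:]:
        if ch in "HW":
            continue                   # H and W do not separate equal codes
        code = codes.get(ch, "")       # vowels (and Y) have no code
        if code and code != prev:
            result += code
            if len(result) == 4:
                break
        prev = code                    # a vowel resets the "previous" code
    return result.ljust(4, "0")        # pad short codes with zeros


print(soundex("network"), soundex("networking"))  # N362 N362
print(soundex("run"), soundex("running"))         # R500 R552
```

"network" and "networking" happen to collide, but only because the code is cut off after four characters. "run" and "running" do not match, so Soundex is not a reliable substitute for stemming.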

Ole Melhus
+5  A: 

This functionality is called a stemmer: an algorithm that can deduce a stem from any form of the word.

This can be quite complex: for instance, the Russian words шёл and иду are different forms of the same verb, though they do not share a single letter (ironically, the same is true of English: went and go).
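As a rough illustration of the idea, here is a toy suffix-stripping stemmer in Python. It is a deliberately naive sketch, not what any database ships: the suffix list, the minimum stem length of 3, and the names `naive_stem` and `same_stem` are all assumptions I made up for the example.

```python
# Suffixes to strip, checked in order (hypothetical, English-only list).
SUFFIXES = ("ing", "ed", "es", "s")


def naive_stem(word: str) -> str:
    """Strip the first matching suffix, keeping at least 3 characters."""
    word = word.lower()
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def same_stem(a: str, b: str) -> bool:
    """Treat two words as related when their naive stems are equal."""
    return naive_stem(a) == naive_stem(b)


print(naive_stem("networking"))           # network
print(same_stem("networking", "network"))  # True
print(same_stem("went", "go"))             # False
```

The regular case works, but irregular pairs like went/go are out of reach for pure suffix rules. Handling them takes a dictionary of word forms, which is why real stemmers and word breakers are language-specific components.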

Word breaking can also be quite a complex task for some languages that use no spaces between words.

SQL Server supports pluggable stemmers and word breakers for its full-text search engine:

http://msdn.microsoft.com/en-us/library/ms142509.aspx

Quassnoi
exactly what I was looking for. GREAT!
+1  A: 

As Quassnoi pointed out, this can be done with stemming. PostgreSQL implements it for full-text search if you turn it on.

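-- blah_en is a placeholder configuration name; asciiword and word are example token types
ALTER TEXT SEARCH CONFIGURATION blah_en ADD MAPPING FOR asciiword, word WITH english_stem;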

This uses the Snowball dictionary, which is based on the Porter stemmer. The Porter stemmer is probably one of the most widely used stemmers, so it will give decent results. It's important to remember, though, that stemming is not always as accurate as you might like.

ealdent