Does anyone know of a "similar words or keywords" algorithm available in open source or via an API? I am looking for something sort of like a thesaurus but smarter.

So for example:

intel

returns:

processor,
i7 core chip,
quad core chip,
.. etc

Any ideas or even something to point me in the right direction in C#?


Edit:

I would love to hear your thoughts, but why can't we just use the Google AdWords API to generate keywords relevant to those entered?

+2  A: 

There is no algorithm for such a thing. You will have to acquire thesaurus data and load it into a data structure; after that it is a simple dictionary lookup (you can use the C# Dictionary class for that). WordNet or the Moby Thesaurus could serve as a source of data. Another option is to use a thesaurus server and fetch the information online as needed.
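A minimal sketch of that load-then-look-up idea, written in Python for brevity; the same shape maps onto a C# `Dictionary<string, List<string>>`. The `word: related, related` line format, the sample entries, and the function names are assumptions made for illustration, not the actual WordNet or Moby Thesaurus file layout.

```python
from collections import defaultdict

# Hypothetical sample data in "word: related1, related2" form; a real
# application would load WordNet or Moby Thesaurus data instead.
SAMPLE_LINES = [
    "intel: processor, chip, cpu",
    "processor: cpu, chip",
]


def load_thesaurus(lines):
    """Build a lowercase word -> list of related words mapping."""
    thesaurus = defaultdict(list)
    for line in lines:
        head, _, rest = line.partition(":")
        key = head.strip().lower()
        if not key:
            continue  # skip blank or malformed lines
        for term in rest.split(","):
            term = term.strip().lower()
            if term and term not in thesaurus[key]:
                thesaurus[key].append(term)
    return dict(thesaurus)


def related_words(thesaurus, word):
    """Return the related words for `word`, or an empty list if unknown."""
    return thesaurus.get(word.strip().lower(), [])


if __name__ == "__main__":
    thesaurus = load_thesaurus(SAMPLE_LINES)
    print(related_words(thesaurus, "Intel"))
```

Keys are lowercased on both load and lookup so that "Intel" and "intel" hit the same entry; the quality of the results depends entirely on the data you load.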

Kris Erickson
+5  A: 

Why not send a search query out to Google and parse what it returns?

Also, check out Google Sets.

Geoffrey Chetwood
Yeah, this is cool, but there is no accessible API :( I'm not sure whether it is possible to use the Google AdWords API to access the keyword technology behind https://adwords.google.com/select/KeywordToolExternal, or even to use something like "Google Suggest" and parse the results?
Andy
@Andy: Sometimes you do not have an API available and you need to do your own screen scraping. This might be one of those times.
Geoffrey Chetwood
@Rich - agreed. It seems HtmlAgilityPack will do the job nicely. The MS link also seems pretty sweet. Thanks for the help.
Andy
A: 

You will need a large database containing this information. The rest is simple: look up the input and see what related words are stored.

The hard part is generating the database. Doing it manually might take years if you want to cover a large number of words and topics. Generating it automatically is surely non-trivial. Maybe you could try to download web pages and analyze which words frequently appear together, but I assume it will still take months to build, tune, and finally gather good-quality data. Extracting links from Wikipedia might be a good source of information because of its semi-structured nature.
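To make the "words frequently appearing together" idea concrete, here is a rough Python sketch that counts how often two words occur in the same document. The sample documents, the tiny stopword list, and the function names are all made up for illustration; a real system would need far more text plus proper weighting and tuning.

```python
import re
from collections import Counter, defaultdict
from itertools import combinations

# Hypothetical stand-in for the text of downloaded web pages.
SAMPLE_DOCS = [
    "Intel released a new quad core processor",
    "The Intel processor uses a fast chip",
    "A thesaurus lists related words",
]

# Tiny illustrative stopword list; real lists are much longer.
STOPWORDS = {"a", "the", "uses", "new"}


def tokenize(text):
    """Return the set of lowercase words in `text`, minus stopwords."""
    words = re.findall(r"[a-z0-9]+", text.lower())
    return {w for w in words if w not in STOPWORDS}


def build_cooccurrence(docs):
    """Count, for each word pair, how many documents contain both words."""
    counts = defaultdict(Counter)
    for doc in docs:
        for a, b in combinations(sorted(tokenize(doc)), 2):
            counts[a][b] += 1
            counts[b][a] += 1
    return counts


def related(counts, word, top_n=3):
    """Return up to `top_n` words that most often co-occur with `word`."""
    neighbours = counts.get(word.lower(), Counter())
    return [w for w, _ in neighbours.most_common(top_n)]


if __name__ == "__main__":
    counts = build_cooccurrence(SAMPLE_DOCS)
    print(related(counts, "intel", top_n=1))
```

Raw document-level counts favor common words, which is part of why the tuning mentioned above takes so long.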

Daniel Brückner
A: 

I've made the OpenOffice thesaurus functions available for .NET in the NHunspell project. You can use the OpenOffice thesaurus files with it. Here is the NHunspell Project

Thomas Maierhofer