Building or Finding a "relevant terms" suggestion feature.

Peter Norvig (Director of Research at Google) talked about how Google does this (specifically mentioning Google Sets) in a recent Facebook Tech Talk. The idea is that a relatively simple algorithm on a huge dataset (e.g. the entire web) works much better than a complicated algorithm on a small dataset.

You could look at Google's n-gram collection as a starting point. You'd start to see what concepts are grouped together. Norvig hinted that internally Google has up to 7-grams for use in things like Google Translate.

If you're more ambitious, you could download all of Wikipedia's articles in the language you desire and create your own n-gram database.
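To make the n-gram idea concrete, here is a minimal sketch of what a homemade n-gram database and a "related terms" lookup might look like. Everything in it is an assumption for illustration: the function names (`build_ngram_counts`, `related_terms`), the tiny `STOPWORDS` set, and the three-sentence toy corpus standing in for Wikipedia. A real version would stream a full dump and keep the counts on disk rather than in memory.

```python
from collections import Counter
import re

# Hypothetical, deliberately tiny stopword list; a real one would be much longer.
STOPWORDS = {"the", "a", "an", "and", "of", "to", "was", "into", "out"}

def tokenize(text):
    """Lowercase the text and split it into word tokens."""
    return re.findall(r"[a-z0-9']+", text.lower())

def build_ngram_counts(documents, n=2):
    """Count every run of n consecutive tokens across all documents."""
    counts = Counter()
    for doc in documents:
        tokens = tokenize(doc)
        for i in range(len(tokens) - n + 1):
            counts[tuple(tokens[i:i + n])] += 1
    return counts

def related_terms(ngram_counts, term, top_k=5):
    """Rank non-stopword words by how often they share an n-gram with `term`."""
    scores = Counter()
    for gram, count in ngram_counts.items():
        if term in gram:
            for word in gram:
                if word != term and word not in STOPWORDS:
                    scores[word] += count
    return [word for word, _ in scores.most_common(top_k)]

# Toy corpus standing in for a real dump of articles.
corpus = [
    "the baseball game went into extra innings",
    "a baseball bat and a baseball glove",
    "the baseball game was rained out",
]
counts = build_ngram_counts(corpus, n=2)
suggestions = related_terms(counts, "baseball", top_k=3)
print(suggestions)
```

On this toy corpus "game" ranks first because it appears next to "baseball" twice. The scoring is just raw co-occurrence counts, which is about as simple as it gets; the point of the answer is that with enough data even something this naive starts to look useful.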

The problem is even more complicated if you just have a single word; check out this recent thesis for more details on word sense disambiguation.

It's not an easy problem, but it is useful, as you mentioned. In the end, I think you'll find that a really successful implementation will have a relatively simple algorithm and a whole lot of data.

Good luck!

WordNet is good, but it will miss out on proper names too:

    $ wn baseball -over

    Overview of noun baseball

    The noun baseball has 2 senses (first 2 from tagged texts)

    1. (21) baseball, baseball game -- (a ball game played with a bat and ...

Hemal Pandya 2009-02-21 02:14:56

drfloob 2009-02-21 02:27:11

Glad it works for you ;-) Many notable (famous) proper names are in WordNet - and for those that are not, I'm sure the database will expand to include more of them over time. (you could probably even contribute to it)

David Zaslavsky 2009-02-21 02:35:25
