Hi all,

Can someone shed some light on how searching is done on web-sites like del.icio.us?

If I enter "js" (1), "javascript" (2) or "java script" (3) as my query on delicious, I'm pointed to resources about JavaScript. However, the returned result sets differ depending on the query (the del.icio.us system returns different sets of bookmarks for the "js" and "javascript" queries).

So it appears the system is not really aware that (1) and (2) are synonyms. Instead, it tries to match my query against bookmarks that contain the query string in either the associated tags or the title. Is that correct?

How would you "educate" the system that (1), (2) and (3) are in fact synonyms, and that regardless of the chosen query the user should see all JavaScript-related resources?

Is it even a good idea to do that?

Thanks, Greg

+1  A: 

Yes: The human brain.

Seriously: programmatically telling synonyms from closely related topics is going to be very, very difficult IMO. Some tag combinations are extremely likely to appear together, say javascript and jquery. Granted, you might be able to infer something from the fact that, say, jquery never occurs without javascript and must therefore be some sort of subset of it; in reality, though, jquery does occur on its own as well. XML and XSLT will very often appear together if properly tagged, but they are not synonymous, and to know this you need somebody with actual knowledge of the technologies to make the call.

I would suggest a pre-filtering system that finds candidates for synonyms, and an administrator doing the actual synonymizing.

Pekka
I heart the brain. And this is a great post. Plus one.
Jason
A: 

There is no perfect solution. You could explicitly declare keywords to be synonyms; everything else will be more or less guesswork.
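To make the explicit option concrete, here is a minimal sketch of a hand-maintained synonym table applied to both tags and queries. The table contents, the sample bookmarks and the function names (`canonical`, `search`) are all made up for illustration; nothing here reflects how delicious actually works.

```python
# Hypothetical, hand-curated synonym table: every variant maps to one canonical tag.
SYNONYMS = {
    "js": "javascript",
    "java script": "javascript",
    "javascript": "javascript",
}

# Made-up sample data: (title, tags) pairs.
BOOKMARKS = [
    ("Closures explained", ["js"]),
    ("DOM basics", ["javascript", "html"]),
    ("Rails guide", ["ruby", "rails"]),
]


def canonical(term):
    """Lowercase, collapse whitespace, then look the term up in the synonym table."""
    key = " ".join(term.lower().split())
    return SYNONYMS.get(key, key)


def search(query, bookmarks):
    """Return titles of bookmarks whose canonicalized tags contain the canonicalized query."""
    wanted = canonical(query)
    return [
        title
        for title, tags in bookmarks
        if wanted in {canonical(tag) for tag in tags}
    ]


if __name__ == "__main__":
    for query in ("js", "javascript", "Java Script"):
        print(query, "->", search(query, BOOKMARKS))
```

With this table all three queries return the same two bookmarks. The cost is that somebody has to maintain the table by hand.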

One approach might be to use a distance metric. In the case of delicious, you would count the number of bookmarks to which both keywords have been applied.

You may get a lot of false positives though. For example, it may be that "ruby" is used together with "rails" less often than vice versa, because "rails" implies "ruby" but "ruby" does not imply "rails". This asymmetry may be a useful property for weeding out related terms from synonyms, which should be used more or less interchangeably.
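A rough sketch of that idea, assuming each bookmark is available as a plain list of tags: count how often tags co-occur, then compare the two conditional frequencies. The function names, the 0.8 threshold and the sample data are arbitrary choices for illustration, and a "synonym candidate" result is only a hint that still needs a human check.

```python
from collections import Counter
from itertools import combinations


def cooccurrence_stats(bookmarks):
    """Count bookmarks per tag and bookmarks per unordered tag pair."""
    tag_counts = Counter()
    pair_counts = Counter()
    for tags in bookmarks:
        unique = sorted(set(tags))  # sorted so each pair has one canonical order
        tag_counts.update(unique)
        pair_counts.update(combinations(unique, 2))
    return tag_counts, pair_counts


def conditional(a, b, tag_counts, pair_counts):
    """Fraction of the bookmarks tagged `a` that are also tagged `b`."""
    if tag_counts[a] == 0:
        return 0.0
    return pair_counts[tuple(sorted((a, b)))] / tag_counts[a]


def classify(a, b, tag_counts, pair_counts, threshold=0.8):
    """Label a tag pair by comparing both directions against an arbitrary threshold."""
    a_then_b = conditional(a, b, tag_counts, pair_counts)
    b_then_a = conditional(b, a, tag_counts, pair_counts)
    if a_then_b >= threshold and b_then_a >= threshold:
        return "synonym candidate"
    if a_then_b >= threshold:
        return f"{a} implies {b}"
    if b_then_a >= threshold:
        return f"{b} implies {a}"
    return "weakly related"


if __name__ == "__main__":
    # Made-up tag lists, one per bookmark.
    sample = [
        ["ruby", "rails"], ["ruby", "rails"], ["ruby", "rails"],
        ["ruby"], ["ruby"],
        ["js", "javascript"], ["js", "javascript"],
    ]
    tag_counts, pair_counts = cooccurrence_stats(sample)
    print(classify("rails", "ruby", tag_counts, pair_counts))
    print(classify("js", "javascript", tag_counts, pair_counts))
```

One caveat with this sketch: real synonyms may rarely appear on the same bookmark at all, since a user who tags something "js" has little reason to add "javascript" as well, so same-bookmark counts alone can miss them.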

mbarkhau
A: 

You might also try tapping into WordNet, a lexical database of English that groups words into sets of synonyms.

Pace
A: 

You could use a technique like LSA (latent semantic analysis) or TF-IDF to try to find out what concepts are contained in your data. This is most likely what del.icio.us does.
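For a feel of what TF-IDF computes, here is a small pure-Python sketch that treats each bookmark's words (title plus tags, say) as one document. The `tf_idf` function and the sample documents are illustrative assumptions, using the plain `tf * log(N / df)` weighting; this is not a claim about what delicious runs.

```python
import math
from collections import Counter


def tf_idf(documents):
    """Return one {term: weight} dict per document, using tf * log(N / df)."""
    n_docs = len(documents)
    doc_freq = Counter()
    for doc in documents:
        doc_freq.update(set(doc))  # count each term once per document
    weights = []
    for doc in documents:
        counts = Counter(doc)
        weights.append({
            term: (count / len(doc)) * math.log(n_docs / doc_freq[term])
            for term, count in counts.items()
        })
    return weights


if __name__ == "__main__":
    # Made-up documents: the words attached to three bookmarks.
    docs = [
        ["javascript", "closures", "tutorial"],
        ["javascript", "dom", "tutorial"],
        ["ruby", "rails", "tutorial"],
    ]
    for doc_weights in tf_idf(docs):
        print(sorted(doc_weights.items(), key=lambda item: -item[1]))
```

Terms that appear in every document (here "tutorial") get weight zero, while rarer terms score higher. LSA goes a step further by applying a singular value decomposition to the resulting term-document matrix, which is not shown here.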

Pace