Hi guys! I am working on an (auto) tag suggestion system (NOT tag autocomplete). Let's say I want to suggest tags for a given question, like here on SO (although SO's tagging system is autocomplete). My main idea is to take the intersection between the set of known tags and the set of words in the question, i.e. set(question.split()). (In Python, set intersection is efficient enough.) Also, to make it a bit more accurate, I might use a word-distance measure to count very close words as 'the same', e.g. movie == movies. For now I am not thinking about using any collaborative filtering technique, such as looking up the tags of similar questions, because I believe that since the question text is pretty short (compared with a blog article or a paper), it is not worth the effort.
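A minimal sketch of that idea, assuming the known tags fit in an in-memory set and that "very close" means a similarity ratio above some threshold. The function name `suggest_tags`, the `cutoff` value of 0.8, and the sample tags are illustrative assumptions; `difflib.get_close_matches` from the standard library stands in for whatever word-distance measure ends up being used.

```python
import difflib
import re


def suggest_tags(question, known_tags, cutoff=0.8):
    """Suggest tags via set intersection plus fuzzy matching of near-identical words."""
    # Lowercase and strip punctuation so "Python?" becomes "python".
    words = set(re.findall(r"[a-z0-9+#]+", question.lower()))
    # Exact matches: plain set intersection.
    suggestions = words & known_tags
    # Near matches: count very close words (e.g. "movies" vs "movie") as the same.
    for word in words - suggestions:
        close = difflib.get_close_matches(word, known_tags, n=1, cutoff=cutoff)
        suggestions.update(close)
    return suggestions


# Hypothetical tag set, for illustration only.
tags = {"python", "movie", "database", "tagging"}
print(sorted(suggest_tags("Which database should I use to store movies in Python?", tags)))
```

The fuzzy pass compares every unmatched word against every tag, so it is only cheap while the tag set is small; a stemmer or a precomputed lookup of word variants would probably scale better.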

So I was wondering if you have any other (more efficient) approaches to suggest. Any ideas, especially from people who have done something like this before, are more than welcome.

A: 

It will be easier if you have some type of training library: groups of words that are associated with a given tag. This means you will need to start out with at least some (preferably a lot of) human-entered tags.

For instance, "class", "method", and "private" could all be associated with an "object-oriented" tag, and so would be members of its word group. The "very close" words would be handled by the same mechanism; they would just be part of the word group. You could even add a weight to each word within the group, so that the word "movies" matches a "movie" tag more strongly than a "theatre" tag, even though it could be in both of their associated word groups.

To auto-tag, you would intersect the question's words with each tag's associated word group, tally the matches, and apply the weights if there are any; the strongest matches become the auto-tags.
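A rough sketch of that scoring, assuming each word group is stored as a dict mapping words to weights. The groups, the weights, and the names `WORD_GROUPS` and `auto_tag` are made up for illustration; in practice they would come from the human-entered tags.

```python
import re

# Hypothetical word groups: tag -> {associated word: weight}.
WORD_GROUPS = {
    "object-oriented": {"class": 1.0, "method": 1.0, "private": 0.8},
    "movie": {"movie": 1.0, "movies": 1.0, "film": 0.7},
    "theatre": {"theatre": 1.0, "stage": 0.8, "movies": 0.3},
}


def auto_tag(question, word_groups, top_n=3):
    """Return up to top_n tags, strongest match first."""
    words = set(re.findall(r"[a-z0-9+#-]+", question.lower()))
    scores = {}
    for tag, group in word_groups.items():
        # Intersect the question's words with this tag's word group.
        matched = group.keys() & words
        if matched:
            # Tally the matches, applying each word's weight.
            scores[tag] = sum(group[word] for word in matched)
    # Highest score first; ties are broken alphabetically.
    return sorted(scores, key=lambda tag: (-scores[tag], tag))[:top_n]


print(auto_tag("Where can I watch old movies?", WORD_GROUPS))
```

Here "movies" sits in both the "movie" and "theatre" groups, but the higher weight ranks "movie" first.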

Brendan Abel
Yeah, I thought about that too and it seems like a very good idea. The only thing is that it will need a lot of manual work to create this "library".
Galois
A: 

Do you want to apply tags from a finite list of predefined tags, or do you want to be able to generate new tags from the text?

Applying predefined tags would be much easier, but still extremely hard to do accurately. You could easily identify potential tags using statistical analysis of word frequencies, but unless you can live with ambiguous tags (is python a snake or a programming language?), you will need to perform contextual analysis.
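As one possible reading of "statistical analysis of word frequencies", here is a small TF-IDF style sketch: predefined tags that occur often in the question but rarely in other questions rank higher. The function names, the smoothing formula, and the sample corpus are assumptions for illustration, and the sketch does nothing about the ambiguity problem, which would still need contextual analysis.

```python
import math
import re
from collections import Counter


def tokenize(text):
    return re.findall(r"[a-z0-9+#]+", text.lower())


def rank_candidate_tags(question, corpus, known_tags, top_n=3):
    """Rank predefined tags found in the question by a simple TF-IDF score."""
    docs = [set(tokenize(doc)) for doc in corpus]
    counts = Counter(tokenize(question))
    scores = {}
    for word, tf in counts.items():
        if word not in known_tags:
            continue
        # Document frequency: how many earlier questions contain the word.
        df = sum(word in doc for doc in docs)
        # Smoothed inverse document frequency: rarer words score higher.
        idf = math.log((1 + len(docs)) / (1 + df)) + 1
        scores[word] = tf * idf
    return sorted(scores, key=lambda word: (-scores[word], word))[:top_n]


# Hypothetical earlier questions and tag list.
corpus = [
    "how to read a file in python",
    "python list sorting",
    "sql join on two tables",
]
known_tags = {"python", "sql", "file"}
question = "python script to load a file into sql, then query sql again"
print(rank_candidate_tags(question, corpus, known_tags))
```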

If you want to extract new, unknown tags from text, I hope you have a couple of PhDs working for you.

mikerobi
Actually, there is a database of tags, and I am also going to let users add their own tags, which I will then add to my database if they are not there already. Think of something like delicious, without the autocompletion part. (BTW, I am a PhD guy :) but no, there is no need to extract unknown tags.)
Galois