views:

516

answers:

3

This is a real issue that applies on tagging items in general (and yes, this applies to StackOverflow too, and no, it is not a question about StackOverflow).

The whole tagging issue helps cluster similar items, whatever items they may be (jokes, blog posts, so questions etc). However, there (usually but not strictly) is a hierarchy of tags, meaning that some tags imply other tags too. To use a familiar example, the "c#" so tag implies also ".net"; another example, in a jokes database, a "blondes" tag implies the "derisive" tag, similarly to "irish" or "belge" or "canadian" etc depending on the joke's country origin.

How have you handled this, if you have, in your projects? I will supply an answer describing two different methods I have used in two separate cases (actually, the same mechanism but implemented in two different environments), but I am also interested not only on similar mechanisms, but also on your opinion on the hierarchy issue.

+1  A: 

The mechanism I have implemented was to not use the tags given themselves, but an indirect lookup table (not strictly DBMS terms) which links a tag to many implied tags (obviously, a tag is linked with itself for this to work).

In a python project, the lookup table is a dictionary keyed on tags, with values sets of tags (where tags are plain strings).

In a database project (indifferent which RDBMS engine it was), there were the following tables:

[Tags]
tagID integer primary key
tagName text

[TagRelations]
tagID integer # first part of two-field key
tagID_parent integer # second part of key
trlValue float

where the trlValue was a value in the (0, 1] space, used to give a gravity for the each linked tag; a self-to-self tag relation always carries 1.0 in the trlValue, while the rest are algorithmically calculated (it's not important how exactly). Think the example jokes database I gave; a ['blonde', 'derisive', 0.5] record would correlate to a ['pondian', 'derisive', 0.5] and therefore suggest all derisive jokes given another.

ΤΖΩΤΖΙΟΥ
+1  A: 

This is a tough question. The two extremes are an ontology (everything is hierarchical) and a folksonomy (tags have no hierarchy). I have answered this on WikiAnswers, with a reference to Clay Shirky's "Ontology is Overrated" article which claims you should set no hierarchy.

Yuval F
Clay Shirky's article was very interesting. Obviously, the proximity factor (in the database example) was introduced to soften relating terms (an in the article example of 'gay' and 'queer').
ΤΖΩΤΖΙΟΥ
+3  A: 

Actually I would say that it is not so much a hierarchical system but a semantic net with felt distancies between tags meanings. What do I mean: mathematics is closer to experimental physics then to gardening.

Possibility to build such a net: Build pairs of tags and let people judge the perceived distance (using a measure like 1-10, meaning something like [synonyms, alike,...,antonyms], ...) and when searching, search for all tags within a certain distance.

Does a measure have to be equal distance if coming from the oposite direction ([a,b] close -> [b,a,] close)? Or does proximity imply [a,b] close and [b,c] close -> [a,b] close?

Maybe the first word will by default trigger another semantic field? If you start at "social worker", "analyst" ist near. If you start at "programmer", "analyst" is near as well. But starting at any of these points, you probably would not count the other as near ("sozial worker" is by no means close to "programmer").

You therefore would have only pairs judged and judged in both directions (in random order).

[TagRelations]
tagId integer
closeTagId integer
proximity integer

Example for selection of similar tags:

select closeTagId from TagRelations where tagId = :tagID and proximity < 3
Ralph Rickenbach
The proximity is one-way; if it should be two-way, then a different record with a different proximity would be inserted.
ΤΖΩΤΖΙΟΥ