views:

92

answers:

4

I'm trying to understand the stategy or idea's for building spacial maps of related/common keywords or tags. Using SO as an example; if you go to http://stackoverflow.com/tags and type in "python" you will get all tags that have that word in it, but no tags that might be closely related ( WSGI, Google's App Engine, flying, etc ).

In line with my question, how could you build a spatial map that could be queried to find closely related tags/keywords from the search, ordered by their weight? But then how to store say tag foo's weight to a potentially larger number of tags and still keep the system responsive?

I've already watched this Google Tech-talk by David Weinberger which is a great tech talk that has gotten me thinking. http://video.google.com/videoplay?docid=2159021324062223592&ei=qseASZvgI6e4qAP91a2PDg&q=google+tech+talk

A: 

You need a good search engine. ;)

Do it yourself: implementing one of the similarity algorithms. For example: Levenshtein distance or Dice's coefficient.

Or use something ready to use like Lucene.

Eduard Wirch
Levenshtein distance finds similarity in "python" and "pithon", not "python" and "WSGI"
Sparr
A: 

It seems that the most likely way to build the data regarding such relationships would be to catalog which tags appear together the most often, while appearing together with the least number of other tags.

That is, "c++" and "stl" appear together a lot, and "stl" rarely(?) appears without "c++", so they are related (in at least one direction). "c++" and "algorithm" also appear together a lot, but they appear apart even more often, so they are not related.

Sparr
That sounds more like an Informed search ( http://en.wikipedia.org/wiki/Search_algorithm#Informed_search ) which is usually tree based and relies on a human to dictate structure.
David
The structure would be dictated by existing tagged questions
Sparr
A: 

In thinking of how the data could be structured, one idea I had could possibly be a four tables system. one table would be source data (ex. with SO there has to be some sort of question table), which is joined to a tag table and then a tag weight table that joins back to the tag table.

#pseudo code
     source table {
     id: int
     source_data: text   
     }

     source_tag table {
        source_id: int
        tag_id: int
     }

     tag table{
      id: int
      tag: String(30)
     }

    tag_weight table {
        base_tag_id: int
        weight: float( 0-10 or 100 ) or int ( count of mutual occurrence )
        source_tag_id: int      
    }

I have no idea how efficient this structure is, but I suppose its something to work on. Otherwise to make it work, new admissions to source data could fire of an after update trigger or have a worker process in the background rebalance the weights at preset times.

David
+1  A: 

Check the clustering concepts from O'Reilly's "Programming Collective Intelligence".

aldrinleal