Suppose I have a large list of objects (thousands or tens of thousands), each of which is tagged with a handful of tags. There are dozens or hundreds of possible tags and their usage follows a typical power law: some tags are used extremely often but most are rare. All but the most frequent couple dozen tags could typically be ignored, in fact.

Now the problem is how to visualize the relationship between these tags. A tag cloud is a nice visualization of just their frequencies but it ignores which tags occur with which other tags. Suppose tag :bar only occurs on objects also tagged :foo. That should be visually apparent. Similarly for three tags that tend to occur together.
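Before choosing a picture, it may help to put numbers on those relationships. Here is a minimal sketch in plain Python (any language would do); the sample `objects` list and the helper name `fraction_also_tagged` are made up for illustration. It counts how often each tag and each pair of tags occurs, then reports what fraction of the objects carrying one tag also carry another. A value of 1.0 means the first tag is a subset of the second.

```python
from collections import Counter
from itertools import combinations

# Hypothetical sample data: each object is represented by its set of tags.
objects = [
    {"foo", "bar"},
    {"foo", "bar", "baz"},
    {"foo"},
    {"foo", "baz"},
    {"qux"},
]

tag_counts = Counter()   # number of objects carrying each tag
pair_counts = Counter()  # number of objects carrying both tags of a pair

for tags in objects:
    tag_counts.update(tags)
    # Sort first so each unordered pair is always stored under the same key.
    pair_counts.update(combinations(sorted(tags), 2))


def fraction_also_tagged(tag, other):
    """Fraction of the objects tagged `tag` that are also tagged `other`."""
    pair = tuple(sorted((tag, other)))
    return pair_counts[pair] / tag_counts[tag]


print(fraction_also_tagged("bar", "foo"))  # 1.0: every :bar object is also :foo
print(fraction_also_tagged("foo", "bar"))  # 0.5: half of the :foo objects are :bar
```

The asymmetry between the two printed values is exactly the ":bar only occurs with :foo" relationship that a plain tag cloud hides.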

You could make each tag a bubble and let them partially overlap with each other. Technically that's a Venn diagram but treating it that way might be unwieldy. For example, Google Charts can create Venn diagrams, but only for 3 or fewer sets (tags): http://code.google.com/apis/chart/docs/gallery/venn_charts.html
The reason they limit it to 3 sets is that with any more it looks horrendous. See "Extensions to higher numbers of sets" on the Wikipedia page: http://en.wikipedia.org/wiki/Venn_diagrams

But that's only if every possible intersection is non-empty. If no more than 3 tags ever co-occur (maybe after throwing out the rare tags) then a collection of Venn diagrams could work (with the sizes of the bubbles representing tag frequency).
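Whether that condition holds is easy to check. A rough sketch in Python, assuming each object is a set of tags; the sample data, the helper name `keep_frequent`, and the cutoff `top_k=3` are all illustrative choices, not anything from my real data. It throws out all but the most frequent tags and then finds the largest number of tags left on any single object.

```python
from collections import Counter

# Hypothetical sample data: each object is represented by its set of tags.
objects = [
    {"a", "b", "c", "rare1"},
    {"a", "b"},
    {"a", "c", "rare2"},
    {"b", "c"},
    {"a"},
]


def keep_frequent(objects, top_k):
    """Drop every tag that is not among the top_k most frequent tags."""
    counts = Counter(tag for tags in objects for tag in tags)
    frequent = {tag for tag, _ in counts.most_common(top_k)}
    return [tags & frequent for tags in objects]


filtered = keep_frequent(objects, top_k=3)

# Largest number of tags that still co-occur on one object after filtering.
largest = max(len(tags) for tags in filtered)
print(largest)  # 3
```

If `largest` stays at 3 or below for a sensible cutoff, the collection-of-Venn-diagrams idea is at least feasible.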

Or perhaps a graph (as in vertices and edges) with visually thicker or thinner edges to represent frequency of co-occurrence.
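The data behind such a graph is just a weighted edge list. A small Python sketch under the same assumption that each object is a set of tags; `cooccurrence_edges`, `edge_width`, the `min_count` threshold, and the linear width scaling are my own illustrative choices. The resulting tuples could be handed to whatever graph-drawing library ends up being used.

```python
from collections import Counter
from itertools import combinations

# Hypothetical sample data: each object is represented by its set of tags.
objects = [
    {"foo", "bar"},
    {"foo", "bar"},
    {"foo", "baz"},
    {"bar", "baz"},
    {"foo", "bar", "baz"},
]


def cooccurrence_edges(objects, min_count=1):
    """Return (tag_a, tag_b, count) edges, heaviest first."""
    pair_counts = Counter()
    for tags in objects:
        # Sort so each unordered pair is always stored under the same key.
        pair_counts.update(combinations(sorted(tags), 2))
    edges = [(a, b, n) for (a, b), n in pair_counts.items() if n >= min_count]
    return sorted(edges, key=lambda e: (-e[2], e[0], e[1]))


def edge_width(count, max_count, max_width=8.0):
    """Scale a co-occurrence count linearly to a stroke width."""
    return max_width * count / max_count


edges = cooccurrence_edges(objects)
heaviest = edges[0][2]
for a, b, n in edges:
    print(a, b, n, round(edge_width(n, heaviest), 2))
```

Raising `min_count` prunes rare pairings, which should keep the graph readable given the power-law tag usage.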

Do you have any ideas, or pointers to tools or libraries? Ideally I'd do this with javascript but I'm open to things like R and Mathematica or really anything else. I'm happy to share some actual data (you'll laugh if I tell you what it represents) if anyone is curious.

+1  A: 

I would create something like this if you are targeting the web. Edges connecting the nodes could be thicker or darker in color, or perhaps a stronger force connecting them so they are close in distance. I would also add the tag name inside the circle.

Some libraries that would be very good for this include:

Some other fun javascript libraries worth looking into are:

Jay Askren
Thanks Jay! I didn't know about protovis; that's excellent. As for platform/language, web/javascript is ideal but if Mathematica or R or something has a great way to do this, I'd love to know about that as well. As for the force-directed layout, what I don't like about that is that it's not capturing the subset relationships. Maybe something like this -- http://vis.stanford.edu/protovis/ex/bubble.html -- but where the bubbles can be inside each other.
dreeves
Thanks again, Jay. Protovis seems to be javascript, not flash, if I'm not mistaken.
dreeves
Oops. You are correct. I accidentally reversed Protovis and Flare. Should be correct now.
Jay Askren
+3  A: 

If I understand your question correctly, an "image matrix" should work nicely here. The implementation I have in mind would be an n x m matrix in which the tagged items are rows and the tags are columns. The matrix consists entirely of 1's and 0's, i.e., a particular item either has a given tag or it doesn't.
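For real data (as opposed to the simulation further down), the matrix has to be built from the tagged items first. A minimal sketch of that step in Python, assuming each item is available as a set of tags; the sample `items` and the helper name `incidence_matrix` are hypothetical.

```python
# Hypothetical sample data: each tagged item is represented by its set of tags.
items = [
    {"foo", "bar"},
    {"foo"},
    {"baz"},
    {"foo", "bar", "baz"},
]


def incidence_matrix(items):
    """Return (tags, matrix): one row per item, one 0/1 column per tag."""
    tags = sorted(set().union(*items))  # fixed, alphabetical column order
    matrix = [[1 if tag in item else 0 for tag in tags] for item in items]
    return tags, matrix


tags, matrix = incidence_matrix(items)
print(tags)
for row in matrix:
    print(row)
```

Each row can then be drawn as one strip of colored cells by any heat-map or image routine.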

In the matrix below (which I rotated 90 degrees so it would fit better in this window, so columns actually represent tagged items, and each row shows the presence or absence of a given tag across all items), I simulated the scenario in which there are 8 tags and 200 tagged items. A "0" is blue and a "1" is light yellow.

All values in this matrix were randomly selected: each tagged item is eight draws from a box containing two tokens, one blue and one yellow (no tag and tag, respectively). So, not surprisingly, there's no visual evidence of a pattern here, but if there is one in your data, this technique, which is dead simple to implement, can help you find it.

I used R to generate and plot the simulated data, using only base graphics (no external packages or libraries):

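# create the matrix and populate it with random data:
# 200 tagged items (rows) x 8 tags (columns), each cell 0 (no tag) or 1 (tag)
A = matrix(sample(0:1, 200 * 8, replace=TRUE), nrow=200, ncol=8)

# now plot it; image() draws the rows of A along the x-axis,
# so the tagged items run left to right and each tag is one horizontal band
image(z=A, ann=FALSE, axes=FALSE, col=topo.colors(12))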


doug
Wow, great idea! Thank you! I'll try it and see how it looks.
dreeves