Given words and their frequencies and an area of screen real estate, what are good approaches to fitting a tag cloud to the space? The two variables I can think of to manipulate are:

  • Font sizes (both absolute and the gradient)
  • Number of words

Every approach I can think of requires iteration, such as setting an upper bound on the number of words and then binary-searching on font sizes until the words just fit the area. I'd rather have an analytical solution.
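For reference, the iterative approach described above might look like the following minimal sketch. Everything in it is an assumption made for illustration: the `fits` and `largest_fitting_size` helpers, the crude character-width estimate (`char_w`, `line_h`), the greedy line wrapping, and the square-root size gradient. A real client would measure text with its toolkit's font metrics instead, and the binary search presumes that a cloud which fits at one size also fits at every smaller size, which greedy wrapping only approximately guarantees.

```python
import math

def fits(words, sizes, width, height, char_w=0.6, line_h=1.2):
    """Greedy line-wrap check using a crude per-word width estimate."""
    x = 0.0      # horizontal cursor on the current row
    y = 0.0      # total height of completed rows
    row_h = 0.0  # height of the current row
    for word, size in zip(words, sizes):
        w = char_w * size * (len(word) + 1)  # +1 approximates a trailing space
        h = line_h * size
        if w > width:
            return False  # a single word is wider than the area
        if x + w > width:  # wrap to a new row
            y += row_h
            x, row_h = 0.0, 0.0
        x += w
        row_h = max(row_h, h)
    return y + row_h <= height

def largest_fitting_size(freqs, width, height, lo=6.0, hi=200.0, tol=0.5):
    """Binary-search the size of the most frequent word; None if nothing fits."""
    words = sorted(freqs)
    top = max(freqs.values())

    def sizes_for(max_size):
        # Assumed gradient: size proportional to sqrt(frequency).
        return [max_size * math.sqrt(freqs[w] / top) for w in words]

    if not fits(words, sizes_for(lo), width, height):
        return None
    while hi - lo > tol:
        mid = (lo + hi) / 2
        if fits(words, sizes_for(mid), width, height):
            lo = mid
        else:
            hi = mid
    return lo
```

Each probe is one pass over the words, so the whole search costs only a handful of layout passes per resize, which may be cheap enough that an analytical solution is not strictly needed.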

One complication of my situation is that the clouds are resizable, so the algorithm needs to be able to handle 100x100 pixels or 1000x1000 pixels reasonably well.

Edit: I should have said this is for a rich client application, not the web (hence the possibility of resizing). Also, I was hoping to hear some experience along the lines of "nobody ever looks at more than 100 words in a tag cloud, so don't bother displaying more".

A: 

This sounds like the knapsack problem, but inverted and with more variables. There is no trivial complete solution, but it is likely you will be able to find a heuristic algorithm that comes close to the optimal solution in most cases.

PS: You can only make this work reliably with font sizes measured in pixels. Font sizes measured in pixels are a Bad Thing (TM) in good web design.

Sparr
A: 

You could create a predetermined set of incidence ranges, which could then relate to a font size in your cloud. For example:

  • 0 - 100: 1 em
  • 101 - 500: 1.2 em
  • 501 - 1000: 1.4 em bold
  • 1001 - 1500: 1.8 em bold
  • 1501 - 2000: 2.0 em bold italic/underlined/flashing/whatever etc...

You could scale the cloud by adding a fixed offset to all the ranges based on the size of the container.
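A minimal sketch of that lookup, using the example ranges from the list above. The names `BUCKET_UPPER_BOUNDS`, `BUCKET_EM_SIZES`, and `em_size` are made up for illustration, the style hints (bold, italic) are left out, and the `offset` parameter is only one possible reading of "adding a fixed offset": here it shifts the em sizes, not the count thresholds.

```python
from bisect import bisect_left

# Upper bound of each incidence range and the matching font size in em,
# taken from the example list above.
BUCKET_UPPER_BOUNDS = [100, 500, 1000, 1500, 2000]
BUCKET_EM_SIZES = [1.0, 1.2, 1.4, 1.8, 2.0]

def em_size(count, offset=0.0):
    """Return the em size for a tag count; counts above 2000 use the last bucket."""
    i = min(bisect_left(BUCKET_UPPER_BOUNDS, count), len(BUCKET_EM_SIZES) - 1)
    # Assumption: the container-dependent offset is added to the em size.
    return BUCKET_EM_SIZES[i] + offset
```

For example, `em_size(101)` falls in the second range and `em_size(5000)` is clamped to the largest size.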

Dave Swersky
Is there any way to get the size of a container as measured in ems?
Sparr
+1  A: 

What we do in Software Cartographer is

  • have a maximum font size,
  • map Math.sqrt(term.frequency) to this range (since words are 2D areas),
  • only show the top 30 (or so) terms,
  • exclude any fine print, i.e. font sizes smaller than 6 pt,
  • sort the terms in the cloud alphabetically.
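The steps above can be sketched roughly as follows. This is not the Software Cartographer code: the function name `cloud_terms`, the default of 36 pt for the maximum size, and the linear scaling of the square root from zero up to the maximum are all assumptions, and frequencies are assumed to be positive.

```python
import math

def cloud_terms(freqs, max_pt=36.0, min_pt=6.0, top_n=30):
    """Return (term, font size) pairs for the cloud, sorted alphabetically."""
    # Keep only the top_n most frequent terms.
    top = sorted(freqs.items(), key=lambda kv: kv[1], reverse=True)[:top_n]
    if not top:
        return []
    # Map sqrt(frequency) so that the most frequent term gets max_pt.
    root_max = math.sqrt(top[0][1])
    sized = [(term, max_pt * math.sqrt(f) / root_max) for term, f in top]
    # Drop fine print, then sort the survivors alphabetically.
    visible = [(term, size) for term, size in sized if size >= min_pt]
    return sorted(visible, key=lambda ts: ts[0].lower())
```

With this scaling, a term that is a quarter as frequent as the top term is drawn at half the font size, so its area on screen is roughly a quarter as large.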

Alternatives

  • Instead of showing the top 30, choose the top k such that there are no scroll bars.
  • Instead of mapping the most frequent word to the max font size, use a global mapping such that word sizes are comparable between clouds (this depends on your use case).
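One hedged way to approximate the first alternative without running a full layout is an area budget: keep adding terms in order of frequency until their estimated area exceeds some fraction of the container. The function `top_k_that_fits`, the `fill` fraction, and the `char_w` and `line_h` estimates are all assumptions for illustration, font sizes are treated as if points and pixels were the same unit, and frequencies are assumed to be positive. A real implementation would confirm the result with an actual layout pass.

```python
import math

def top_k_that_fits(freqs, width, height, max_pt=36.0, fill=0.7,
                    char_w=0.6, line_h=1.2):
    """Estimate how many of the most frequent terms fit in width x height."""
    ranked = sorted(freqs.items(), key=lambda kv: kv[1], reverse=True)
    if not ranked:
        return 0
    root_max = math.sqrt(ranked[0][1])
    budget = fill * width * height  # usable area, allowing for wasted space
    used = 0.0
    k = 0
    for term, f in ranked:
        size = max_pt * math.sqrt(f) / root_max
        # Crude bounding box: estimated text width times line height.
        used += (char_w * size * len(term)) * (line_h * size)
        if used > budget:
            break
        k += 1
    return k
```

Because the budget grows with the container, the same word list yields only a few terms in a 100x100 area and many more in a 1000x1000 one.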

To the best of my knowledge, no empirical studies on term clouds are available (maybe Jonathan Feinberg, of Wordle fame, knows more in that regard).

Adrian