This is weird, so be patient while I try to explain.
Basic problem: I have a massive string whose length varies from user to user. My job is to fetch this string for a given user, then send it off to another piece of software to make a tag cloud. If life were easy, I could simply send the whole thing. However, the tag cloud software will only accept a string up to 1000 words long, so I need to do some work on my string to send the most important words.
My first thought was to count the occurrences of each word, put the words and their counts into an array, and sort it by count.
array(517) (
"We" => integer 4
"Five" => integer 1
"Ten's" => integer 1
"best" => integer 2
"climbing" => integer 3
(etc...)
From here, I create a new string and write out each word as many times as its count. Once the string reaches 1000 words, I stop. This creates a problem.
Let's say the word "apple" shows up 900 times, and the word "cat" shows up 100 times. The resulting word cloud would consist of only two words.
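To make the failure concrete, here is a rough Python sketch of the approach described above. The function name `naive_cloud` and the whitespace-based word splitting are placeholders for illustration, not my real code.

```python
from collections import Counter

def naive_cloud(text, limit=1000):
    """Repeat each word by its count, most frequent first, until `limit` words."""
    counts = Counter(text.split())          # assumption: words are whitespace-separated
    out = []
    for word, n in counts.most_common():    # sorted by count, descending
        room = limit - len(out)
        if room <= 0:
            break                           # the string is full; later words are dropped
        out.extend([word] * min(n, room))
    return " ".join(out)

# 900 x "apple", 100 x "cat", 50 x "dog": "dog" never makes it in.
sample = "apple " * 900 + "cat " * 100 + "dog " * 50
words = naive_cloud(sample).split()
print(len(words), sorted(set(words)))
```

With this input the output uses all 1000 slots on "apple" and "cat", which is exactly the two-word cloud I want to avoid.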
My idea is to somehow write out each word in proportion to the other words, so that the relative frequencies survive but the total stays within 1000. My attempts so far have failed on data sets where the ratios do not work out nicely -- especially when there are a lot of words with a count of 1, which makes the GCD very low.
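For what it's worth, this is roughly the kind of scaling I have in mind, sketched in Python. It is only one possible reading of the idea: the name `scale_counts` and the rule "reserve one slot per word, then share the remaining slots in proportion to the leftover counts" are my own assumptions, and it sidesteps the GCD entirely rather than fixing my GCD attempts.

```python
from collections import Counter

def scale_counts(counts, limit=1000):
    """Return {word: repetitions} whose total never exceeds `limit`."""
    if sum(counts.values()) <= limit:
        return dict(counts)                  # already fits, nothing to scale
    top = counts.most_common(limit)          # at most `limit` distinct words can fit
    spare = limit - len(top)                 # slots left after one slot per word
    extra = sum(n - 1 for _, n in top)       # occurrences beyond the first of each word
    # Integer floor division keeps the total at or below `limit`.
    return {w: 1 + ((n - 1) * spare // extra if extra else 0) for w, n in top}

counts = Counter({"apple": 900, "cat": 100})
counts.update(f"w{i}" for i in range(500))   # 500 made-up words with a count of 1
scaled = scale_counts(counts)
cloud = " ".join(w for w, n in scaled.items() for _ in range(n))
print(scaled["apple"], scaled["cat"], len(cloud.split()))
```

In this made-up data set every word keeps at least one slot, while "apple" and "cat" are shrunk in proportion. Because of the rounding down, the total can land slightly under 1000.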
I figure this is a simple math problem I can't get my head around, so I turn to the oracle that is Stack Overflow.
Thanks in advance.