This is weird, so be patient while I try to explain.

Basic problem: I have a massive string whose length varies depending on the user. My job is to acquire this string for a given user, then send it off to another piece of software to make a tag cloud. If life were easy for me, I could simply send the whole thing. However, the tag cloud software will only accept a string that is at most 1000 words long, so I need to do some work on my string to send the most important words.

My first thought was to count each occurrence of the words, and throw all this into an array with each word's count, then sort.

array(517) (
    "We" => integer 4
    "Five" => integer 1
    "Ten's" => integer 1
    "best" => integer 2
    "climbing" => integer 3
     (etc...)
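
A minimal sketch of that counting step, in Python purely for illustration (the dump above looks like PHP). The sample text and the name `count_words` are made up here, and splitting on whitespace is an assumption about how words are delimited:

```python
from collections import Counter

def count_words(text):
    """Count occurrences of each whitespace-separated word."""
    return Counter(text.split())

# Hypothetical sample text, not the real user data.
sample = "We climb best climbing We best climbing climbing"
counts = count_words(sample)

# most_common() returns (word, count) pairs sorted by descending count.
ranked = counts.most_common()
```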

From here, I create a new string and spit out each word times its count. Once the total string hits 1000 words long, I stop. This creates a problem.

Let's say the word "apple" shows up 900 times, and the word "cat" shows up 100 times. The resulting word cloud would consist of only two words.
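
To make the failure concrete, here is a rough Python sketch of the fill-until-full approach described above. The function name `naive_fill` and the third word "dog" are assumptions added for illustration; only "apple" and "cat" come from the example:

```python
def naive_fill(counts, limit=1000):
    """Emit each word count-many times, most frequent first, until limit is hit."""
    out = []
    for word, n in sorted(counts.items(), key=lambda kv: kv[1], reverse=True):
        for _ in range(n):
            if len(out) == limit:
                return out
            out.append(word)
    return out

# "dog" is a hypothetical extra word: it never makes it into the output.
words = naive_fill({"apple": 900, "cat": 100, "dog": 50})
distinct = set(words)
```

With these numbers the 1000 slots are used up by "apple" and "cat" alone, so every less frequent word is shut out.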

My idea is to somehow spit out the words at some sort of ratio to the other words. My attempts so far have failed on different data sets where the ratio is not great -- especially when there are a lot of words at "1", thus making the GCD very low.

I figure this is a simple math problem I can't get my head around, so I turn to the oracle that is stackoverflow.

thanks in advance.

+2  A: 

Count all the words, then do this for each word in your array:

floor(count_of_the_word * (1000/number_of_total_words))

This gives you a maximum of 1000 words, with every word's count reduced by the same proportion.

So with 50 occurrences of cat, 100 of gozilla, 4000 of looser, 4000 of bush and 1000 of george (9150 words in total), you first get

array(
    cat[50]
    gozilla[100]
    looser[4000]
    bush[4000]
    george[1000]
)

After looping over the array and converting the numbers you get this:

array(
    cat[5]
    gozilla[10]
    looser[437]
    bush[437]
    george[109]
)

for a total of 998 words.
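
The calculation above can be sketched in Python (for illustration only; `scale_counts` and `limit` are names made up here). It uses integer floor division, which for positive integers gives the same result as floor(count * (1000 / total)) without any floating-point rounding:

```python
def scale_counts(counts, limit=1000):
    """Scale every count by limit/total, rounding down."""
    total = sum(counts.values())
    # n * limit // total is the floor of n * limit / total for positive ints.
    return {word: n * limit // total for word, n in counts.items()}

counts = {"cat": 50, "gozilla": 100, "looser": 4000, "bush": 4000, "george": 1000}
scaled = scale_counts(counts)
total_words = sum(scaled.values())
```

Because every count is rounded down, the sum can never exceed the limit, but it can fall a little short of it, as the 998 here shows.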

ITroubs
If you want to avoid losing words entirely, count how many words end up with a count of 0 after the transformation, subtract that number from the biggest count, and add 1 to each word that has a count of 0.
ITroubs
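
One possible reading of that comment, as a Python sketch. The name `rescue_zeros` and the sample counts are hypothetical, and it assumes the biggest count is larger than the number of zero-count words (otherwise the biggest count would go to zero or below):

```python
def rescue_zeros(scaled):
    """Give each zero-count word one slot, taken from the most frequent word."""
    fixed = dict(scaled)
    zeros = [word for word, n in fixed.items() if n == 0]
    if not zeros:
        return fixed
    # On ties, max() picks the first word with the highest count.
    biggest = max(fixed, key=fixed.get)
    fixed[biggest] -= len(zeros)
    for word in zeros:
        fixed[word] = 1
    return fixed

# Hypothetical scaled counts where two rare words were rounded down to 0.
scaled = {"bush": 437, "looser": 437, "george": 109, "cat": 0, "dog": 0}
fixed = rescue_zeros(scaled)
```

The total number of words stays the same; the slots are just moved from the most frequent word to the ones that would otherwise vanish from the cloud.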
beat me to the punch. +1
nathan gonzalez
thanks everybody!!
jmccartie