ansaurus

Question

Find largest cluster of a particular word in a block of text

Answer 1

+2 A:

you could do a word-by-word moving average (over the last 400 words), while keeping track of the maximum seen so far. Once you're done, your maximum tell you which 400 words to use.

Draemon 2009-07-17 16:25:37

Answer 2

+6 A:

I'm not sure how you know they're highlighted but here's a simple O(n) aproach that I'd try.

scan the words into a circular queue (max capacity of 400) and if they're highlighted then increment the counter, once you reach the queues capacity, dequeue words as is necesary to enqueue the next one. when you dequeue a highlighted word decrement the counter. keep track of the max that your counter reaches at all times and where this chunk of 400 words starts at the max.

not too elegant but fairly simple.

Victor 2009-07-17 16:29:50

+1 that's what i was just writing up, dangit!

Peter Recore 2009-07-17 16:39:24

+1 good answer!

viksit 2009-07-17 20:50:50

Answer 3

+1 A:

This isn't exactly what you asked for, but I've used something like this in the past, while searching for the words (charPos refers to the starting character position of the word). Note: the '/' operator does integer division, i.e. 4200/2000 = 2.

if hasKey(charPositionHashtable[charPos/2000]):
    charPositionHashtable[charPos/2000]) += 1
else:
    charPositionHashtable[charPos/2000]) = 1

After the search completes, charPositionHashtable has a bunch of key/value pairs containing an "index" to 2000-character chunks, and the count of found words in them. Take the max, and use the chunk corresponding to that index. This has the advantage of being better than O(n), I think (but I didn't do a lot of analysis on this).

sigint 2009-07-17 16:39:48

Answer 4

+1 A:

You have the indicies of the highlighted words.. I think the below is a good, fast approach as it doesn't require "finding" every word (to do the circular loop). To do this it uses a "chunk size" derived from a number of characters rather than words. You could then "round up" or "round down" to the nearest word ending and there you have your chunk.

The method to get how many highlighted indicies are within a "chunk size" ahead in your sample could be better done I think.

Pseudo

string GetHighestDensityChunk(){

// {chunk size} = 400 * average word length
// {possible start positions} = 0, highlighted indicies, and (sample - {chunk size})

int position
int bestPositionSoFar = 0
int maxHighLightedCountSoFar = 0


for each position in {possible start position}
{
    highlightedCount = GetNumberOfHighlightedWithinChunkSize(position)

    if(highlightedCount > maxHighLightedCountSoFar) 
    {
        maxHighLightedCountSoFar = highlightedCount
        bestPositionSoFar = position
    }
}

// "round up" to nearest word end
// gives index of next space after end of chunk starting from current best position
{revised chunk size} = sample.indexOf(' ', startingAt = bestPositionSoFar + {chunk size}) - bestPositionSoFar

return sample.substring(bestPositionSoFar, {revised chunk size})
}   


 int GetNumberOfHighlightedWithinChunkSize(position)
{
    numberOfHighlightedInRange = 0

    // starts from current position and scans forward counting highlighted indicies that are in range
    for(int i= {possible start position}.indexOf(position); i<= {possible start position}.length; i++){
        if({possible start position}[i] < position + {chunk size}){
            numberOfHighlightedInRange++;
        } else {
            break;
        }
    }
    return numberOfHighlightedInRange;
}

Arj 2009-07-17 17:08:14

ansaurus

tags:

views:

answers:

Find largest cluster of a particular word in a block of text

related questions