views: 995
answers: 11

Recently in a technical interview, I was asked to write a program to find the high-frequency words (words which appear the maximum number of times) in a text book. The program should be designed in such a way that it processes the entire text book with minimum memory. Performance is not a concern. I was able to write a program to find the frequency of words, but it consumed a lot of memory.

How do you make this operation less memory intensive? Any strategies/solutions?

-Snehal

+1  A: 

One way would be to sort the list first.

We can sort the words in place without using a lot of memory (trading speed for memory).

Then a simple counting loop can find the words with maximum frequency without having to keep everything in memory, since they're already in sorted order.
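
A rough F# sketch of that counting pass (assuming sortedWords is a stream of the book's words already in sorted order, e.g. produced by an external on-disk sort - the name and the streaming source are assumptions here):

// Sketch only: sortedWords is assumed to stream the book's words in sorted
// order (e.g. from an external, on-disk sort), so only the current run and
// the best run so far are held in memory.
let mostFrequent (sortedWords: seq<string>) =
    let mutable best = ("", 0)        // best (word, count) run seen so far
    let mutable current = ("", 0)     // run currently being counted
    for w in sortedWords do
        let (cw, cc) = current
        if w = cw then current <- (cw, cc + 1)
        else
            if cc > snd best then best <- current
            current <- (w, 1)
    if snd current > snd best then best <- current
    best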

chakrit
But you also need to use a very effective sorting algorithm.
Lucas McCoy
"performance is not a concern" ?
chakrit
Heapsort would work pretty well.
rlbond
This only raises more questions on sorting that you would then have to answer or implement. The gist of the question is to describe the algorithm.
aleemb
Is this an 'offline' sort (e.g. external sort on disk)? If not, I think the assumption that the list is in memory is poor.
Brian
I didn't assume that the list was "in memory" - the question already states that there won't be enough memory to store the whole book. The point is, you can do a simple "swap" operation directly on disk with minimal memory. That should enable you to do a sort.
chakrit
+2  A: 

Do you mean a lot of process memory? If so, one way would be to use the disk as virtual memory (aka write a filesystem wrapper).

dirkgently
I like this answer since it 'probes' at what 'memory' really means in the context of this question and demonstrates some knowledge.
Brian
Do you have any examples for using a file system wrapper?
Snehal
You need to write a wrapper to which you talk, rather than write to arrays on the stack/heap. This wrapper writes back to an in-memory buffer and/or periodically flushes the buffer contents to disk. So you only have a fixed amount of process memory usage at any time.
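
To make that concrete, here is a minimal sketch of what such a wrapper could look like (the type name, the spill file and the buffer limit are hypothetical, and merging the spilled partial counts afterwards, e.g. via an external sort, is left out):

open System.Collections.Generic
open System.IO

// Hypothetical sketch: counts are buffered in memory and spilled to disk as
// tab-separated (word, partial count) lines whenever the buffer fills up,
// so process memory stays bounded by maxBufferedWords.
type SpillingCounter(spillPath: string, maxBufferedWords: int) =
    let buffer = Dictionary<string, int>()
    member this.Add(word: string) =
        match buffer.TryGetValue word with
        | true, n -> buffer.[word] <- n + 1
        | _       -> buffer.[word] <- 1
        if buffer.Count >= maxBufferedWords then
            use writer = File.AppendText(spillPath)
            for KeyValue(k, v) in buffer do
                writer.WriteLine(sprintf "%s\t%d" k v)
            buffer.Clear()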
dirkgently
A: 

We are missing some definitions.

Please define precisely what counts as a high frequency. Please define what your memory restrictions are.

Why the downvote? I know what a frequency is; I just needed it quantified, because saying "high" is not enough to understand the exact restriction.
Why do you need it quantified? Just assume that he needs the top 3 most used words in the book. What's the difference? At that point he would be able to write code to get the top 5 most used or whatever he wanted.
Pete
Also, I didn't vote you down =P
Pete
+1 How little memory? Enough to hold the n top words? The complete file?
Stephan Eggermont
+3  A: 

If performance is really of no concern you could just go through each word in turn, check if it's in your "top N" and, if it isn't, count all its occurrences. This way you're only storing N values. Of course, you'd be counting the same words many times, but, as you said, performance isn't an issue - and the code would be trivial (which is generally preferable - all other things being equal).
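
A minimal F# sketch of that idea (readWords is a hypothetical function that re-streams the words from disk each time it is called, so only the running top-N list is held in memory):

// Sketch: re-reads the text for every word that is not already in the
// running list. Memory is O(n); time is roughly quadratic in the word count.
let topN (n: int) (readWords: unit -> seq<string>) =
    let mutable best : (string * int) list = []
    for word in readWords () do
        if not (best |> List.exists (fun (w, _) -> w = word)) then
            let count = readWords () |> Seq.filter ((=) word) |> Seq.length
            let updated =
                ((word, count) :: best)
                |> List.sortByDescending snd
                |> List.truncate n
            best <- updated
    best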

dommer
+1. Right, read through the same file over and over, keeping a trivial amount in memory at once, looking for that word.
Jason Cohen
this just says the same thing I did an hour before
aleemb
+4  A: 

You probably used hash tables, which are memory-intensive but have constant lookup time, so the performance/memory trade-off is obvious. By the time you reach the end of the book you will know your answer. Also, incrementing counters for each word is fast (because of the quick hashtable lookups).

The other end of the spectrum is to look at the first word, then go through the entire book to see how many times that word occurs. This requires minimal memory. Then you do the same for the next word and go through the entire book. If that word occurs more times, you add that as the top word (or top N words). Of course, this is extremely inefficient--if the first and third word are the same you'll end up going through the whole book again even though you just did the same thing for the first word.
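
For comparison, the hash-table end of the spectrum is a one-pass sketch like this (memory grows with the number of distinct words):

open System.Collections.Generic

// One pass, fast constant-time updates, but O(distinct words) memory.
let countAll (words: seq<string>) =
    let counts = Dictionary<string, int>()
    for w in words do
        match counts.TryGetValue w with
        | true, n -> counts.[w] <- n + 1
        | _       -> counts.[w] <- 1
    counts
    |> Seq.sortByDescending (fun kv -> kv.Value)
    |> Seq.truncate 5                 // top 5, say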

aleemb
+2  A: 

A possible solution is to use a trie data structure for storing all words associated to their number of occurrences.

Other solutions may be found in answers to this related question: Space-Efficient Data Structure for Storing a Word List?
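
One possible shape of such a trie in F# (a sketch only: shared prefixes are stored once, and the final traversal that collects the highest counts is not shown):

open System.Collections.Generic

// Count-carrying trie: each node holds its children by character and the
// number of times a word ending at this node has been seen.
type TrieNode =
    { Children: Dictionary<char, TrieNode>
      mutable Count: int }

let newNode () = { Children = Dictionary<char, TrieNode>(); Count = 0 }

let insert (root: TrieNode) (word: string) =
    let mutable node = root
    for c in word do
        match node.Children.TryGetValue c with
        | true, child -> node <- child
        | _ ->
            let child = newNode ()
            node.Children.[c] <- child
            node <- child
    node.Count <- node.Count + 1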

mouviciel
+2  A: 

OK, if you're only interested in the highest n occurring words, one way to do it is in two passes, with the first pass based on a modified Bloom Filter. Instead of using a bit map to track hash occurrences, use an integer array instead - either byte, 16 bit, 32 bit or even 64 bit depending on your input size. Where a Bloom filter simply sets the bit corresponding to each of the hash values of a word, you'll increment the count at the hash index in the array.

The problem with this approach is that two words will probably give the same hash values. So you need to do a second pass where you ignore words unless their hash totals are above a certain threshold, thus reducing the amount of memory you need to allocate to do accurate counting.

So just create a bit map with bits set for the highest occurring hash values. Then in the second pass of the words, if a word has "hits" in the bitmap for its hashes, look it up or add it to a hash table and increment its count. This minimises memory usage by creating a hash table of only the highest occurring words.
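
A rough sketch of the first pass (the array size, the two hash functions and the threshold handling are illustrative assumptions, not a tuned implementation):

// First pass: bump a fixed-size counter array at each of a word's hash positions.
let size = 1 <<< 20                             // about one million counters
let counters : int[] = Array.zeroCreate size

let hashes (word: string) =
    let h1 = (word.GetHashCode() &&& 0x7fffffff) % size
    let h2 = ((word + "#").GetHashCode() &&& 0x7fffffff) % size   // crude second hash
    [| h1; h2 |]

let firstPass (words: seq<string>) =
    for w in words do
        for h in hashes w do
            counters.[h] <- counters.[h] + 1

// Second pass (not shown): a word is worth counting exactly only if the
// counters at *all* of its hash positions exceed the chosen threshold.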

Mike Scott
I like this as a good compromise between space and time
Mark
+3  A: 

I'm a physicist, so my favourite approach is to approximate. You don't need to go through the entire text to get the most frequent words. Instead:

  • parse a chunk small enough to allow for your memory limitations,
  • skip a random amount of text,
  • repeat, combining the accumulated results,
  • stop when the list has satisfactorily converged.

If you use a memory-efficient algorithm for the smaller chunks (e.g. sorting) then you can get far faster performance than even the most efficient algorithm that reads every word.

Note: This does make the assumption that the most frequent words occur frequently throughout the text, not just at one place in the text. For English text, this assumption holds because of the frequency of words like 'the' throughout. If you're worried about this requirement, require the algorithm to complete at least one pass of the entire text.
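
An illustrative sketch of the sampling loop (readChunk is a hypothetical function returning the words found in a fixed-size chunk starting at a given file offset, and the convergence test is reduced here to a fixed number of chunks):

open System.Collections.Generic

// Sample nChunks chunks at random offsets, count words per chunk in a small
// table, and merge the counts; the result approximates the true top words.
let sampleCounts (fileLength: int64) (chunkSize: int) (nChunks: int)
                 (readChunk: int64 -> seq<string>) =
    let rng = System.Random()
    let totals = Dictionary<string, int>()
    for _ in 1 .. nChunks do
        let offset = int64 (rng.NextDouble() * float (fileLength - int64 chunkSize))
        for w in readChunk offset do
            match totals.TryGetValue w with
            | true, n -> totals.[w] <- n + 1
            | _       -> totals.[w] <- 1
    totals
    |> Seq.sortByDescending (fun kv -> kv.Value)
    |> Seq.truncate 5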

Phil H
+3  A: 

I'll probably get down-voted for this...

If the text is English and you just want to find the top 5 most frequent words, here is your program:

print "1. the\n";
print "2. of\n";
print "3. and\n";
print "4. a\n";
print "5. to\n";

Runs fast and consumes minimal memory!

Barry Brown
+1 for clever. :-)
Jason Cohen
superb static answer :)
lalitm
+2  A: 

Like many good interview questions, the question is phrased a little ambiguously/imprecisely, to force the interviewee to ask clarifying questions and state assumptions. I think a number of the other answers here are good, as they poke at these assumptions and demonstrate big-picture understanding.

I'm assuming the text is stored 'offline' somewhere, but there is a way to iterate over each word in the text without loading the whole text into memory.

The F# code below then finds the top N words. Its only data structure is a mapping of key-value pairs (word, frequency), and it only keeps the top N of those, so the memory use is O(N), which is small. The runtime is O(numWordsInText^2), which is poor, but acceptable given the problem constraints. The gist of the algorithm is straightforward: for each word in the text, count how many times it occurs, and if it belongs in the running best N, add it to the list and remove the previous minimum entry.

Note that the actual program below loads the entire text into memory, merely for convenience of exposition.

#light
// some boilerplate to grab a big piece of text off the web for testing
open System.IO 
open System.Net 
let HttpGet (url: string) = 
    let req = System.Net.WebRequest.Create(url) 
    let resp = req.GetResponse() 
    let stream = resp.GetResponseStream() 
    let reader = new StreamReader(stream) 
    let data = reader.ReadToEnd() 
    resp.Close() 
    data 
let text = HttpGet "http://www-static.cc.gatech.edu/classes/cs2360_98_summer/hw1"
let words = text.Split([|' ';'\r';'\n'|], System.StringSplitOptions.RemoveEmptyEntries)
// perhaps 'words' isn't actually stored in memory, but so long as we can 
// 'foreach' over all the words in the text we're good
let N = 5  // how many 'top frequency' words we want to find
let FindMin map =
    // key-value pair with minimum value in a map
    let (Some(seed)) = Map.first (fun k v -> Some(k,v)) map
    map |> Map.fold_left 
        (fun (mk,mv) k v -> if v > mv then (mk,mv) else (k,v)) 
        seed
let Main() =
    let mutable freqCounts = Map.of_list [ ("",0) ]
    for word in words do
        let mutable count = 0
        for x in words do
            if x = word then
                count <- count + 1
        let minStr,minCount = FindMin freqCounts
        if count >= minCount then
            freqCounts <- Map.add word count freqCounts
        if Seq.length freqCounts > N then
            freqCounts <- Map.remove minStr freqCounts
    freqCounts 
    |> Seq.sort_by (fun (KeyValue(k,v)) -> -v) 
    |> Seq.iter (printfn "%A")
Main()

Output:

[the, 75]
[to, 41]
[in, 34]
[a, 32]
[of, 29]
Brian
A: 

Well, if you want absolutely terrible performance...

Take the first word in the book, and count how many times it occurs. Take the second word in the book, and count how many times it occurs. If it occurs more often than the first word, discard the first word. And so forth... you'll end up counting the same words multiple times unless you keep a list of them somewhere, but if you really want to minimize memory, this should only require a few ints. It should run in O(n^2) time, where n is the number of words in the book.

Mark