This is an interview question, so efficiency is the concern: how do you count the occurrences of a specified word in a large text file? The only approach I can think of is the indexOf() method found in most programming languages, but I don't think that's the correct answer.

+2  A: 

The best way to identify a word occurrence, as opposed to that sequence of characters just occurring as a substring of a line of the file, is probably with a regex Pattern compiled from \bword\b -- the \b are "word boundaries".

Once you have that Pattern, there's no direct method to count the number of occurrences in a line, so you'd need some benchmark to find out what's faster: a split (taking the length of the resulting array of strings, minus one), which is possible but unlikely to win; or making a Matcher with the matcher method of the pattern and then looping on its find method while counting (I'd bet on this one); or something else again. But detecting word boundaries on your own is enough of a PITA that I tend to always use regular expressions for the task;-).
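A commenter below asks for code, so here is a minimal Java sketch of the Matcher-with-find approach described above. The class and method names are mine, not from the answer:

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class WordCount {
    // Count whole-word occurrences of `word` in `text` using \b word boundaries.
    // Pattern.quote keeps regex metacharacters in the word from being interpreted.
    static int countOccurrences(String text, String word) {
        Pattern p = Pattern.compile("\\b" + Pattern.quote(word) + "\\b");
        Matcher m = p.matcher(text);
        int count = 0;
        while (m.find()) {
            count++;
        }
        return count;
    }

    public static void main(String[] args) {
        // "theory" does not match: \b requires a whole word.
        System.out.println(countOccurrences("the theory, the mat; the end", "the")); // prints 3
    }
}
```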

It's possible to squeeze out some speed by reading (and counting word occurrences in) more than one line at a time -- say a MB at a time. But if you do so then you must take care with the last "partial" line in the megabyte-gulp, since an occurrence of the word might be split between the end of that partial line and the start of the next gulp -- feasible, but the kind of optimization one does only under duress, since it's so easy to introduce a bug;-).
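The gulp-at-a-time idea with the partial-line carry-over might be sketched like this in Java (my own naming and structure, offered only as an illustration of the hazard the answer describes):

```java
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class ChunkedCount {
    // Count whole-word matches while reading in large chunks, carrying the
    // trailing partial line forward so a word split across a chunk boundary
    // is not missed or double-counted.
    static int countInChunks(Reader in, String word, int chunkSize) throws IOException {
        Pattern p = Pattern.compile("\\b" + Pattern.quote(word) + "\\b");
        char[] buf = new char[chunkSize];
        StringBuilder carry = new StringBuilder();
        int count = 0;
        int n;
        while ((n = in.read(buf)) != -1) {
            carry.append(buf, 0, n);
            // Process only up to the last newline; keep the partial line.
            int cut = carry.lastIndexOf("\n");
            if (cut >= 0) {
                Matcher m = p.matcher(carry.substring(0, cut + 1));
                while (m.find()) count++;
                carry.delete(0, cut + 1);
            }
        }
        // Whatever remains is the final (unterminated) line.
        Matcher m = p.matcher(carry);
        while (m.find()) count++;
        return count;
    }

    public static void main(String[] args) throws IOException {
        // Tiny chunk size forces words to straddle chunk boundaries.
        Reader r = new StringReader("the fox\nthe hen\nthe end");
        System.out.println(countInChunks(r, "the", 4)); // prints 3
    }
}
```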

Alex Martelli
+1 for your answer, good idea, but some code would be nice as well :D
c0mrade
A: 

If the text file is really large, indexOf() might not be a good idea, because you would need to load the whole file into a string and therefore chew up memory. Given enough data, you would crash the program. I think you would need to look into the stream-reading APIs to read the file in chunks that are more practical to scan with indexOf().
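A minimal sketch of that streaming idea, assuming Java (names are mine). Note that plain indexOf() counts substring hits, so it would also count "the" inside "theme" -- which is exactly the weakness the regex answer above addresses:

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;

public class StreamingCount {
    // Count substring occurrences line by line, so the whole file
    // is never held in memory at once.
    static int countSubstrings(Reader in, String word) throws IOException {
        BufferedReader reader = new BufferedReader(in);
        int count = 0;
        String line;
        while ((line = reader.readLine()) != null) {
            int idx = 0;
            while ((idx = line.indexOf(word, idx)) != -1) {
                count++;
                idx += word.length();
            }
        }
        return count;
    }

    public static void main(String[] args) throws IOException {
        // "concatenate" contains "cat", so this counts 3, not 2.
        Reader r = new StringReader("cat concatenate cat");
        System.out.println(countSubstrings(r, "cat")); // prints 3
    }
}
```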

Derek Clarkson
A: 

Read the file through a buffered stream, char by char, into an array until a whitespace character (space, tab, newline, ...) or a run of them is encountered; compare the contents of that array with the target word, increase the counter on a match, clear the array, and return to reading.

Preallocate an array of sufficient size and reuse it across reads, growing it if needed rather than allocating a new one on every iteration. Don't actually clear the array every time; just reset its fill counter to zero.

Also, you can combine reading a char and comparing it with the target into a single loop, eliminating the need for the intermediate array. The first variant converts easily into this one: throw out the array and compare on the fly; you only need the current char and its position in the word.
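The first variant above might look like this in Java -- a sketch with my own naming, using a reusable token buffer that grows on demand and is "cleared" by resetting its length counter:

```java
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;

public class TokenCount {
    // Read char by char, accumulating each whitespace-delimited token in a
    // reusable buffer; bump the counter when the token equals the target word.
    static int countTokens(Reader in, String word) throws IOException {
        char[] token = new char[64];          // preallocated, grown only if needed
        int len = 0;
        int count = 0;
        int c;
        while ((c = in.read()) != -1) {
            if (Character.isWhitespace((char) c)) {
                if (len == word.length() && new String(token, 0, len).equals(word)) count++;
                len = 0;                      // "clear" by resetting the fill counter
            } else {
                if (len == token.length) {    // grow on demand
                    char[] bigger = new char[token.length * 2];
                    System.arraycopy(token, 0, bigger, 0, token.length);
                    token = bigger;
                }
                token[len++] = (char) c;
            }
        }
        // The file may end without trailing whitespace: check the last token.
        if (len == word.length() && new String(token, 0, len).equals(word)) count++;
        return count;
    }

    public static void main(String[] args) throws IOException {
        Reader r = new StringReader("the cat\tthe dog\nthe");
        System.out.println(countTokens(r, "the")); // prints 3
    }
}
```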

actual
He was asking about efficiency, not just getting the result.
Jagannath
Well, let's see -- write that algorithm in C and call it through JNI :) Anyway, what is so inefficient about my solution?
actual
+2  A: 

What you want is the Boyer-Moore algorithm. It is the most efficient known general method for this problem.
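The answer names Boyer-Moore but gives no code, so here is a sketch of the simplified Boyer-Moore-Horspool variant (bad-character rule only) in Java -- my own implementation, counting non-overlapping matches, not anything from the answer itself:

```java
import java.util.Arrays;

public class Horspool {
    // Boyer-Moore-Horspool: compares the pattern right to left and, on a
    // mismatch, shifts by up to pattern.length() using the bad-character table.
    static int countMatches(String text, String pattern) {
        int m = pattern.length(), n = text.length();
        if (m == 0 || n < m) return 0;
        // Shift table indexed by char value; default shift is the full pattern length.
        int[] shift = new int[Character.MAX_VALUE + 1];
        Arrays.fill(shift, m);
        for (int i = 0; i < m - 1; i++) {
            shift[pattern.charAt(i)] = m - 1 - i;
        }
        int count = 0, pos = 0;
        while (pos <= n - m) {
            int j = m - 1;
            while (j >= 0 && text.charAt(pos + j) == pattern.charAt(j)) j--;
            if (j < 0) {
                count++;
                pos += m;  // skip past the match (non-overlapping count)
            } else {
                pos += shift[text.charAt(pos + m - 1)];
            }
        }
        return count;
    }

    public static void main(String[] args) {
        System.out.println(countMatches("abracadabra", "abra")); // prints 2
    }
}
```

Note that, like indexOf(), this counts raw substring matches; to count whole words you would still need the word-boundary check from the accepted answer.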

RBarryYoung
Yeah, I didn't remember this algorithm until you mentioned it.
ZZcat