This is an interview question, and should concern about efficiency. How to calculate occurrences of specified word in a large text file? I can only think of indexOf() method in most programming languages, but I don't think that the correct answer.
The best way to identify a word occurrence, as opposed to that sequence of characters just occurring as a substring of a line of the file, is probably with a regex Pattern
compiled from \bword\b
-- the \b
are "word boundaries".
Once you have that Pattern
there's no direct method to count the number of occurrences in a line, so you'd need some benchmark to find out what's faster -- a split
(taking the length of the resulting array of strings minus one), not likely but possible, or making a Matcher
with the matcher
method of the pattern then looping on its find
method while counting (I'd bet on this one), or something else again. But detecting word-boundaries on your own is enough of a PITA that I tend to always use regular expressions for the task;-).
It's possible to squeeze some speed by reading (and counting word occurrences on) more than one line at a time -- say a MB at a time. But if you do so then you must take care about the last "partial" line in the megabyte-gulp, since a occurrence of the word might possibly be split between the end of that partial line and the start of the next gulp -- feasible, but the kind of optimization one does just under duress since it's so easy to introduce a bug;-).
If the text file is really large, indexOf() might not be a good idea because you would need to load the whole file into a string and therefore chew up memory. Given enough data and you would crash the program. I think you would need to look into the stream reading APIs to read the file in chunks that are more practical to scan with indexOf().
Read the file using buffered stream char-by-char into array until whitespace character encountered or group of them (spaces, tabs, new lines, ...), compare contents of that array with target word, increase counter if match, clear the array, return to reading.
Preallocate array of enough size and reuse it for reading, grow it if needed, do not allocate it on every iteration. Do not actually clear the array every time, just set its read counter to zero.
Also, you may combine reading of char and comparing it with target into single loop eliminating need for intermediate array. First variant is easily convertible into this one, just throw out the array and compare on the fly, you need only to know current char and its position in the word.
What you want is the Boyer-Moore algorithm. It is the most efficient known general method for this problem.