ansaurus

Question

What's the best way to determine the total number of words of a file in Java?

Answer 1

A:

I'd initialize an int word_count to 1, then loop through each character in the file and increment word_count for every whitespace character (a space, tab, or newline) unless the previous character was also whitespace. Note that this overcounts when the file is empty or begins or ends with whitespace (for example, a trailing newline).
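A minimal sketch of this character-by-character approach, with one change that is an assumption rather than part of the answer: it starts at 0 and counts each transition from whitespace into a word, so empty input and leading or trailing whitespace do not affect the total. The class and method names are made up for illustration, and "whitespace" here means whatever Character.isWhitespace accepts.

```java
import java.io.IOException;
import java.io.Reader;

public class CharWordCounter {
    // Counts words by reading one character at a time and counting
    // every transition from whitespace (or start of input) into a word.
    static int countWords(Reader in) throws IOException {
        int count = 0;
        boolean inWord = false;
        int c;
        while ((c = in.read()) != -1) {
            if (Character.isWhitespace(c)) {
                inWord = false;          // we are between words
            } else if (!inWord) {
                inWord = true;           // first character of a new word
                count++;
            }
        }
        return count;
    }
}
```

For a file, the Reader could come from something like Files.newBufferedReader(Paths.get("my-text-file.txt")); a buffered reader matters here because the loop reads a single character per call.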

yjerem 2008-11-10 05:56:33

Answer 2

A:

Making some assumptions about what defines a 'word', one solution would be to open the file using a text stream reader and scan it, counting the runs of whitespace (treating contiguous whitespace characters as one), plus one for the end. For example:

 this is some sample text
 this is some more sample text

The text above would have 11 words in it, counted as 9 spaces, 1 newline, and 1 end-of-file.

Steven A. Lowe 2008-11-10 05:58:32

Answer 3

+2 A:

While Perl can do this, I'd consider it overkill to link it in or call it for this kind of task (unless you already have that tested out).

My suggestion would be to look for, and learn from, code on the web that does what you need, e.g. here: http://schmidt.devlib.org/java/word-count.html

lexu 2008-11-10 06:01:19

Answer 4

+14 A:

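import java.io.File;
import java.util.Scanner;

// Scanner splits on whitespace by default, so each token counts as one word.
int count = 0;
try (Scanner sc = new Scanner(new File("my-text-file.txt"))) {
   while (sc.hasNext()) {
      ++count;
      sc.next();
   }
}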

Itay 2008-11-10 06:20:21

Answer 5

+8 A:

Congratulations, you have stumbled upon one of the biggest linguistic problems: what is a word? It is said that "word" is the only word that actually means what it is. There is an entire field of linguistics devoted to words and units of meaning: morphology.

I assume that your question pertains to counting words in English. However, creating a language-neutral word counter/parser is next to impossible due to linguistic differences. For example, one might think that just processing the groups of characters separated by white space is sufficient. However, if you look at the following example in Japanese, you will see that that approach does not work:

これは日本語の例文です。

This example contains 3 distinct words, and none of them are separated by spaces. Typically, Japanese word boundaries are parsed using a dictionary-based approach, and there are a number of commercial libraries available for this. We are lucky to have spaces in English! I believe that Indic languages, Chinese and Korean also have similar problems.

If this solution is actually going to be deployed anywhere that multilingual input is possible, it will be important to be able to plug in different word-counting methods depending on the language being parsed.
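As a rough sketch of a locale-dependent counter, the JDK's java.text.BreakIterator can find word boundaries for a given Locale. The class name and the rule "a segment counts as a word if it contains a letter or digit" are assumptions for illustration. How well this segments Japanese or other unspaced scripts depends on the JDK's built-in rules, so a dedicated dictionary-based library may still be needed.

```java
import java.text.BreakIterator;
import java.util.Locale;

public class LocaleWordCounter {
    // Counts the segments between word boundaries that contain at least
    // one letter or digit; whitespace and punctuation segments are skipped.
    static int countWords(String text, Locale locale) {
        BreakIterator it = BreakIterator.getWordInstance(locale);
        it.setText(text);
        int count = 0;
        int start = it.first();
        for (int end = it.next(); end != BreakIterator.DONE; start = end, end = it.next()) {
            boolean hasWordChar = text.substring(start, end)
                    .codePoints()
                    .anyMatch(Character::isLetterOrDigit);
            if (hasWordChar) {
                count++;
            }
        }
        return count;
    }
}
```

Passing the Locale as a parameter is what makes the counting method pluggable: the caller picks Locale.ENGLISH, Locale.JAPANESE, and so on, and the same method could later be swapped for a language-specific tokenizer behind the same signature.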

I think the Scanner-based answer was a good answer because it uses Java's knowledge of Unicode whitespace values as delimiters. By default, Scanner tokenizes by matching the following regex as its delimiter: \p{javaWhitespace}+

Elijah 2008-11-10 09:32:43

Be careful using \p{javaWhitespace} in Java, because it does not correspond to the Unicode \p{Space} property such as you might find in Perl. Both cover code points 0009, 000A, 000B, 000C, 000D, and 0020. Java whitespace also includes 001C, 001D, 001E, and 001F, which are not Unicode whitespace. Java whitespace ignores several Unicode whitespace code points, including 0085, 00A0, 2007, and 202F, of which the most egregious is 00A0, NO-BREAK SPACE. This has gotten me into trouble before, so be very careful.
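To illustrate the difference, here is a small sketch that counts tokens with Scanner using either its default delimiter or a delimiter built from the Unicode White_Space property, which java.util.regex.Pattern supports as \p{IsWhite_Space} from Java 7 on. The class and method names are made up for the example.

```java
import java.util.Scanner;
import java.util.regex.Pattern;

public class UnicodeWhitespaceCount {
    // Delimiter based on the Unicode White_Space property (Java 7+).
    static final Pattern UNICODE_SPACE = Pattern.compile("\\p{IsWhite_Space}+");

    // Counts tokens in text; a null delimiter keeps Scanner's default,
    // which is \p{javaWhitespace}+.
    static int countWords(String text, Pattern delimiter) {
        int count = 0;
        try (Scanner sc = new Scanner(text)) {
            if (delimiter != null) {
                sc.useDelimiter(delimiter);
            }
            while (sc.hasNext()) {
                sc.next();
                count++;
            }
        }
        return count;
    }
}
```

With the input "one\u00A0two three", the default delimiter treats "one\u00A0two" as a single token because NO-BREAK SPACE is not Java whitespace, while UNICODE_SPACE splits it into separate words.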

tchrist 2010-10-30 05:44:03

Answer 6

+1 A:

If you are on a Unix system, wc -w filename will do the trick.

Leon Timmermans 2008-11-10 16:59:01
