How to count the words in a document, get the result same as the result of MS OFFICE?
Without knowing your environment all I can tell you is that you would need to implement something like this:
- Take the entire document as a string.
- Split the string on whitespace.
- The number of items in the resulting sequence will be the number of words in the document.
Basic word splitting uses whitespace and punctuation (.,?!"'- etc - indeed any non-alphanumeric or character usually) characters to split the words.
Make sure you skip sequences of punctuation/whitespace instead of counting extra "words" between them.
You will have to decide whether numbers are "words" or not. And whether "$123,456.78" is one word or three.
You may also want to apply other rules - for example, if you are looking for words in source code, you may wish to treat +-=*/()&^%$ characters as "whitespace". If you have identifiers in camelCase or PascalCase styles, you may want to take the "words" you have found and check if they have uppercase characters in the middles or the words.
Fundamentally, it's an easy problem - you just have to decide what a "word" is. You can be as simple or as complicated as you like about it.
The best way to get the same word count as Office would be to use macros or automation to use MS Word to load the text and calculate the word count.
In theory you'd first have to define what you see as a word (see also Jason Williams' post). Then you open the document with whatever language you're planning to use for this. You translate the document from Microsoft's proprietary format to something nice and clean.
Then its simply a matter of counting the occurrences of the afore mentioned word definition.
The hard part here will be the parsing of the office document. Luckily for you, Microsoft has relceased their proprietary format specification!
Its a bit long winded, but perhaps you can find somebody who has done the hard work for you, or you can try doing it from scratch.
Alternatively, if you're willing to reveal what language you're planning on using and what operating system, things can be a lot easier (if you're on Windows and have Office installed, for example, you can use OLE plug-ins.)
Also, have a look at this blog post about that format of Office documents featuring some helpful information (courtesy of google)