views:

112

answers:

3

How can I count the words in a document and get the same result as MS Office?

A: 

Without knowing your environment, all I can tell you is that you would need to implement something like this:

  1. Take the entire document as a string.
  2. Split the string on whitespace.
  3. The number of items in the resulting sequence will be the number of words in the document.
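
The three steps above can be sketched in Python as follows. This is only a minimal illustration, assuming the document is already available as plain text; the function name `count_words` is made up for the example, and the result will not necessarily match what Office reports.

```python
def count_words(text: str) -> int:
    """Count the whitespace-separated items in text."""
    # str.split() with no arguments splits on any run of whitespace
    # (spaces, tabs, newlines) and never yields empty strings, so
    # repeated or leading/trailing whitespace does not inflate the count.
    return len(text.split())


if __name__ == "__main__":
    print(count_words("The  quick brown\tfox\njumps"))  # prints 5
```

Because the split is on runs of whitespace rather than on single spaces, the count does not depend on how many spaces separate two words.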
Andrew Hare
How do I count CJK words? There are no spaces between the words.
bruce dou
Does it make sense to use the term 'words' if there are no spaces to set them apart?
pavium
Yes, you can see the feature in Office. In CJK languages, one word is one character.
bruce dou
According to your algorithm, wouldn't it be easier just to count the spaces (or runs of consecutive spaces) and add 1? So the answer would be spaces_count + 1.
empi
No, Japanese is definitely a CJK language, and words written in Hiragana or Katakana use multiple characters per word. Korean is another CJK language, and words written in Hangul use multiple characters per word too. Heck, even "Beijing", obviously a Chinese word, is 北京, two characters. So which CJK language has the "one word = one character" rule, now that we've excluded Chinese, Japanese and Korean?
MSalters
@empi - That wouldn't work, as a document can contain any number of whitespace characters (in other words, you cannot guarantee that the words in the document are separated by exactly one space).
Andrew Hare
A: 

Basic word splitting uses whitespace and punctuation characters (.,?!"'- etc.; indeed, usually any non-alphanumeric character) to split the words.

Make sure you skip sequences of punctuation/whitespace instead of counting extra "words" between them.

You will have to decide whether numbers are "words" or not, and whether "$123,456.78" is one word or three.

You may also want to apply other rules - for example, if you are looking for words in source code, you may wish to treat +-=*/()&^%$ characters as "whitespace". If you have identifiers in camelCase or PascalCase styles, you may want to take the "words" you have found and check whether they have uppercase characters in the middle of the words.
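
As a rough sketch of these rules, the snippet below treats every non-alphanumeric character as a separator and then optionally breaks camelCase or PascalCase identifiers apart. The names `split_words` and `split_camel` are hypothetical, and the choices made here (numbers count as words, "$123,456.78" is three words, the camelCase pattern is ASCII-only) are assumptions for the example, not the rules Office uses.

```python
import re

# Assumption: a "word" is a run of letters and digits. [^\W_] means
# "a word character other than underscore", so punctuation, symbols,
# underscores and whitespace all act as separators.
WORD_RE = re.compile(r"[^\W_]+")

# Assumption: ASCII identifiers only. Matches, in order of preference,
# an acronym (capitals not followed by a lowercase letter), an
# optionally capitalised lowercase run, or a run of digits.
CAMEL_RE = re.compile(r"[A-Z]+(?![a-z])|[A-Z]?[a-z]+|\d+")


def split_words(text):
    """Return the runs of alphanumeric characters found in text."""
    # findall skips over whole sequences of separators, so no empty
    # "words" are produced between adjacent punctuation marks.
    return WORD_RE.findall(text)


def split_camel(identifier):
    """Split a camelCase or PascalCase identifier into its parts."""
    return CAMEL_RE.findall(identifier)


if __name__ == "__main__":
    print(split_words("Hello,  world!!! -- ok"))
    print(split_words("$123,456.78"))
    print(split_camel("parseHTMLDocument"))
```

Changing what counts as a word is then a matter of editing `WORD_RE`, for example adding the apostrophe to the allowed characters if "don't" should be one word rather than two.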

Fundamentally, it's an easy problem - you just have to decide what a "word" is. You can be as simple or as complicated as you like about it.
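
One example of a slightly more complicated definition is the CJK case raised in the comments on the first answer. The sketch below counts every Han ideograph and kana letter as a word of its own and every other run of letters and digits as one word. That rule follows the asker's description of what Office shows; it is an assumption here, not a documented algorithm, and the character ranges are deliberately narrow (Hangul, for instance, is left to the whitespace-style rule).

```python
import re

# Assumed "one character = one word" ranges (illustrative, not exhaustive):
#   U+4E00-U+9FFF  CJK Unified Ideographs
#   U+3041-U+3096  Hiragana letters
#   U+30A1-U+30FA  Katakana letters
CJK = "\u4e00-\u9fff\u3041-\u3096\u30a1-\u30fa"

# Either a single character from the ranges above, or a run of other
# letters/digits (underscore and punctuation act as separators).
TOKEN_RE = re.compile(rf"[{CJK}]|[^\W_{CJK}]+")


def count_mixed(text):
    """Count words in text that may mix CJK and space-separated scripts."""
    return len(TOKEN_RE.findall(text))


if __name__ == "__main__":
    print(count_mixed("北京 is a city"))  # prints 5
```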

The best way to get the same word count as Office would be to use macros or automation to use MS Word to load the text and calculate the word count.
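
A minimal sketch of the automation route, assuming Windows, an installed copy of MS Word, and the third-party `pywin32` package. `ComputeStatistics` and the `wdStatisticWords` constant (value 0) belong to Word's object model; the function name `word_count_via_word` is made up for this example.

```python
import os

# Numeric value of WdStatistic.wdStatisticWords in Word's object model.
WD_STATISTIC_WORDS = 0


def word_count_via_word(path):
    """Ask MS Word itself for the word count of the document at path."""
    # Imported here so this module can still be loaded on machines
    # without pywin32; the call only works on Windows with Word installed.
    import win32com.client

    word = win32com.client.Dispatch("Word.Application")
    word.Visible = False
    try:
        # Positional arguments: FileName, ConfirmConversions, ReadOnly.
        # Word expects an absolute path.
        doc = word.Documents.Open(os.path.abspath(path), False, True)
        try:
            return doc.ComputeStatistics(WD_STATISTIC_WORDS)
        finally:
            doc.Close(False)  # close without saving changes
    finally:
        word.Quit()
```

Since Word does the counting, the number should agree with what the Word Count dialog shows for the document body, at the cost of starting Word for every run.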

Jason Williams
A: 

In theory you'd first have to define what you see as a word (see also Jason Williams' post). Then you open the document with whatever language you're planning to use for this. You translate the document from Microsoft's proprietary format to something nice and clean.

Then it's simply a matter of counting the occurrences of the aforementioned word definition.

The hard part here will be the parsing of the Office document. Luckily for you, Microsoft has released their proprietary format specification!

It's a bit long-winded, but perhaps you can find somebody who has done the hard work for you, or you can try doing it from scratch.
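
For what it's worth, here is a rough from-scratch sketch for the newer `.docx` format only, which is a ZIP archive whose main text lives in `word/document.xml`. It does not handle the older binary `.doc` format. It uses only the Python standard library; the function names are made up for the example, and it ignores headers, footers, footnotes and text boxes, so the result may well differ from the count Word shows.

```python
import xml.etree.ElementTree as ET
import zipfile

# XML namespace of WordprocessingML elements such as <w:p> and <w:t>.
W_NS = "{http://schemas.openxmlformats.org/wordprocessingml/2006/main}"


def docx_paragraphs(path_or_file):
    """Yield the text of each paragraph in the main body of a .docx file."""
    with zipfile.ZipFile(path_or_file) as archive:
        root = ET.fromstring(archive.read("word/document.xml"))
    for para in root.iter(W_NS + "p"):
        # A paragraph's text is spread over one or more <w:t> runs.
        yield "".join(node.text or "" for node in para.iter(W_NS + "t"))


def docx_word_count(path_or_file):
    """Count whitespace-separated words, paragraph by paragraph."""
    return sum(len(text.split()) for text in docx_paragraphs(path_or_file))
```

Counting per paragraph keeps the last word of one paragraph from running into the first word of the next. The `split()` call can be swapped for whatever word definition you settled on.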

Alternatively, if you're willing to reveal what language you're planning on using and what operating system, things can be a lot easier (if you're on Windows and have Office installed, for example, you can use OLE plug-ins.)

Also, have a look at this blog post about the format of Office documents, which has some helpful information (courtesy of Google).

nash
Thanks for your advice. I have a set of algorithms, but the result is different from Office's. I will count all the languages.
bruce dou