What's a reliable way to automatically count the characters and/or words in a .doc or .docx file?

The only real requirement is a reasonably accurate and reasonably reliable count.
It needs to work with documents containing something other than Latin script, so counting characters is good enough for most cases.
The count does not necessarily need to match Word's, but the closer the better.
Since there are a gazillion different apps that can generate .doc files, it's okay to fail to count anything, but this case needs to be catchable so we're aware that a count may be inaccurate. For all other cases the count must be, say, at least 99% accurate at least 99% of the time.

I'm open as to the involved technologies, but something that can run on a *NIX command line would be greatly preferred.

Is there a reasonable solution for this?

A: 

Microsoft has published a specification for the Office binary file formats. Parsing a .DOC file doesn't look trivial, but with some care you should be able to get a dependable, repeatable result. I have no idea how closely it'll match with what Word shows -- that will probably depend (at least partly) on how you define "word" -- for example, whether you consider a group of digits a "word" or not. It probably won't take a lot to figure out how Word treats cases like that, so getting a close match shouldn't be terribly difficult.

Jerry Coffin
I am not intending to write my own .doc parser, thank you very much. http://www.joelonsoftware.com/items/2008/02/19.html :-)
deceze
+3  A: 

Here's a link to some Linux word-to-text converters.

For example you could use

antiword file.doc | wc

to do the counting.
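
Since the question asks that a failed count be catchable, the pipeline above can be wrapped so that a conversion error is reported instead of silently producing a count of zero. This is only a minimal sketch, assuming antiword exits with a non-zero status when it cannot read a file; count_text and count_doc are hypothetical helper names, not part of antiword.

```shell
#!/bin/sh
# count_text (hypothetical helper): read plain text on stdin and print
# "<characters> <words>". wc -m counts characters according to the current
# locale, so a UTF-8 locale is needed for non-Latin scripts.
count_text() {
    text=$(cat)   # note: command substitution drops trailing newlines
    chars=$(printf '%s' "$text" | wc -m | tr -d '[:space:]')
    words=$(printf '%s' "$text" | wc -w | tr -d '[:space:]')
    echo "$chars $words"
}

# count_doc (hypothetical helper): convert a .doc with antiword, then count.
# Assumption: antiword returns a non-zero status on unreadable input.
count_doc() {
    # Capture the text first so antiword's exit status is not hidden by a pipe.
    doc_text=$(antiword "$1" 2>/dev/null) || {
        echo "count failed: $1" >&2
        return 1
    }
    printf '%s' "$doc_text" | count_text
}
```

Something like `count_doc file.doc || echo "needs a manual check"` would then flag files that could not be converted. Word counts from wc -w rely on whitespace, so for scripts that do not separate words with spaces the character count is probably the more useful figure.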

Edit:

This link shows that AbiWord has a command-line interface that you could use to convert a .docx file to .txt and then count the words using "wc". AbiWord does support the docx format.
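
A rough sketch of that conversion, assuming an AbiWord build whose command line accepts --to (target format) and --to-name (output file); docx_to_counts is a hypothetical helper name, and treating an empty result as a failure is a choice made for this example:

```shell
#!/bin/sh
# docx_to_counts (hypothetical helper): convert a .docx to plain text with
# AbiWord, then print "<characters> <words>". Returns non-zero if the
# conversion fails or yields no text, so the caller can flag the file.
docx_to_counts() {
    out=$(mktemp) || return 1
    # Assumption: --to selects the format and --to-name the output path.
    if abiword --to=txt --to-name="$out" "$1" >/dev/null 2>&1 && [ -s "$out" ]; then
        # wc -m is locale dependent; use a UTF-8 locale for non-Latin text.
        chars=$(wc -m < "$out" | tr -d '[:space:]')
        words=$(wc -w < "$out" | tr -d '[:space:]')
        rm -f "$out"
        echo "$chars $words"
    else
        rm -f "$out"
        echo "count failed: $1" >&2
        return 1
    fi
}
```

Note that a genuinely empty document is also reported as a failure here, which errs on the side of flagging a file for review.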

beny23
Antiword looks pretty good, thanks. Any solution for .docx files though?
deceze
AbiWord might be the way to go.
beny23
A: 

If an online application is acceptable, then yes, there is a solution.
This site, although not pretty design-wise, offers both word and character counts: http://allworldphone.com/count-words-characters.htm

I don't think there is a length limit, so it shouldn't be a problem to copy and paste the contents of your documents into the corresponding textarea and see the result.

Regarding the 100% or 99% accuracy, you could test it with a few short samples (e.g. 20-50 words each) by counting the words yourself first.

I hope this helps. Regards. Chris

cr0z3r
I think the main problem is getting the plain text out of the Word document without opening Word (which is what deceze would have to do to cut and paste the contents).
beny23
+1  A: 

Mac OS X has support for reading Word files built into the system frameworks, so if you have that, it's easy. MacRuby sample:

NSSpellChecker.sharedSpellChecker.countWordsInString(NSAttributedString.alloc.initWithURL(fileURL, documentAttributes:nil).string, language:nil)

More portably, though it gives up support for docx, you could simply get Antiword and do antiword file.doc | wc -w.

Chuck
That might be a solution, though probably hard to use in production (Linux server). I'll look into it though...
deceze