Hello

I have a bunch of .DOC documents. I'm not even sure they are Word documents, but even if they are, I need to open and parse them with e.g. Python to extract information from them.

The problem is that I can't figure out how they are encoded: UltraEdit's Conversion function didn't fix the text no matter which encoding I tried, and OpenOffice 3.2 also failed to display the contents correctly (guessing Windows-1252).

Here's an example, in the hope that someone recognizes the code page:

"lÕAssemblŽe gŽnŽrale" instead of "l'Assemblée générale"

Thanks for any tips.

A: 

The Greenstone digital library (http://www.greenstone.org/) provides pretty good text extraction from Word documents, including encoding detection.
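
For the sample string in the question, one guess can be checked directly: the garbling is what Mac OS Roman text looks like when it is read as Windows-1252 (byte 0x8E is "é" in Mac OS Roman but "Ž" in Windows-1252, and 0xD5 is "’" versus "Õ"). Below is a minimal Python 3 sketch of that check. The helper name `redecode` is made up for illustration, and it only tests the one sample, so it says nothing certain about the rest of the .DOC files.

```python
def redecode(text, wrong="cp1252", right="mac_roman"):
    """Undo a wrong decoding: recover the original bytes, then decode them again.

    `wrong` is the codec assumed to have been applied by mistake,
    `right` is the codec we guess the bytes were really written in.
    """
    raw = text.encode(wrong)   # back to the bytes as stored in the file
    return raw.decode(right)   # reinterpret them with the guessed codec


sample = "lÕAssemblŽe gŽnŽrale"
print(redecode(sample))  # prints: l’Assemblée générale
```

If the guess holds for the real files, the same idea applies to text pulled out of them: decode the raw bytes with `mac_roman` instead of `cp1252`.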

Stephen
I should add that I'd only use Greenstone if "a bunch" means a significant number of documents.
Stephen
A: 

Running MS Word in server mode gives you a range of scripting options. I'm sure detecting the encoding will be possible that way.

Stephen
Thanks for the pointers.
OverTheRainbow