Which pagecode was used to encode this DOC document?

tags:

character-encoding

Q:

Which pagecode was used to encode this DOC document?

Hello

I got a bunch of .DOC documents. I'm not even positive they are Word documents, but even if they are, I need to open and parse them with e.g. Python to extract information from them.

Problem is, I couldn't figure out how they were encoded: UltraEdit's Conversion function wouldn't correct the text no matter which encoding I tried. OpenOffice 3.2 also failed to display the contents correctly (it guessed Windows-1252).

Here's an example, hoping that someone knows what pagecode it is:

"lÕAssemblŽe gŽnŽrale" instead of "l'Assemblée générale"

Thank you for any tip.
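A note on the sample above: the pattern "Õ" for an apostrophe and "Ž" for "é" is what you typically get when Mac Roman bytes are displayed as Windows-1252. That is an inference from this one sample, not something confirmed for the whole set of files. A minimal Python sketch to test the guess, assuming the text has already been extracted from the file as a string (the helper name `repair_mojibake` is made up for illustration):

```python
def repair_mojibake(text, shown_as="cp1252", really="mac_roman"):
    """Undo a wrong decoding: turn the text back into the bytes it
    was read from, then decode those bytes with the suspected
    original encoding. Both codec names are assumptions to verify."""
    return text.encode(shown_as).decode(really)


garbled = "lÕAssemblŽe gŽnŽrale"
fixed = repair_mojibake(garbled)
print(fixed)  # the apostrophe comes back as a typographic one (U+2019)
```

If the files really are binary Word documents, this only helps after the text has been pulled out of them; running it on the raw file contents will not work. Characters that Windows-1252 cannot encode will raise a `UnicodeEncodeError`, which is itself a hint that the guess is wrong for that file.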

A:

Greenstone digital library (http://www.greenstone.org/) provides pretty good text extraction from Word documents, including encoding detection.

Stephen 2010-03-03 15:49:01

I should add that I'd only use Greenstone if "a bunch" means a significant number of documents.

Stephen 2010-03-03 20:07:26

A:

Running MS Word in server mode gives you a range of scripting options; I'm sure detecting the encoding will be possible.

Stephen 2010-03-03 20:08:58

Thanks for the pointers.

OverTheRainbow 2010-03-05 12:22:12
