Our content management system requires all Word documents to be stored with a specific extension (neither DOC nor DOCX). However, when serving a document to the user, we need to know whether it is a DOC or a DOCX file in order to send the right MIME type.

So, is there a way to programmatically find out whether a document is DOC or DOCX from its content?

+3  A: 

The ForensicsWiki documents many different file types. It describes the headers of both DOC and DOCX files, so you should be able to read the first bytes of a file and determine which kind it is.

According to the wiki, .doc files are OLE Compound Files and should begin with the following binary header:

d0 cf 11 e0 a1 b1 1a e1

In contrast, .docx files begin with the binary signature:

50 4b
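
As a rough illustration, the check on those two signatures could be sketched in Python as below. The function name `sniff_word_mime` is made up for this example, and the sketch assumes the stored files are only ever DOC or DOCX, since any ZIP archive starts with the same two bytes.

```python
# Leading bytes of an OLE Compound File (legacy .doc).
OLE_MAGIC = bytes.fromhex("d0cf11e0a1b11ae1")
# Leading bytes of a ZIP archive (.docx is a ZIP container).
ZIP_MAGIC = bytes.fromhex("504b")

DOC_MIME = "application/msword"
DOCX_MIME = (
    "application/vnd.openxmlformats-officedocument"
    ".wordprocessingml.document"
)


def sniff_word_mime(data: bytes):
    """Return the MIME type guessed from the leading bytes, or None.

    Hypothetical helper: it only looks at the signature, so any
    OLE file is reported as DOC and any ZIP file as DOCX.
    """
    if data.startswith(OLE_MAGIC):
        return DOC_MIME
    if data.startswith(ZIP_MAGIC):
        return DOCX_MIME
    return None
```

Only the first eight bytes of the stored file are needed, so the caller can read just that prefix before choosing the response header.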
samoz
+2  A: 

DOCX files are in ZIP format, in which the first two bytes are the letters PK (after ZIP's creator, Phil Katz).
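
Because every ZIP archive starts with PK, a stricter check can open the archive and look for the parts a Word document normally contains. The following is a minimal sketch, assuming the main document part sits at the conventional `word/document.xml` path; the function name `looks_like_docx` is hypothetical.

```python
import io
import zipfile


def looks_like_docx(data: bytes) -> bool:
    """Return True if data is a ZIP archive with the usual DOCX parts.

    Assumption: the main part is stored as word/document.xml, which is
    the conventional location but not the only one the format permits.
    """
    if not data.startswith(b"PK"):
        return False
    try:
        with zipfile.ZipFile(io.BytesIO(data)) as archive:
            names = set(archive.namelist())
    except zipfile.BadZipFile:
        # Starts with PK but is not a readable ZIP archive.
        return False
    return "[Content_Types].xml" in names and "word/document.xml" in names
```

This costs more than comparing two bytes, so it is mainly worth doing if other ZIP-based files (for example XLSX) could end up in the same store.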

RichieHindle
Thank you guys, seems to be quite clear and easy
Andriy