So I have about 4,000 Word docs that I'm attempting to extract the text from and insert into a db table. This works swimmingly until the processor encounters a document that has the *.doc file extension but is actually an RTF file. Now, I know POI doesn't support RTF, which is fine, but I do need a way to determine whether a *.doc file is actually an RTF so that I can skip the file and continue processing. I've tried several techniques to overcome this, including ColdFusion's MimeTypeUtils; however, it seems to base its guess at the MIME type on the file extension and still classifies the RTF as application/msword. Is there any other way to determine whether a *.doc is an RTF? Any help would be hugely appreciated.

Thanks in advance, --Anne

+6  A: 

The first five bytes in any RTF file should be:

{\rtf

If they aren't, it's not an RTF file.
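As a rough illustration of that check, here is a minimal Java sketch (Java because POI is a Java library, and ColdFusion can create Java objects). The class and method names are made up for this example, and it assumes Java 9 or later for `readNBytes`. It compares raw bytes, so no character decoding is involved.

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Arrays;

// Hypothetical helper, not part of POI or ColdFusion.
final class RtfCheck {
    // The five bytes {\rtf that should open any RTF file.
    private static final byte[] RTF_PREFIX =
            "{\\rtf".getBytes(StandardCharsets.US_ASCII);

    /** Returns true when the given bytes begin with the RTF prefix. */
    static boolean startsWithRtfPrefix(byte[] head) {
        return head.length >= RTF_PREFIX.length
                && Arrays.equals(Arrays.copyOf(head, RTF_PREFIX.length), RTF_PREFIX);
    }

    /** Reads only the first five bytes of the file and checks them. */
    static boolean looksLikeRtf(Path file) throws IOException {
        try (InputStream in = Files.newInputStream(file)) {
            byte[] head = new byte[RTF_PREFIX.length];
            int read = in.readNBytes(head, 0, head.length); // Java 9+
            return startsWithRtfPrefix(Arrays.copyOf(head, read));
        }
    }
}
```

A batch job could call `RtfCheck.looksLikeRtf(path)` before handing a *.doc file to the extractor and skip the file when it returns true.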

The external links section in the Wikipedia article links to the specifications for the various versions of RTF.

Doc files (at least those since Word '97) use something called "Windows Compound Binary Format", documented in a PDF here. According to that, these Doc files start with the following sequence:

0xd0, 0xcf, 0x11, 0xe0, 0xa1, 0xb1, 0x1a, 0xe1

Or in older beta files:

0x0e, 0x11, 0xfc, 0x0d, 0xd0, 0xcf, 0x11, 0xe0

According to the Wikipedia article on Word, there were at least 5 different formats prior to '97.

Looking for {\rtf should be your best bet.

Good luck, hope this helps.

MBCook
I did notice in some of the POI code that a PushbackInputStream is instantiated that pulls a byteArray of the first 6 bytes. I attempted the same thing on the ColdFusion side and was able to successfully get the byteArray, but I've gotten stuck trying to figure out how to convert the byteArray to a string that's readable by CF so I can check for {\rtf. Instead, all I can get are numbers. Any ideas?
Anne Porosoff
Can you just do a standard FileRead on it?
Peter Boughton
A: 

You could try identifying the files with the Droid tool (Digital Record Object Identification), which provides access to the Pronom technical registry.

Fabian Steeg
+4  A: 

With CF8 and compatible:

<cffunction name="IsRtfFile" returntype="Boolean" output="false">
    <cfargument name="FileName" type="String" />
    <cfreturn Left(FileRead(Arguments.FileName),5) EQ '{\rtf' />
</cffunction>


For earlier versions:

<cffunction name="IsRtfFile" returntype="Boolean" output="false">
    <cfargument name="FileName" type="String" />
    <cfset var FileData = 0 />
    <cffile variable="FileData" action="read" file="#Arguments.FileName#" />
    <cfreturn Left(FileData,5) EQ '{\rtf' />
</cffunction>


Update: A better CF8/compatible answer. To avoid loading the whole file into memory, you can do the following to load just the first few characters:

<cffunction name="IsRtfFile" returntype="Boolean" output="false">
    <cfargument name="FileName" type="String" />
    <cfset var FileData = 0 />

    <cfloop index="FileData" file="#Arguments.FileName#" characters="5">
     <cfbreak/>
    </cfloop>

    <cfreturn FileData EQ '{\rtf' />
</cffunction>


Based on the comments:
Here's a very quick way you might write a generic "what format is this" type of function. Not perfect, but it gives you the idea...

<cffunction name="determineFileFormat" returntype="String" output="false"
    hint="Determines format of file based on header of the file's data."
    >
    <cfargument name="FileName" type="String"/>
    <cfset var FileData = 0 />
    <cfset var CurFormat = 0 />
    <cfset var MaxBytes = 8 />
    <cfset var Formats =
     { WordNew  : 'D0,CF,11,E0,A1,B1,1A,E1'
     , WordBeta : '0E,11,FC,0D,D0,CF,11,E0'
     , Rtf      : '7B,5C,72,74,66' <!--- {\rtf --->
     , Jpeg     : 'FF,D8'
     }/>

    <cfloop index="FileData" file="#Arguments.FileName#" characters="#MaxBytes#">
     <cfbreak/>
    </cfloop>

    <cfloop item="CurFormat" collection="#Formats#">
     <cfif Left( FileData , ListLen(Formats[CurFormat]) ) EQ convertToText(Formats[CurFormat]) >
      <cfreturn CurFormat />
     </cfif>
    </cfloop>

    <cfreturn "Unknown"/>
</cffunction>


<cffunction name="convertToText" returntype="String" output="false">
    <cfargument name="HexList" type="String" />
    <cfset var Result = "" />
    <cfset var CurItem = 0 />

    <cfloop index="CurItem" list="#Arguments.HexList#">
     <cfset Result &= Chr(InputBaseN(CurItem,16)) />
    </cfloop>

    <cfreturn Result />
</cffunction>

Of course, it's worth pointing out that none of this will work on 'headerless' formats, including many common text-based ones (CFM, CSS, JS, etc.).
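For comparison, the same table-driven idea can be sketched in Java working directly on bytes, which sidesteps any question of how byte values like 0xD0 are decoded into characters. This is only an illustration: the class and method names are invented, and the signatures are the ones quoted in this thread. The first signature marks the compound-file container rather than Word specifically, so other Office formats that use the same container would match it too.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical helper mirroring determineFileFormat above, but byte-based.
final class FormatSniffer {
    // LinkedHashMap keeps the signatures in a fixed, predictable check order.
    private static final Map<String, int[]> SIGNATURES = new LinkedHashMap<>();
    static {
        SIGNATURES.put("WordNew",  new int[] {0xD0, 0xCF, 0x11, 0xE0, 0xA1, 0xB1, 0x1A, 0xE1});
        SIGNATURES.put("WordBeta", new int[] {0x0E, 0x11, 0xFC, 0x0D, 0xD0, 0xCF, 0x11, 0xE0});
        SIGNATURES.put("Rtf",      new int[] {0x7B, 0x5C, 0x72, 0x74, 0x66}); // {\rtf
        SIGNATURES.put("Jpeg",     new int[] {0xFF, 0xD8});
    }

    /** Returns the name of the first matching signature, or "Unknown". */
    static String determineFormat(byte[] head) {
        for (Map.Entry<String, int[]> entry : SIGNATURES.entrySet()) {
            if (startsWith(head, entry.getValue())) {
                return entry.getKey();
            }
        }
        return "Unknown";
    }

    private static boolean startsWith(byte[] data, int[] signature) {
        if (data.length < signature.length) {
            return false;
        }
        for (int i = 0; i < signature.length; i++) {
            // Java bytes are signed, so mask to compare as unsigned values.
            if ((data[i] & 0xFF) != signature[i]) {
                return false;
            }
        }
        return true;
    }
}
```

Passing the first eight bytes of a file to `FormatSniffer.determineFormat` gives back one of the names in the table, or "Unknown" when nothing matches.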

Peter Boughton
This is perfect! Out of curiosity, could this technique potentially be used to detect the version of Word that the document was created with? I've run across another problem where POI is throwing a fit over a file that it thinks was created with Word 95. Alternatively, could I just forgo POI altogether and load the data pulled with FileRead() straight into the db? In the end, my purpose is simply to have the text of the doc available for searching, not for display.
Anne Porosoff
If you can identify the file marker sequences for the different versions, this technique could be used for multiple formats, since a lot of binary file formats start with up to 8 bytes that identify the format in this way.
Peter Boughton
For reading the whole files... well, using FileRead will treat files as text, so I don't know if it might corrupt a Word document. If it did, you could try FileReadBinary, but then I'm not sure whether it would be searchable as text in your database.
Peter Boughton
I tried just the straight-up FileRead() and FileReadBinary(), and in both cases I was able to get the readable text, except that it is prepended and appended with various junk. With that in mind, I may just go that route, since at least I do get some of the text, which would be enough for searching. Just not entirely ideal. Nevertheless, thanks for your help.
Anne Porosoff
+1  A: 

You can convert the byteArray to a string

<cfset str = createObject("java", "java.lang.String").init(bytes)>
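The one-argument String constructor decodes the bytes with the platform's default charset. If that ever matters, a charset can be named explicitly. Below is a small Java sketch of that variation. The class and method names are invented for illustration, and ISO-8859-1 is chosen as an assumption because it maps every byte to exactly one character.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical helper showing the byte-array-to-string step with a fixed charset.
final class HeaderText {
    /** Decodes header bytes one-to-one into characters (ISO-8859-1). */
    static String headerAsText(byte[] bytes) {
        return new String(bytes, StandardCharsets.ISO_8859_1);
    }

    /** Checks the decoded header for the RTF marker. */
    static boolean isRtfHeader(byte[] bytes) {
        return headerAsText(bytes).startsWith("{\\rtf");
    }
}
```

The equivalent from ColdFusion would presumably pass the charset name as a second argument to `init`, but that form is not tested here.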

You might also try the hasxxxHeader methods from POI's source. They determine if an input file is something POI can handle: OLE or OOXML. But I believe someone else suggested using a simple try/catch to skip problem files. Is there a reason you do not wish to do that? It would seem the simpler option.
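To illustrate the general idea behind such header checks (this is not POI's actual code), here is a minimal Java sketch that peeks at the first bytes through a PushbackInputStream and then pushes them back, so the same stream can still be passed on to a parser. The class name, method names, and the 8-byte header size are assumptions for the example.

```java
import java.io.IOException;
import java.io.InputStream;
import java.io.PushbackInputStream;
import java.util.Arrays;

// Hypothetical helper: inspect a stream's header without consuming it.
final class HeaderPeek {
    static final int HEADER_SIZE = 8;

    /** Wraps a stream with a pushback buffer big enough for the header. */
    static PushbackInputStream wrap(InputStream in) {
        return new PushbackInputStream(in, HEADER_SIZE);
    }

    /** Reads up to HEADER_SIZE bytes, then pushes them back onto the stream. */
    static byte[] peekHeader(PushbackInputStream in) throws IOException {
        byte[] buffer = new byte[HEADER_SIZE];
        int total = 0;
        while (total < HEADER_SIZE) {
            int n = in.read(buffer, total, HEADER_SIZE - total);
            if (n < 0) {
                break; // end of stream before a full header was read
            }
            total += n;
        }
        if (total > 0) {
            in.unread(buffer, 0, total); // make the bytes readable again
        }
        return Arrays.copyOf(buffer, total);
    }
}
```

The returned bytes can be compared against `{\rtf` or the signatures mentioned earlier in the thread before deciding whether to parse the stream or skip the file.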

Update: Peter's suggestion of using CF 8's file functions would also work:

<cfset input = FileOpen(pathToYourFile)>
<cfset bytes = FileRead(input , 8)>
<cfdump var="#bytes#">
<cfset FileClose(input)>
Ah, even better than the loop method. Should probably have an explicit FileClose(input) in there also though?
Peter Boughton
Yes, it should definitely have an explicit FileClose(..). I forgot to copy that line.