views:

160

answers:

2

I have been able to copy the raw data from an otherwise inaccessible USB drive into a monolithic file of about 250MB. Somewhere in that blob of bytes are about 40 Word documents.

  1. Where do I find documentation about the internal structure of Word documents such that I can parse the byte-stream, recognise where a Word doc starts and finishes and extract a copy?

  2. Are there any libraries in any programming language specific to this task?

  3. Can anyone suggest an already existing software solution to this issue?

+2  A: 

The Apache POI project has a library for reading and writing all kinds of MS Office docs. If the files are in the new XML-based OOXML format (.docx), you'll be looking for the start of a zip file, since the XML is stored compressed inside a zip container.
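To make the "start of a zip file" idea concrete, here is a minimal Python sketch. It assumes the blob fits in memory and that each .docx begins with a zip entry named `[Content_Types].xml`, which is what Word usually writes first but is not guaranteed. The function names are made up for illustration.

```python
import struct

# Every entry in a zip archive starts with this local file header signature.
ZIP_LOCAL_HEADER = b"PK\x03\x04"


def zip_entry_names(blob):
    """Yield (offset, entry_name) for every zip local file header in blob."""
    pos = blob.find(ZIP_LOCAL_HEADER)
    while pos != -1:
        header = blob[pos:pos + 30]  # the fixed part of the header is 30 bytes
        if len(header) == 30:
            # The file name length is a little-endian uint16 at offset 26;
            # the name itself follows the fixed header.
            (name_len,) = struct.unpack_from("<H", header, 26)
            raw_name = blob[pos + 30:pos + 30 + name_len]
            yield pos, raw_name.decode("cp437", errors="replace")
        pos = blob.find(ZIP_LOCAL_HEADER, pos + 1)


def candidate_docx_starts(blob):
    """Offsets whose first entry name suggests an OOXML package (assumption)."""
    return [off for off, name in zip_entry_names(blob)
            if name == "[Content_Types].xml"]
```

A hit only marks a likely start. The end of the archive still has to be found (for example via the end-of-central-directory record `PK\x05\x06`), and none of this helps if the file is fragmented.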

sblundy
I have had trouble reading .docx files as zip files, so don't count too much on that. OTOH, I was having lots of other problems there too, so, 64 mg NaCl
BCS
+4  A: 

Two approaches:

You can mount files as volumes in Linux (as a loop device). Provided your binary blob isn't too corrupted, you'll probably be able to walk the filesystem to find out where your files are located. Is (was) it a FAT partition or NTFS?
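If you don't know which filesystem it was, the first 512 bytes of the blob may tell you. The sketch below is only a rough check, assuming the blob starts at a volume boot sector. If it is an image of the whole drive, the first sector may be a partition table instead, and you would have to look at the start of the partition. The function name is hypothetical.

```python
def guess_filesystem(sector):
    """Guess the filesystem from the first 512 bytes of a volume image."""
    # Boot sectors (and partition tables) end with the signature 55 AA.
    if len(sector) < 512 or sector[510:512] != b"\x55\xaa":
        return "unknown (no boot signature)"
    # NTFS stores the OEM ID "NTFS    " at offset 3.
    if sector[3:11] == b"NTFS    ":
        return "NTFS"
    # FAT32 keeps its filesystem type label at offset 82.
    if sector[82:90] == b"FAT32   ":
        return "FAT32"
    # FAT12/FAT16 keep the same kind of label at offset 54.
    if sector[54:59] in (b"FAT12", b"FAT16"):
        return "FAT12/16"
    return "unknown (possibly a partition table rather than a boot sector)"
```

Usage would be something like `guess_filesystem(open("usb.img", "rb").read(512))`, where `usb.img` stands in for whatever your dump is called. The FAT labels are informational fields, so a damaged or unusual volume may not carry them.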

If that doesn't work, I'd look for this string of bytes:

D0 CF 11 E0 A1 B1 1A E1

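These are the "magic bytes" that open the older binary Office formats. The same signature starts .doc, .xls and .ppt files, since they all share the OLE2 compound file container, so a hit is not necessarily a Word document. They might also occur randomly in other data, but it's a start. You're going to run into MAJOR issues if the files are fragmented.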
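A minimal Python sketch of that search, assuming the blob fits in memory (250 MB should). The sector-alignment flag is an extra heuristic of mine: on a raw filesystem image, files start on sector boundaries, so hits at a multiple of 512 are more likely to be real file starts. That only holds if the blob itself begins on a sector boundary.

```python
# Signature from above: D0 CF 11 E0 A1 B1 1A E1
OLE2_MAGIC = bytes.fromhex("D0CF11E0A1B11AE1")


def find_ole2_candidates(blob, sector_size=512):
    """Return (offset, is_sector_aligned) for every signature hit in blob."""
    hits = []
    pos = blob.find(OLE2_MAGIC)
    while pos != -1:
        hits.append((pos, pos % sector_size == 0))
        pos = blob.find(OLE2_MAGIC, pos + 1)
    return hits
```

For example, `find_ole2_candidates(open("usb.img", "rb").read())` (the file name is a placeholder) gives a list of offsets to inspect by hand or to feed to a carving tool.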

Also, try recreating pieces of the document(s) in Word, saving them to a file, and extracting chunks to search for in the blob (using a binary-safe grep or whatever). Provided you have info from all parts of the file, you should be able to work out WHERE in the blob they are. Piecing it back into a working DOC binary seems far-fetched, but recovering the rest of the text shouldn't be impossible.
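If you remember a phrase from one of the documents, you can also search for it directly. A rough sketch: binary .doc files store text either as 8-bit characters or as 16-bit little-endian Unicode, so it is worth trying both. The choice of `cp1252` for the 8-bit case is an assumption that fits Western-language documents, and the function name is made up.

```python
def find_phrase(blob, phrase):
    """Return {encoding: [offsets]} for phrase in 8-bit and UTF-16LE form."""
    results = {}
    for encoding in ("cp1252", "utf-16-le"):
        # Raises UnicodeEncodeError if the phrase has no cp1252 representation.
        needle = phrase.encode(encoding)
        offsets = []
        pos = blob.find(needle)
        while pos != -1:
            offsets.append(pos)
            pos = blob.find(needle, pos + 1)
        results[encoding] = offsets
    return results
```

Hits that cluster near one of the signature offsets are a decent hint as to which document a fragment belongs to.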

Stefan Mai