Saving a Word 2003 document to XML and then back to .doc results in a smaller file, and probably other changes that I don't know about. A diff of the new document's WordML against the old one shows differences only in the revision save IDs. So, what is getting lost in the roundtrip?

If nothing is actually getting lost, then what explains the few thousand bytes shaved off the file size?

+2  A: 

As far as I know, Word stores some information in DOC files in addition to text and formatting, for example user information, some document history, etc. This information accumulates when using "File > Save". I suppose that saving as XML and re-saving as DOC strips that information.

If I recall correctly, a simple "Save As" already reduces the file size, and I think there used to be a menu item that let you save a version of the DOC file that was significantly smaller than the "File > Save" version.

Thorsten Dittmar
+1  A: 

If you look at a Word document (.doc) in a hex editor, you will see that there are many, many blocks of redundant zeroes. Great format, doc!

Anyway, saving to XML and then back to doc might get rid of some of those thousands of zero bytes.

If you're really curious, just open both files in a hex editor and run a difference algorithm. You can try Hex Workshop or Hex Editor Neo.
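
If you'd rather script the comparison than eyeball it in a hex editor, here is a minimal Python sketch that counts long runs of zero bytes in a file. The function names and the 16-byte threshold are my own choices for illustration, not anything Word defines, and a large count only hints at padding; it doesn't prove the bytes are unused.

```python
import re
from pathlib import Path


def zero_run_stats(data: bytes, min_run: int = 16) -> tuple[int, int]:
    """Return (number of runs, total bytes) for runs of 0x00 bytes
    that are at least min_run bytes long."""
    pattern = re.compile(rb"\x00{%d,}" % min_run)
    runs = [len(m.group()) for m in pattern.finditer(data)]
    return len(runs), sum(runs)


def report(path: str, min_run: int = 16) -> str:
    """Summarize file size and zero-byte runs for one file on disk."""
    data = Path(path).read_bytes()
    count, total = zero_run_stats(data, min_run)
    return f"{path}: {len(data)} bytes, {count} zero runs, {total} zero bytes"
```

Calling `report()` on the original and the round-tripped .doc (the paths are whatever you saved them as) gives a rough idea of how much of the size difference could be zero padding.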

Danra
A: 

My experiments with a few large Word 2003 documents show that saving as XML, then saving that as .doc, does indeed produce a slightly, though not significantly, smaller file. As you point out, the rsidR attributes are different, but that does not account for the reduction in size, since the new rsidRs are typically the same length.

As Danra points out, .doc files have runs of identical bytes. But the smaller file saved as .doc also has such runs, so I believe this is an artifact of the .doc binary format and not information-carrying data. I eyeballed a few of the round-tripped .doc files and could see no difference in appearance at all, supporting the idea that the differences are not information-carrying.

Examining the XML files created after round-tripping shows that the main difference is that several rPr (run properties) elements with no content have been removed. It seems that saving as XML removes unused character styles and properties.
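
For anyone who wants to repeat this check, here is a rough Python sketch that counts empty rPr elements in a WordML file. It assumes the Word 2003 WordprocessingML namespace shown in the code, and `count_empty_rpr` is just a name I made up for illustration.

```python
import xml.etree.ElementTree as ET

# Namespace used by Word 2003 WordML for the w: prefix (assumed here).
W_NS = "http://schemas.microsoft.com/office/word/2003/wordml"


def count_empty_rpr(root: ET.Element) -> int:
    """Count w:rPr elements that have no child elements and no text."""
    tag = f"{{{W_NS}}}rPr"
    return sum(
        1
        for el in root.iter(tag)
        if len(el) == 0 and not (el.text or "").strip()
    )
```

Load each saved XML file with `ET.parse(path).getroot()` and compare the counts before and after the roundtrip.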

Dour High Arch
+3  A: 

The following is just a guess.

A .doc file is actually an OLE structured storage compound file. That format packs multiple streams into a single document in a well-defined way, and its structure is pretty close to a filesystem-in-a-file: for example, it has "sectors" and a sector allocation table. Such an approach makes it possible to edit a document file in place without rewriting it completely.

However, this storage approach results in some redundancy, such as unused sectors. When you roundtrip the file, you effectively recreate it from scratch, and thus any such redundant storage artefacts are eliminated.
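
To test the guess, one could count the sectors that the allocation table marks as free. The Python sketch below reads the compound file header and the FAT sectors listed in it. The offsets follow my reading of the published compound file layout, so treat them as assumptions; it ignores the DIFAT chain used by larger files, and `count_free_sectors` is a hypothetical helper name.

```python
import struct

CFB_MAGIC = bytes.fromhex("D0CF11E0A1B11AE1")  # compound file signature
FREESECT = 0xFFFFFFFF  # FAT entry value marking an unallocated sector


def count_free_sectors(data: bytes) -> tuple[int, int]:
    """Return (free sectors, total sectors) for a compound file image.

    Only the 109 FAT sector locations stored in the header are followed,
    so very large files are not fully covered by this sketch.
    """
    if data[:8] != CFB_MAGIC:
        raise ValueError("not a compound file")
    (sector_shift,) = struct.unpack_from("<H", data, 30)
    sector_size = 1 << sector_shift
    (num_fat_sectors,) = struct.unpack_from("<I", data, 44)
    difat = struct.unpack_from("<109I", data, 76)
    # The header occupies the first sector-sized block of the file.
    total_sectors = (len(data) - sector_size) // sector_size
    fat: list[int] = []
    for sect in difat[:num_fat_sectors]:
        if sect == FREESECT:
            break
        offset = (sect + 1) * sector_size  # sector 0 follows the header
        fat.extend(struct.unpack_from(f"<{sector_size // 4}I", data, offset))
    free = sum(1 for entry in fat[:total_sectors] if entry == FREESECT)
    return free, total_sectors
```

If the original .doc reports noticeably more free sectors than the round-tripped one, that would support the explanation above. Free sectors are only one kind of slack, though, so a match in counts would not rule it out.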

Pavel Minaev
I believe your answer is on target. I have heard the redundancy referred to as "binary dust". I think what you describe is effectively what any "roundtrip" on a file is meant to do: eliminate redundancy. Thanks also for making me aware of those two links.
jJack
Yes, see the "fast save" feature: http://support.microsoft.com/kb/197978
plutext