tags:

views:

244

answers:

1

Hello all,

I have created a python script which pulls data out of OLE streams in Word documents, but am having trouble converting the OLE2-formatted timestamp to something more human-readable :(

The timestamp which is pulled out is 12760233021 but I cannot for the life of me convert this to a date like 12 Mar 2007 or similar.

Any help is greatly appreciated.

EDIT: OK I have ran the script over one of my word documents, which was created on 31/10/2009, 10:05:00. The Create Date in the OLE DocumentSummaryInformation stream is 12901417500.

Another example is a word doc created on 27/10/2009, 15:33:00, gives the Create Date of 12901091580 in the OLE DocumentSummaryInformation stream.

The MSDN documentation on the properties of these OLE streams is http://msdn.microsoft.com/en-us/library/aa380376%28VS.85%29.aspx

The def which pulls these streams out is given below:

import OleFileIO_PL as ole

def enumerateStreams(item):
    # item is an arbitrary file
    if ole.isOleFile('%s' % item):
        loader = ole.OleFileIO('%s' % item)
        # enumerate all the OLE streams in the office file
        streams = loader.listdir()
        streamProps = []
        for stream in streams:
            if stream[0] == '\x05SummaryInformation':
                # get all the properties fro the SummaryInformation OLE stream
                streamProps.append(loader.getproperties(stream))
            elif stream[0] == '\x05DocumentSummaryInformation':
                # get all the properties from the DocumentSummaryInformation stream
                streamProps.append(loader.getproperties(stream))
     return streamProps
+2  A: 

(0) Please clarify "like 12 Mar 2007 or similar": do you mean that you expect the 11-digit int to convert to 12 Mar 2007, or is "12 Mar 2007" merely intended to convey the format in which you want to display the date? If the latter, can't you provide expected results by inspecting some files with MS Word or OpenOffice.org's word processing gadget? How do you intend to verify that any solution that is offered actually works?

(1) Please give more than one (OLE, expected) pair so that correct operation of any proposed solution can be verified with more assurance. If possible, can you create examples with known expected values like 01 Jan 2000, 01 Jan 2001, 02 Jan 2001, 02 Feb 2001?

(2) It is not obvious from "pulls data out of OLE streams" whether you want the file creation etc timestamps in the OLE2 compound document header, or whether you want timestamps that are present in the content. Please say WHERE you are trawling for timestamps. It would also help tremendously if you could give a reference to the MS documentation that relates to the timestamps you are interested in ... surely it must tell you what the format is, even if it does so indirectly by one or two intra/extra-document hops.

(3) Please show HOW you are pulling that out -- is it a string? fixed 11 bytes? Or is it str(some int that you have converted from a 64-bit field)? Converted HOW?? As well as a description, show your conversion code. Don't retype your code from memory; use copy/paste.

Please provide the requested info by editing your question, not as comments.

Update while waiting for info:

The file creation and modification timestamps in an OLE compound document header appear to be 64-bit little-endian integers representing (seconds since 1601-01-01T00:00:00) * 10 ** 7.

The DATE type used in data in OLE2 data appears to be 64-bit little-endian IEEE 754 float representing (days and a fraction thereof) since 1899-12-30T00:00:00. Yes the day is 30, not 31.

Update after examining the 2 examples supplied:

The difference between the two observed timestamps (which will be in your local time) is 325920 seconds:

>>> import datetime
>>> t0 = datetime.datetime(2009,10,27,15,33,0)
>>> t1 = datetime.datetime(2009,10,31,10,5,0)
>>> t1-t0
datetime.timedelta(3, 66720)
>>> secs = 3 * 24 * 60 * 60 + 66720
>>> secs
325920

This is the same as the difference between the two magic numbers:

>>> 12901417500 - 1290191580
325920

So the magic numbers represent seconds since some epoch ...

>>> m1 = 12901417500
>>> days, seconds = divmod(m1, 60*60*24)
>>> epoch = t1 - datetime.timedelta(days, seconds)
>>> epoch
datetime.datetime(1601, 1, 1, 11, 0)

So the magic numbers represent seconds since 1601-01-01T00:00:00Z and your TZ is 11 hours away from UTC.

Those two magic numbers won't fit in 32 bits ... looks like either (a) it is stored in 64 bits as seconds since 1601 (a waste of about 29 bits!) or (b) it is stored as (number of 100-nanosecond units) since 1601 as expected but something is dividing it by 10**7 before you see it.

The documentation reference that you gave merely says that it's a VF_FILETIME (UTC) type. Googling that, I find a couple of MS clues on calling Windows functions to manipulate the timestamps, but no definition as far as I looked. However there are two 3rd party notes (from perlmonks and the Apache POI project) which say much the same thing: """This looks like a Windows VT_FILETIME data type which is a 64 bit unsigned integer representing the number of elapsed 100 nanoseconds since 1 January 1601"""

Update from the crime scene:

Seems you are using OleFileIO_PL to read the files. A quick rummage through the sole source file reveals this:

    elif type == VT_FILETIME:
        value = long(i32(s, offset+4)) + (long(i32(s, offset+8))<<32)
        # FIXME: this is a 64-bit int: "number of 100ns periods
        # since Jan 1,1601".  Should map this to Python time
        value = value / 10000000L # seconds
John Machin