Hello:

I'm trying to extract the date/time when a picture was taken from a CR2 file (Canon's format for raw pictures).

I know the CR2 specification, and I know I can use Python's struct module to extract pieces from a binary buffer.

Briefly, the specification says that in tag 0x0132 (306 decimal) I can find a string of length 20: the date and time.

I tried to get that tag by using:

struct.unpack_from(20*'s', buffer, 0x0132)

but I get

('\x00', '\x00', "'", '\x88, ...[and more crap])

Any ideas?

Edit

Many thanks for the thorough effort! The answers are phenomenal and I learned a lot about handling binary data.

+4  A: 

0x0132 is not the offset, it's the tag number of the date. CR2, like TIFF, is a directory-based format. You have to look up the entry for the (known) tag you are looking for.

Edit: OK, first of all, you have to determine whether the file data is stored in little-endian or big-endian format. The first eight bytes make up the header, and the first two bytes of that header specify the endianness. Python's struct module allows you to handle little- and big-endian data by prefixing a format string with either '<' or '>'. So, assuming data is a buffer containing your CR2 image, you can handle endianness via

header = data[:8]
endian_flag = "<" if header[:2] == "II" else ">"

The format specification states that the first image file directory begins at an offset relative to the beginning of the file, with the offset being specified in the last 4 bytes of the header. So, to get the offset to the first IFD, you can use a line similar to this one:

ifd_offset = struct.unpack("{0}I".format(endian_flag), header[4:])[0]

You can now go ahead and read the first IFD. The number of entries in the directory is stored at that offset as a two-byte value. Thus, you would read the number of entries in the first IFD using:

number_of_entries = struct.unpack("{0}H".format(endian_flag), data[ifd_offset:ifd_offset+2])[0]

A field entry is 12 bytes long, so you can calculate the length of the IFD. After number_of_entries * 12 bytes, there will be another 4 byte long offset, telling you where to look for the next directory. That is basically how you work with TIFF and CR2 images.
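
To make that layout arithmetic concrete, here is a minimal sketch. It is written for Python 3 (unlike the Python 2 snippets in this answer), and the helper name next_ifd_offset and the synthetic demo bytes are made up for illustration; they are not part of the spec or of a real CR2 file.

```python
import struct

def next_ifd_offset(data, ifd_offset, endian_flag="<"):
    """Return the offset stored after the IFD that starts at ifd_offset."""
    (number_of_entries,) = struct.unpack_from(endian_flag + "H", data, ifd_offset)
    # 2 bytes for the entry count, then 12 bytes per field entry
    next_pointer_position = ifd_offset + 2 + number_of_entries * 12
    (next_offset,) = struct.unpack_from(endian_flag + "I", data, next_pointer_position)
    return next_offset

# Synthetic one-entry IFD (hypothetical values): count, one entry, next-IFD pointer
demo = (struct.pack("<H", 1)
        + struct.pack("<HHII", 0x0132, 2, 20, 100)
        + struct.pack("<I", 0x1234))
print(hex(next_ifd_offset(demo, 0)))
```

Calling the helper repeatedly with each returned offset would walk the chain of directories; this sketch only shows a single step.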

The "magic" here is to note that with each of the 12 byte field entries, the first two bytes will be the tag ID. And that is where you look for your tag 0x0132. So, given you know that the first IFD starts at ifd_offset in the file, you can scan the first directory via:

current_position = ifd_offset + 2
for field_offset in xrange(current_position, current_position + number_of_entries*12, 12):
    field_tag = struct.unpack("{0}H".format(endian_flag), data[field_offset:field_offset+2])[0]
    field_type = struct.unpack("{0}H".format(endian_flag), data[field_offset+2:field_offset+4])[0]
    value_count = struct.unpack("{0}I".format(endian_flag), data[field_offset+4:field_offset+8])[0]
    value_offset = struct.unpack("{0}I".format(endian_flag), data[field_offset+8:field_offset+12])[0]

    if field_tag == 0x0132:
        # You are now reading a field entry containing the date and time
        assert field_type == 2 # Type 2 is ASCII
        assert value_count == 20 # You would expect a string length of 20 here
        date_time = struct.unpack("20s", data[value_offset:value_offset+20])[0]
        print date_time

You'd obviously want to refactor that unpacking into a common function and probably wrap the whole format into a nice class, but that is beyond the scope of this example. You can also shorten the unpacking by combining multiple format strings into one, yielding a larger tuple containing all the fields you can unpack into distinct variables, which I left out for clarity.
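
As a rough idea of what the combined format string could look like, here is a small sketch. It is written for Python 3, and the function name read_ifd_entry and the packed sample entry are assumptions made up for this illustration, not taken from a real file.

```python
import struct

def read_ifd_entry(data, field_offset, endian_flag="<"):
    """Unpack one 12-byte IFD field entry with a single struct call."""
    # H = 2-byte tag ID, H = 2-byte type, I = 4-byte count, I = 4-byte value/offset
    return struct.unpack_from(endian_flag + "HHII", data, field_offset)

# Hypothetical entry: tag 0x0132, type 2 (ASCII), 20 values, stored at offset 250
entry = struct.pack("<HHII", 0x0132, 2, 20, 250)
field_tag, field_type, value_count, value_offset = read_ifd_entry(entry, 0)
print(hex(field_tag), field_type, value_count, value_offset)
```

One call replaces the four separate unpack calls in the loop above, at the cost of hiding which bytes belong to which field.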

Jim Brissom
Can you provide an example? I am at a complete loss here ... Thanks!
Arrieta
+1: for the endian check I was too lazy to implement :-)
Jon Cage
Great - I wish I could upvote more.
Arrieta
+3  A: 

Have you taken into account the header which should (according to the spec) precede the IFD block you're talking about?

I looked through the spec and it says the first IFD block follows the 16 byte header. So if we read bytes 16 and 17 (at offset 0x10 hex) we should get the number of entries in the first IFD block. Then we just have to search through each entry until we find a matching tag id which (as I read it) gives us the byte offset of your date / time string.

This works for me:

from struct import *

def FindDateTimeOffsetFromCR2( buffer, ifd_offset ):
    # Read the number of entries in IFD #0.
    # '<' assumes little-endian ("II") data and forces standard sizes;
    # a native 'L' can be 8 bytes on some platforms.
    (num_of_entries,) = unpack_from('<H', buffer, ifd_offset)
    print "ifd #0 contains %d entries"%num_of_entries

    # Work out where the date time is stored
    datetime_offset = -1
    for entry_num in range(0,num_of_entries):  # visit every entry, including the last
        (tag_id, tag_type, num_of_value, value) = unpack_from('<HHLL', buffer, ifd_offset+2+entry_num*12)
        if tag_id == 0x0132:
            print "found datetime at offset %d"%value
            datetime_offset = value
    return datetime_offset

if __name__ == '__main__':
    with open("IMG_6113.CR2", "rb") as f:
        buffer = f.read(1024) # reading the first 1 KB of the file should be enough to find the date / time
        datetime_offset = FindDateTimeOffsetFromCR2(buffer, 0x10)
        print unpack_from(20*'s', buffer, datetime_offset)

Output for my example file is:

ifd #0 contains 14 entries
found datetime at offset 250
('2', '0', '1', '0', ':', '0', '8', ':', '0', '1', ' ', '2', '3', ':', '4', '5', ':', '4', '6', '\x00')
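
The tuple of single characters comes from the 20*'s' format, which asks for twenty one-byte strings; '20s' returns one 20-byte string instead. As a hedged sketch (Python 3, with a made-up helper name parse_tiff_datetime and a synthetic buffer rather than a real CR2 file), the string can then be turned into a datetime object:

```python
import struct
from datetime import datetime

def parse_tiff_datetime(buffer, datetime_offset):
    """Read the 20-byte ASCII date at datetime_offset and return a datetime."""
    # "20s" yields ONE 20-byte string; 20 * "s" would yield 20 one-byte strings
    (raw,) = struct.unpack_from("20s", buffer, datetime_offset)
    text = raw.rstrip(b"\x00").decode("ascii")  # drop the trailing NUL byte
    return datetime.strptime(text, "%Y:%m:%d %H:%M:%S")

# Synthetic buffer (hypothetical): 4 filler bytes, then the date string
sample = b"\x00" * 4 + b"2010:08:01 23:45:46\x00"
taken = parse_tiff_datetime(sample, 4)
print(taken.isoformat())
```

The strptime pattern assumes the "YYYY:MM:DD HH:MM:SS" layout seen in the output above; a file with a blank or malformed date field would raise ValueError.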

[edit] - a revised / more thorough example

from struct import *

recognised_tags = { 
    0x0100 : 'imageWidth',
    0x0101 : 'imageLength',
    0x0102 : 'bitsPerSample',
    0x0103 : 'compression',
    0x010f : 'make',    
    0x0110 : 'model',
    0x0111 : 'stripOffset',
    0x0112 : 'orientation', 
    0x0117 : 'stripByteCounts',
    0x011a : 'xResolution',
    0x011b : 'yResolution',
    0x0128 : 'resolutionUnit',
    0x0132 : 'dateTime',
    0x8769 : 'EXIF',
    0x8825 : 'GPS data'}

def GetHeaderFromCR2( buffer ):
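    # Unpack the 16-byte header into a tuple. The explicit '<' forces standard
    # sizes with no padding (a native 'L' can be 8 bytes on some platforms);
    # byte_order reads the same either way because both of its bytes are equal.
    header = unpack_from('<HHLHBBL', buffer)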

    print "\nbyte_order = 0x%04X"%header[0]
    print "tiff_magic_word = %d"%header[1]
    print "tiff_offset = 0x%08X"%header[2]
    print "cr2_magic_word = %d"%header[3]
    print "cr2_major_version = %d"%header[4]
    print "cr2_minor_version = %d"%header[5]
    print "raw_ifd_offset = 0x%08X\n"%header[6]

    return header

def FindDateTimeOffsetFromCR2( buffer, ifd_offset, endian_flag ):
    # Read the number of entries in IFD #0
    (num_of_entries,) = unpack_from(endian_flag+'H', buffer, ifd_offset)
    print "Image File Directory #0 contains %d entries\n"%num_of_entries

    # Work out where the date time is stored
    datetime_offset = -1

    # Go through all the entries looking for the datetime field
    print " id  | type |  number  |  value   "
    for entry_num in range(0,num_of_entries):

        # Grab this IFD entry
        (tag_id, tag_type, num_of_value, value) = unpack_from(endian_flag+'HHLL', buffer, ifd_offset+2+entry_num*12)

        # Print out the entry for information
        print "%04X | %04X | %08X | %08X "%(tag_id, tag_type, num_of_value, value),
        if tag_id in recognised_tags:
            print recognised_tags[tag_id]

        # If this is the datetime one we're looking for, make a note of the offset
        if tag_id == 0x0132:
            assert tag_type == 2
            assert num_of_value == 20
            datetime_offset = value

    return datetime_offset

if __name__ == '__main__':
    with open("IMG_6113.CR2", "rb") as f:
        # reading the first 1 KB of the file should be enough to find the date/time
        buffer = f.read(1024) 

        # Grab the various parts of the header
        (byte_order, tiff_magic_word, tiff_offset, cr2_magic_word, cr2_major_version, cr2_minor_version, raw_ifd_offset) = GetHeaderFromCR2(buffer)

        # Set the endian flag
        endian_flag = '@'
        if byte_order == 0x4D4D:
            # motorola format
            endian_flag = '>'
        elif byte_order == 0x4949:
            # intel format
            endian_flag = '<'

        # Search for the datetime entry offset
        datetime_offset = FindDateTimeOffsetFromCR2(buffer, 0x10, endian_flag)

        datetime_string = unpack_from(20*'s', buffer, datetime_offset)
        print "\nDatetime: "+"".join(datetime_string)+"\n"
Jon Cage
Thanks @Jon Cage. I'm afraid I don't know how to do that. How can I find out which block precedes?
Arrieta
It would be worth doing the endian check as per Jim's answer for a more robust solution, but my sample worked fine on an AMD Windows 7 machine :-)
Jon Cage
Man this is great - Thank you!
Arrieta