I'm working on a shell script to convert MPO stereographic 3D images into standard JPEG images. An MPO file is just two JPEG images concatenated together.

As such, you can split out the JPEG files by finding the byte offset of the second JPEG's magic number header (0xFFD8FFE1). I've done this manually using hexdump/xxd, grep, head, and tail.

The problem here is grep: what can I use to search a binary directly for a specific magic number, and get back a byte offset? Or should I not use a shell script for this at all? Thanks.

A: 

I think a very simple home-brewed approach will be your best bet. The code for doing this would be very small; how small depends on how many special cases of your binary file format you need to handle.

  1. Use mmap to get a convenient view of your file in memory.
  2. Start scanning, and save the byte-offset in a variable, say start.
  3. Scan until you reach your delimiter, saving the ending offset, in say end.
  4. Create a new file
  5. Memory-map the new file
  6. Copy the byte-range from start to end into the new file.
  7. Close the new file and start scanning again.
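
The steps above can be sketched in Python as follows. This is only an illustration under a few assumptions: the marker is the 0xFFD8FFE1 value from the question, the function names and the output prefix are made up for the example, and the output is written with an ordinary file write rather than a second memory map. Note that mmap refuses to map an empty file.

```python
import mmap
from pathlib import Path

# Magic number taken from the question; adjust it if your files differ.
MARKER = b"\xff\xd8\xff\xe1"


def split_at_marker(data, marker=MARKER):
    """Return (start, end) byte ranges, one per chunk that begins with marker."""
    starts = []
    pos = data.find(marker)
    while pos != -1:
        starts.append(pos)
        pos = data.find(marker, pos + 1)
    # Each chunk ends where the next one starts; the last one ends at EOF.
    ends = starts[1:] + [len(data)]
    return list(zip(starts, ends))


def split_file(path, out_prefix="image"):
    """Write each chunk of the file at path to <out_prefix>01.jpg, 02.jpg, ..."""
    written = []
    with open(path, "rb") as f, mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as view:
        for n, (start, end) in enumerate(split_at_marker(view), 1):
            out = Path(f"{out_prefix}{n:02d}.jpg")
            out.write_bytes(view[start:end])  # slicing the map yields bytes
            written.append(out)
    return written
```

Calling `split_file("mpo_file")` would write one file per occurrence of the marker, so more than two output files is a hint that the marker also appears inside the image data.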
Noah Watkins
+3  A: 

You can do this using bbe (http://bbe-.sourceforge.net/), which is a sed-like program for binary files.

In order to extract the first JPEG use:

bbe -b '/\xFF\xD8\xFF\xE1/:' -e 'D 2' -o first_jpeg mpo_file

And for the second one:

bbe -b '/\xFF\xD8\xFF\xE1/:' -e 'D 1' -o second_jpeg mpo_file

Note that this will not work if the JPEG's magic number occurs somewhere else in the MPO file.
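
To check whether that caveat applies to a particular file, one option is to list every byte offset at which the marker occurs. The sketch below assumes the 0xFFD8FFE1 marker from the question; the function name is hypothetical.

```python
import re

# Magic number taken from the question; adjust it if your files differ.
MARKER = b"\xff\xd8\xff\xe1"


def marker_offsets(data, marker=MARKER):
    """Return the byte offset of every occurrence of marker in data."""
    # re.escape keeps bytes in the marker from being read as regex syntax.
    return [m.start() for m in re.finditer(re.escape(marker), data)]


if __name__ == "__main__":
    import sys

    with open(sys.argv[1], "rb") as f:
        print(marker_offsets(f.read()))
```

If this prints more than two offsets for a two-image MPO file, splitting purely on the marker will cut at least one image short.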

Bart Sas
Thanks for bringing bbe to my attention! By the way, your SourceForge link is broken, and bbe.sf.net is a different project.
drinian
+2  A: 

I think that Bart is on to your biggest problem. If that byte sequence occurs again inside the image data, you will get partial JPEGs.

I did a quick test by concatenating some JPEGs and then extracting them with awk (please note that the magic number in my files ended in 0xE0 and not 0xE1, and that POSIX leaves a multi-character RS unspecified, so this relies on an awk such as gawk):

   # for i in *.jpg ; do cat "$i" ; done > test.mpo
   # awk 'BEGIN {RS="\xFF\xD8\xFF\xE0"; FILENUM=-1} {FILENUM++; if (FILENUM == 0) {next}; out="image0"FILENUM".jpg"; printf "%s",RS$0 > out;}' test.mpo
   # file image0*.jpg
    image01.jpg:  JPEG image data, JFIF standard 1.01
    image010.jpg: JPEG image data, JFIF standard 1.01
    image011.jpg: JPEG image data, JFIF standard 1.01

This seemed to work OK for me, but the issue mentioned above is still unhandled and very real.

phreakocious
I guess 0xE1 in the magic number indicates that it's the second image in the sequence and you never have more than 2 images. Adjust as needed. =)
phreakocious
I'm not sure about that, because I also see 0xE1 at the beginning of the file.
drinian
I'm giving you the answer check because awk is available on every Unix system (and it reminds me that I need to learn more about it :). My shell script is currently doing some rudimentary checks for an image03.jpg or lack of image02.jpg and aborting, which helps to handle the magic number problem. I could also do some checking for EXIF headers. Unfortunately, I'm only aware of one program that can read these files natively -- the Fujifilm Windows app -- although Wikipedia claims that Digikam supports MPO. Will have to look at their source, and my camera's documentation. For now, this is good.
drinian
Glad it worked out for you. http://en.wikipedia.org/wiki/Magic_number_%28programming%29#Examples says that 0xFF 0xD8 is the beginning of the JPEG magic number, so it stands to reason that what follows it is up to the implementation.
phreakocious
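
As a rough illustration of the last comment: 0xFF 0xD8 is the JPEG start-of-image marker, and the two bytes after it identify the first segment, commonly 0xFFE0 (APP0, used by JFIF) or 0xFFE1 (APP1, used by Exif). A minimal sketch for inspecting an extracted chunk, with a hypothetical function name and only those two cases labelled:

```python
SOI = b"\xff\xd8"  # JPEG start-of-image marker

# Only two common cases are labelled; other marker bytes are reported in hex.
KNOWN_SEGMENTS = {0xE0: "APP0 (usually JFIF)", 0xE1: "APP1 (usually Exif)"}


def first_segment(chunk):
    """Describe the marker that follows SOI, or return None if chunk is not JPEG-like."""
    if len(chunk) < 4 or chunk[:2] != SOI or chunk[2] != 0xFF:
        return None
    return KNOWN_SEGMENTS.get(chunk[3], f"marker 0xFF{chunk[3]:02X}")
```

Running this on the start of each extracted file is one cheap way to confirm that a split landed on a real image header rather than in the middle of compressed data, although it cannot rule out a false match on its own.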