ansaurus

Question

Splitting a binary file on binary delimiter?

Answer 1

A:

I think a very simple home-brew approach will be your best bet. The code for this should be quite small, though its exact size depends on how many special cases your binary file format has.

Use mmap to get a convenient view of your file in memory.
Start scanning, and save the starting byte offset in a variable, say start.
Scan until you reach your delimiter, and save that ending offset in another variable, say end.
Create a new file
Memory-map the new file
Copy the byte-range from start to end into the new file.
Close the new file and start scanning again.
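The steps above could be sketched in Python roughly as follows. This is only a minimal sketch under a few assumptions: the function name split_on_delimiter and the out_pattern parameter are made up for illustration, each chunk is taken to begin at a delimiter (anything before the first delimiter is dropped), and the output is written with a plain file write rather than a second memory map. Note that mmap refuses to map an empty file.

```python
import mmap
from pathlib import Path


def split_on_delimiter(src, delimiter, out_pattern="part{:02d}.bin"):
    """Write each delimiter-led chunk of src to its own file; return the paths.

    split_on_delimiter and out_pattern are hypothetical names for this sketch.
    """
    if not delimiter:
        raise ValueError("delimiter must not be empty")
    out_paths = []
    with open(src, "rb") as f, \
            mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as view:
        # The first chunk begins at the first delimiter in the file.
        start = view.find(delimiter)
        while start != -1:
            # Look for the next delimiter after the current one.
            end = view.find(delimiter, start + len(delimiter))
            if end == -1:
                end = len(view)  # the last chunk runs to the end of the file
            out = Path(out_pattern.format(len(out_paths) + 1))
            out.write_bytes(view[start:end])  # copy the byte range
            out_paths.append(out)
            start = end if end < len(view) else -1
    return out_paths
```

For the file in the question, a call might look like split_on_delimiter("mpo_file", b"\xff\xd8\xff\xe1", "image{:02d}.jpg"). Like any split on a fixed byte sequence, it will cut in the wrong place if the delimiter also appears inside the image data.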

Noah Watkins 2010-09-06 15:27:29

Answer 2

+3 A:

You can do this using bbe (http://bbe-.sourceforge.net/), which is a sed-like program for binary files.

To extract the first JPEG, use:

bbe -b '/\xFF\xD8\xFF\xE1/:' -e 'D 2' -o first_jpeg mpo_file

And for the second one:

bbe -b '/\xFF\xD8\xFF\xE1/:' -e 'D 1' -o second_jpeg mpo_file

Note that this will not work if the JPEG's magic number occurs somewhere else in the MPO file.
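One way to check for that before splitting is to list every offset at which the magic number occurs. Here is a small Python sketch of the idea; the function name find_offsets is made up for illustration, and it simply reads the whole file into memory, which should be fine for a file holding a couple of photos.

```python
def find_offsets(data, marker):
    """Return every offset in data at which marker begins."""
    offsets = []
    pos = data.find(marker)
    while pos != -1:
        offsets.append(pos)
        # Resume one byte further on so overlapping matches are found too.
        pos = data.find(marker, pos + 1)
    return offsets


# Example use (mpo_file is the file name used in the commands above):
# from pathlib import Path
# print(find_offsets(Path("mpo_file").read_bytes(), b"\xff\xd8\xff\xe1"))
```

If this reports more offsets than there are images in the file, the magic number also occurs somewhere it should not, and the bbe commands above will split in the wrong place.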

Bart Sas 2010-09-06 17:57:31

Thanks for bringing bbe to my attention! By the way, your Sourceforge link is broken, and bbe.sf.net is a different project.

drinian 2010-09-07 14:09:11

Answer 3

+2 A:

I think that Bart is on to your biggest problem. If that binary sequence occurs anywhere else in the file, you will get partial JPEGs.

I did a quick test by concatenating some JPEGs and then extracting them with awk (please note that the magic number in my files ended in 0xE0 and not 0xE1):

   # for i in *.jpg ; do cat $i ; done > test.mpo 
   # awk 'BEGIN {RS="\xFF\xD8\xFF\xE0"; FILENUM=-1} {FILENUM++; if (FILENUM == 0) {next}; FILENAME="image0"FILENUM".jpg"; printf "%s",RS$0 > FILENAME;}' test.mpo  
   # file image0*.jpg
    image01.jpg:  JPEG image data, JFIF standard 1.01
    image010.jpg: JPEG image data, JFIF standard 1.01
    image011.jpg: JPEG image data, JFIF standard 1.01

This seemed to work OK for me, but the above-mentioned issues are still unhandled and very real.

phreakocious 2010-09-06 22:18:56

I guess 0xE1 in the magic number indicates that it's the second image in the sequence and you never have more than 2 images. Adjust as needed. =)

phreakocious 2010-09-07 03:16:58

I'm not sure about that, because I also see 0xE1 at the beginning of the file.

drinian 2010-09-07 11:54:51

I'm giving you the answer check because awk is available on every Unix system (and it reminds me that I need to learn more about it :). My shell script is currently doing some rudimentary checks for an image03.jpg or lack of image02.jpg and aborting, which helps to handle the magic number problem. I could also do some checking for EXIF headers. Unfortunately, I'm only aware of one program that can read these files natively -- the Fujifilm Windows app -- although Wikipedia claims that Digikam supports MPO. Will have to look at their source, and my camera's documentation. For now, this is good.

drinian 2010-09-07 14:05:12

Glad it worked out for you. http://en.wikipedia.org/wiki/Magic_number_%28programming%29#Examples says that 0xFF 0xD8 is the beginning of the JPEG magic number, so it stands to reason that what follows it is up to the implementation.
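For what it's worth, the two bytes after 0xFF 0xD8 are normally the first segment marker: 0xFF 0xE0 is APP0, which JFIF files use, and 0xFF 0xE1 is APP1, which Exif files use. That would explain both variants seen in this thread. A small Python sketch for looking at the start of a chunk, where the function name describe_jpeg_start and the returned strings are made up for illustration:

```python
# Labels for the two markers discussed in this thread.
APP_MARKERS = {
    0xE0: "APP0 (typically JFIF)",
    0xE1: "APP1 (typically Exif)",
}


def describe_jpeg_start(data):
    """Describe the marker that follows the 0xFF 0xD8 start-of-image bytes."""
    if len(data) < 4 or data[:2] != b"\xff\xd8":
        return "not a JPEG start"
    if data[2] != 0xFF:
        return "start of image followed by an unexpected byte"
    # Indexing a bytes object yields an int, so it can be looked up directly.
    return APP_MARKERS.get(data[3], "marker 0x%02X" % data[3])
```

Running this on the first few bytes of each extracted file shows which marker follows the start-of-image bytes, which may help when choosing the delimiter.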

phreakocious 2010-09-07 21:43:43
