I was attempting a sed replacement in a binary file, but I am beginning to believe that is not possible. Essentially, what I wanted to do was similar to the following:

sed -bi "s/\(\xFF\xD8[[:xdigit:]]\{1,\}\xFF\xD9\)/\1/" file.jpg

The logic I want is: scan through a binary file until the byte sequence FF D8, continue reading until FF D9, and save only what lies between them, including the FFD8 and FFD9 markers themselves (the junk before and after is discarded).

Is there a good way to do this? Even if not using sed?

EDIT: I was just playing around and found what is, IMO, the cleanest way to do it. I am aware that this grep pattern is greedy.

hexdump -ve '1/1 "%.2x"' dirty.jpg | grep -o "ffd8.*ffd9" | xxd -r -p > clean.jpg
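
For comparison, here is a minimal Python sketch of the same carve that makes the greedy/non-greedy choice explicit. The function name carve and its greedy parameter are hypothetical (not from any library), and the sketch assumes the whole file fits in memory.

```python
from typing import Optional

SOI = b"\xff\xd8"  # JPEG start-of-image marker
EOI = b"\xff\xd9"  # JPEG end-of-image marker


def carve(data: bytes, greedy: bool = True) -> Optional[bytes]:
    """Return the bytes from the first SOI through an EOI, markers included.

    greedy=True ends at the last EOI in the data (like grep's ".*");
    greedy=False ends at the first EOI after the start marker.
    Returns None when no complete SOI..EOI span exists.
    """
    start = data.find(SOI)
    if start == -1:
        return None
    body = start + len(SOI)
    # rfind/find with a start offset only search data[body:], but the
    # returned index is still relative to the whole buffer.
    end = data.rfind(EOI, body) if greedy else data.find(EOI, body)
    if end == -1:
        return None
    return data[start:end + len(EOI)]
```

Because this searches bytes rather than a hex string, a match always starts on a byte boundary. In the hexdump pipeline, "ffd8" can in principle also match across two half-bytes (for example inside "0f fd 80"), which would shift everything xxd decodes afterwards.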
A: 

Not having done binary seds before, I can't be sure, but that looks like it will replace all occurrences of what you want with those same occurrences, but leave the rest of the file as is. In other words, I don't think it will change the file at all.

I usually just code up a stdio filter program for small jobs, something like this (filter.c):

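#include <stdio.h>
int main(void) {
    int saving = 0;        /* 1 while between FFD8 and FFD9 */
    int ch, lastch = -1;   /* lastch = previous byte, -1 before any input */
    while ((ch = getchar()) != EOF) {
        if (saving) {
            /* End marker: emit its final byte, then stop echoing. */
            if ((lastch == 0xff) && (ch == 0xd9))
                saving = 0;
            putchar (ch);
        } else {
            /* Start marker: emit both of its bytes and start echoing. */
            if ((lastch == 0xff) && (ch == 0xd8)) {
                saving = 1;
                putchar (lastch);
                putchar (ch);
            }
        }
        lastch = ch;
    }
    return 0;
}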

Compile that then just run your input through it:

gcc -o filter filter.c
./filter <inputfile >outputfile

This is a pretty standard filter program which just starts off by echoing nothing. When it finds the character sequence 0xff/0xd8, it starts echoing. When it finds 0xff/0xd9, it stops.

Keep in mind this is what you asked for in the text - no account is taken as to whether it has hex digits only (as per your regex). If this is a problem, the filter program becomes a little more difficult inasmuch as you'll need to store all characters up to the closing 0xff/0xd9 and only output the lot if they were all valid hex digits.
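
The buffered variant described above could look roughly like the following Python sketch. It is an illustration under assumptions, not a drop-in replacement for filter.c: the function name filter_hex_spans is hypothetical, the input is held in memory, and a span is dropped entirely if anything between the markers is not an ASCII hex digit.

```python
import string

# Byte values of the ASCII characters 0-9, a-f, A-F.
HEX_DIGITS = set(string.hexdigits.encode("ascii"))


def filter_hex_spans(data: bytes) -> bytes:
    """Keep only FFD8..FFD9 spans whose interior is non-empty and all hex digits."""
    out = bytearray()
    pos = 0
    while True:
        start = data.find(b"\xff\xd8", pos)
        if start == -1:
            break
        end = data.find(b"\xff\xd9", start + 2)
        if end == -1:
            break  # unterminated span: nothing more to output
        interior = data[start + 2:end]
        # Output the buffered span only if every interior byte is a hex digit.
        if interior and all(b in HEX_DIGITS for b in interior):
            out += data[start:end + 2]
        pos = end + 2  # resume scanning after the closing marker
    return bytes(out)
```

To use it as a stdin/stdout filter, one option is to read sys.stdin.buffer.read() and write the result to sys.stdout.buffer, so that no text decoding is applied to the binary data.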

Changing 0xff to 'x', 0xd8 to 'y', 0xd9 to 'z' (all to make debugging easier), then piping in :

"hello1xyhello2xzhello3xyhello4xzhello5"

gives you:

xyhello2xzxyhello4xz

as you would expect.

paxdiablo
+1  A: 

sed might be able to do it, but it could be tricky. Here's a Python script that does the same thing (note that it edits the file in-place, which is what I assume you want to do based on your sed script):

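import re

# Open in binary read/write mode so the file can be rewritten in place.
with open('file.jpeg', 'rb+') as f:
    data = f.read()
    # Bytes pattern, since the file is read as bytes. As in the question's
    # regex, only hex-digit characters are accepted between the markers.
    # The group includes the closing FFD9 so that it is kept in the output.
    match = re.search(rb'(\xff\xd8[0-9A-Fa-f]+\xff\xd9)', data)
    if match:
        result = match.group(1)
        f.seek(0)
        f.write(result)
        f.truncate()
    else:
        print('No match')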
Adam Rosenfield
awesome alternative, thank you!
Ryan
+4  A: 

Is there a good way to do this

Yes, of course: use an image editing tool, such as those from ImageMagick, that knows how to edit JPEG metadata (search the net for "linux jpeg exif editor", etc.). I am sure you can find a tool that suits you. Don't try to do this the hard way. :)

ghostdog74
Agree. This is essentially random binary data, so at each position you have a 1 in 2 ** 16 chance of a false positive when searching for any 2-byte sequence. That's about once every 65K of data.
snoopy
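
The estimate in the comment above can be checked with a little arithmetic. The sketch below assumes the bytes are uniformly random and independent, which real file contents only approximate; the function name expected_chance_hits is hypothetical.

```python
def expected_chance_hits(n_bytes: int, pattern_len: int = 2) -> float:
    """Expected number of positions where one fixed pattern appears by chance
    in n_bytes of uniformly random, independent bytes."""
    # A pattern of length k can start at n - k + 1 positions.
    positions = max(n_bytes - pattern_len + 1, 0)
    # Each position matches with probability 1 / 256**k.
    return positions / 256 ** pattern_len
```

Under that assumption, a 2-byte marker such as FFD9 is expected about once per 65,536 bytes, so a 2 MiB chunk of random data would contain roughly 32 chance occurrences.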
exiftool (http://search.cpan.org/dist/Image-ExifTool/exiftool) is the killer application for media metadata.
daxim
Just copying my above comment down here: FYI, the purpose of this question was manual file carving in a RAID 5 scenario. When grabbing stripes and chunks, you get data before and after the JPEG (or any other file). This was meant to clean that up.
Ryan
A: 

Also, this Perl might work (not tested, caveat emptor)... if Python is not installed :)

open(FILE, "file.jpg") || die "no open $!\n";
while (read(FILE, $buff, 8 * 2**10)) {
    $content .= $buff;
}
@matches = ($content =~ /(\xFF\xD8[[:xdigit:]]+?\xFF\xD9)/g);
print STDOUT join("", @matches);

You need to add binmode(FILE); binmode(STDOUT); on DOS or VMS after the open() call - not needed on Unix.

DVK
I will give this a shot when I can, thank you for this alternative!
Ryan
Why the downvote? If this has a bug or doesn't work, please give me the details and I'll fix it. If you think this is off-topic, re-read the OP: "Even if not using sed?". If you're an anti-Perl bigot, don't be a coward and explain yourself.
DVK
Sorry DVK, that was me. I've been bitten by bugs myself when trying to grep for short patterns in binary data. I just think there's a good chance of this mismatching, either on one or other of the anchors or by picking up a completely random 'phantom pattern'. Sooner or later the OP is likely to end up with the odd scrambled JPEG and wonder why! I also downvoted others for the same reason.
snoopy
If you're saying that the OP has an XY problem, please present a better solution than a regex before downvoting regex solutions as "bad". If this answer has a bug, please point it out. If there's a specific pattern where the regexp approach would fail, please clarify that as an answer (again, XY).
DVK
Also, please note that this solution does NOT change the jpg file. Merely outputs found strings (which I'm guessing might be metadata) to standard out for later redirect/consumption
DVK