I am developing a Python app that carves JPEG files out of a block device. Let's just say that it sometimes works and sometimes doesn't. It reads the block device until it finds an ffd8 marker, then keeps the stream open and loops, searching for the closing ffd9. However, I have to treat every ffd9 after the first as a possible end of file, so it tends to be a really intensive operation. Given a device with, let's say, 25 JPEGs as well as lots of other data, the looping is pretty dramatic and it runs through a lot of data.

The program is not the slowest thing in the world, but I think it could be much faster and more efficient. I am looking for a better way to search the block device and extract the data. I also don't want to kill the HDD or the drive holding the image of the block device.

So does anybody know of a better way to systematically handle the searching and extraction of the data?

+2  A: 

The trouble with reading the block device directly is that there is no guarantee that the blocks of any given file are contiguous. That means that even if you find your magic marker bytes 0xFFD8 in block 13, say, there is no guarantee that block 14 belongs to the same file, whether or not it contains the 0xFFD9 end marker. (Most files will start on a block boundary; the end of the file may be anywhere, possibly even across block boundaries.)
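Since most files start on a block boundary, one way to cut down the searching is to test only the first few bytes of each block for the start marker instead of scanning every byte. Here is a rough sketch of that idea, not a tested carver. The function name, the 4096-byte block size and the read batch size are all assumptions for illustration; use the real block size of the file system in question. It checks for FF D8 FF, because the start marker is always followed by another marker, which begins with FF.

```python
JPEG_START = b"\xff\xd8\xff"  # SOI marker plus the first byte of the next marker


def find_jpeg_block_starts(stream, block_size=4096, blocks_per_read=256):
    """Yield offsets of blocks whose first bytes look like a JPEG header.

    Hypothetical helper: block_size is an assumed value, and stream is
    assumed to be a buffered binary file object, so read() only returns
    fewer bytes than requested at the end of the data.
    """
    offset = 0
    chunk_size = block_size * blocks_per_read
    while True:
        # Read many blocks at once so the device is accessed sequentially.
        chunk = stream.read(chunk_size)
        if not chunk:
            break
        # Look only at the start of each block inside the chunk.
        for pos in range(0, len(chunk), block_size):
            if chunk[pos:pos + len(JPEG_START)] == JPEG_START:
                yield offset + pos
        offset += len(chunk)
```

This only gives candidate start offsets. Finding where each file ends, and whether the following blocks even belong to it, is still subject to the fragmentation problem described above.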

What's the better way to deal with it? Well, it depends what you're after - but if you're looking only at currently allocated blocks, then scan the file system using the Python analog of the POSIX C function ftw (nftw), and read each file in turn. This won't find evidence of deleted JPEG files in the free list - if that's what you are after, then you'll need to do as you are doing, more or less, but correlate that information with what you find in the file system proper. Mapping those blocks will (at best) be hard.
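In Python, the usual counterpart of ftw/nftw is os.walk. A minimal sketch of the file system scan might look like the following. The function name and the choice to identify JPEGs by their first three bytes rather than by file extension are my assumptions; it will only see files that are currently allocated, as noted above.

```python
import os

JPEG_START = b"\xff\xd8\xff"  # SOI marker plus the first byte of the next marker


def find_jpeg_files(root):
    """Yield paths under root whose contents begin with a JPEG header.

    Hypothetical helper for illustration; root would be the mount point
    of the file system being examined.
    """
    for dirpath, dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            if os.path.islink(path):
                continue  # skip symbolic links, only look at real files
            try:
                with open(path, "rb") as f:
                    if f.read(len(JPEG_START)) == JPEG_START:
                        yield path
            except OSError:
                # Unreadable file (permissions, I/O error): move on.
                continue
```

Reading just the first few bytes of each file keeps the disk traffic small, which should also help with the worry about hammering the drive.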

Jonathan Leffler