I'm reading in a binary file (a JPEG in this case) and need to find some values in it. Specifically, I'm trying to pick out the image dimensions by looking for the binary structure as detailed here.

I need to find FFC0 in the binary data, skip ahead some number of bytes, and then read 4 bytes (this should give me the image dimensions).

What's a good way of searching for the value in the binary data? Is there an equivalent of 'find', or something like re?

+2  A: 

You could load the file into a string (a bytes object in Python 3, where the search pattern must be a bytes literal as well) and use

s.find(b'\xff\xc0')
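To make the "find the marker, skip ahead, read 4 bytes" step concrete, here is a minimal sketch using only the standard library. The function name `jpeg_size_from_bytes` is made up for illustration, and the layout it assumes (2-byte marker, 2-byte segment length, 1-byte sample precision, then 2-byte height and 2-byte width, all big-endian) is the Start of Frame 0 layout the question refers to.

```python
import struct

SOF0 = b"\xff\xc0"  # Start of Frame 0 (baseline DCT) marker


def jpeg_size_from_bytes(data):
    """Return (width, height) from the first SOF0 marker in data, or None.

    Hypothetical helper for illustration: it does a plain byte search, so it
    trusts that the first FF C0 pair really is the frame header.
    """
    pos = data.find(SOF0)
    if pos == -1:
        return None
    # Skip the marker (2 bytes), the segment length (2) and the precision (1).
    start = pos + 2 + 2 + 1
    if start + 4 > len(data):
        return None  # data ends before the dimensions
    # Height comes first, then width; both are unsigned 16-bit big-endian.
    height, width = struct.unpack(">HH", data[start:start + 4])
    return width, height
```

You would call it as `jpeg_size_from_bytes(open(path, 'rb').read())`, or on just the first few kilobytes of the file. One caveat with a raw search: the bytes FF C0 can also turn up inside an earlier segment (an embedded thumbnail, for example), so treat the result as a best guess.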
David Zaslavsky
If it's a really big file, it's not such a good idea to read it into a string all at once.
icktoofay
I doubt it's so big it's going to be a problem.
Chris B.
Since I'm only looking for the first frame I'll likely be able to read some small part of the file and process that instead of reading the whole file.
Parand
@icktoofay: good point, but I would point out that you can do exactly what Parand is saying, just read the first N bytes and search those. If you did have to search all of a large file for a byte sequence, it could be done iteratively so you wouldn't have to keep the whole thing in memory at once, but the code would be a little more involved, and I didn't think it'd be necessary to get into that here.
David Zaslavsky
@David: Exactly. I was just saying that it would be better to read/scan it in small chunks.
icktoofay
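For anyone who does need the iterative version mentioned in these comments, a rough sketch follows. The name `find_in_stream` and the default chunk size are assumptions, not something from the answers above. The only subtle part is keeping the last `len(needle) - 1` bytes of each chunk, so a marker that straddles two chunks is still found.

```python
import io


def find_in_stream(stream, needle, chunk_size=4096):
    """Return the offset of the first occurrence of needle, or -1.

    The offset is relative to the stream position at the time of the call.
    Only about one chunk is held in memory at any moment.
    """
    overlap = len(needle) - 1
    tail = b""   # trailing bytes carried over from the previous chunk
    offset = 0   # offset of the first byte of `tail + chunk`
    while True:
        chunk = stream.read(chunk_size)
        if not chunk:
            return -1
        window = tail + chunk
        pos = window.find(needle)
        if pos != -1:
            return offset + pos
        # Keep just enough bytes to catch a match split across chunks.
        keep = window[-overlap:] if overlap else b""
        offset += len(window) - len(keep)
        tail = keep


# Small demonstration on an in-memory stream with a deliberately tiny chunk.
demo_data = b"\x00" * 10 + b"\xff\xc0" + b"\x00" * 10
demo_offset = find_in_stream(io.BytesIO(demo_data), b"\xff\xc0", chunk_size=4)
```

With a real file you would pass `open(path, 'rb')` instead of the `io.BytesIO` object.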
+2  A: 

The re module does work with both string and binary data (str in Python 2 and bytes in Python 3), so you can use it as well as str.find for your task.
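As an illustration of the re approach on bytes (Python 3), here is a small sketch. The pattern and the name `jpeg_size_with_re` are my own assumptions about how one might write it: the pattern matches the FF C0 marker, skips the 2-byte length and 1-byte precision, and captures the height and width fields. `re.DOTALL` matters here, because without it `.` would refuse to match a 0x0A byte.

```python
import re

# Marker, 3 skipped bytes (length + precision), then height and width.
SOF0_RE = re.compile(rb"\xff\xc0.{3}(.{2})(.{2})", re.DOTALL)


def jpeg_size_with_re(data):
    """Return (width, height) from the first SOF0 match in data, or None."""
    match = SOF0_RE.search(data)
    if match is None:
        return None
    # Both fields are unsigned 16-bit big-endian integers.
    height = int.from_bytes(match.group(1), "big")
    width = int.from_bytes(match.group(2), "big")
    return width, height
```

Compared with `find`, the regex does the "skip ahead and read 4 bytes" part in the same step, at the cost of a slightly cryptic pattern.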

Andrey Vlasovskikh
+2  A: 

Well, obviously there is PIL: an image opened with its Image module has size as an attribute. If you want to get the size exactly as you describe, without loading the whole file, you will have to step through the file's bytes yourself. Not the nicest way to do it, but it would work.
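If you do go through the file by hand, one option is to hop from segment to segment instead of searching for raw bytes, since each JPEG segment after the initial FF D8 starts with a marker and a 2-byte big-endian length (which counts the length field itself). The sketch below assumes that layout and a seekable binary file object. `jpeg_size_by_segments` is a hypothetical name, and the code only looks for SOF0 (FF C0), as in the question.

```python
import io
import struct


def jpeg_size_by_segments(f):
    """Walk JPEG segments in binary file object f; return (width, height) or None."""
    if f.read(2) != b"\xff\xd8":  # every JPEG starts with the SOI marker
        return None
    while True:
        if f.read(1) != b"\xff":
            return None  # end of file or lost sync; this sketch gives up
        marker = f.read(1)
        while marker == b"\xff":  # skip optional fill bytes before the marker
            marker = f.read(1)
        if not marker or marker in (b"\xd9", b"\xda"):
            return None  # EOF, end of image, or start of scan: no SOF0 seen
        header = f.read(2)
        if len(header) < 2:
            return None
        (length,) = struct.unpack(">H", header)
        if length < 2:
            return None  # malformed segment length
        if marker == b"\xc0":
            body = f.read(5)  # precision (1), height (2), width (2)
            if len(body) < 5:
                return None
            _precision, height, width = struct.unpack(">BHH", body)
            return width, height
        # Not the segment we want: jump over its payload.
        f.seek(length - 2, io.SEEK_CUR)
```

Typical use would be `with open(path, 'rb') as f: size = jpeg_size_by_segments(f)`. Because it never looks inside other segments, an FF C0 pair buried in metadata does not confuse it. Progressive JPEGs use a different frame marker (FF C2), which this sketch deliberately does not handle.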

fridder
+3  A: 

The bitstring module was designed for pretty much this purpose. For your case the following code (which I haven't tested) should help illustrate:

from bitstring import ConstBitStream
# Can initialise from files, bytes, etc.
s = ConstBitStream(filename='your_file')
# Search for the Start of Frame 0 code on a byte boundary
found = s.find('0xffc0', bytealigned=True)
if found:
    # find() leaves the read position at the start of the match
    print("Found start code at byte offset %d." % s.bytepos)
    s0f0, length, bitdepth, height, width = s.readlist(
        'hex:16, uint:16, uint:8, 2*uint:16')
    print("Width %d, Height %d" % (width, height))
Scott Griffiths
So `Bits.find` returns just a boolean and sets the `Bits.bytepos` attribute? Perhaps in the module documentation you should warn that `bitstring` is not thread-safe (not that it matters in this answer, of course).
ΤΖΩΤΖΙΟΥ
@ΤΖΩΤΖΙΟΥ: Yes you have a good point. I don't find it surprising that mutating methods or reading methods aren't thread safe, but using 'find' on a bit-wise immutable object could reasonably be expected to be. To be honest it's never cropped up before but it is something to think about...
Scott Griffiths
Just an idea: `find` could return an object with all necessary information, à la `re.match` and `re.search`. You could have this “BitMatch” class be a subclass of `bool`, for backwards compatibility.
ΤΖΩΤΖΙΟΥ
@ΤΖΩΤΖΙΟΥ: Thanks, that's a reasonable idea although I'm in a good position to break backward compatibility slightly and maybe just have it return the bit position as a single item tuple if found or an empty tuple if not found. I guess anything's better than returning -1 if not found :)
Scott Griffiths