Hello World. I'm tasked with reading a poorly formatted binary file and taking in the variables. Although I need to do it in C++ (ROOT, specifically), I've decided to do it in Python first because Python makes sense to me. My plan is to get it working in Python and then tackle rewriting it in C++, so relying on convenient Python modules won't get me very far down the road.

Basically, I do this:

In [5]: some_value
Out[5]: '\x00I'

In [6]: ''.join([str(ord(i)) for i in some_value])
Out[6]: '073'

In [7]: int(''.join([str(ord(i)) for i in some_value]))
Out[7]: 73

And I know there has to be a better way. What do you think?

EDIT:

A bit of info on the binary format.

[three images describing the binary format; not available]

This is the endian test I am using:

# Read a uint32 for endianness
endian_test = rq1_file.read(uint32)
if endian_test == '\x04\x03\x02\x01':
    print "Endian test: \\x04\\x03\\x02\\x01"
    swapbits = True
elif endian_test == '\x01\x02\x03\x04':
    print "Endian test: \\x01\\x02\\x03\\x04"
    swapbits = False
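
A minimal sketch of the same check in Python 3, assuming `uint32` above is simply the byte count 4 and that the file is opened in binary mode. The function name `read_swap_flag` and the `io.BytesIO` stand-in for `rq1_file` are illustrative, not part of the original reader. It also fails explicitly on an unrecognised marker, a case the snippet above leaves with `swapbits` undefined.

```python
import io

UINT32_SIZE = 4  # assumed meaning of `uint32` in the snippet above


def read_swap_flag(stream):
    """Read the 4-byte endian marker and return True if bytes must be swapped."""
    marker = stream.read(UINT32_SIZE)
    if marker == b"\x04\x03\x02\x01":
        return True
    if marker == b"\x01\x02\x03\x04":
        return False
    # Neither pattern matched: fail loudly instead of guessing.
    raise ValueError("unrecognised endian marker: %r" % (marker,))


# In-memory stand-in for the real file, starting with the marker from the question.
fake_file = io.BytesIO(b"\x04\x03\x02\x01rest of file")
swapbits = read_swap_flag(fake_file)
print(swapbits)  # True
```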
+1  A: 

The equivalent to the Python struct module is a C struct and/or union, so being afraid to use it is silly.

Ignacio Vazquez-Abrams
OTOH being afraid to reverse-engineer C structs and unions that have been simply dumped into a file…
Tadeusz A. Kadłubowski
+2  A: 

You're basically computing a "number-in-base-256", which is a polynomial, so, by Horner's method:

>>> v = 0
>>> for c in some_value: v = v * 256 + ord(c)

More typical would be to use the equivalent bit operations rather than arithmetic; the following is equivalent:

>>> v = 0
>>> for c in some_value: v = v << 8 | ord(c)
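
As a hedged sketch, here is the same loop wrapped in a function for Python 3, where iterating over a `bytes` object already yields integers, so `ord` is not needed. The name `be_bytes_to_int` is made up for illustration; `int.from_bytes` is the built-in that does the same job and is used only as a cross-check.

```python
def be_bytes_to_int(data):
    """Interpret `data` as an unsigned big-endian integer (Horner's method)."""
    value = 0
    for byte in data:  # each item is already an int in Python 3
        value = (value << 8) | byte
    return value


print(be_bytes_to_int(b"\x00I"))  # 73
print(be_bytes_to_int(b"\x01I"))  # 329

# Cross-check against the built-in conversion.
sample = b"\x01\x02\x03\x04"
print(be_bytes_to_int(sample) == int.from_bytes(sample, "big"))  # True
```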
Alex Martelli
A: 

I'm not exactly sure what format the data you want to extract is in, but you might be better off writing a couple of generic utility functions to extract the different data types you need:

def int1b(data, i):
   return ord(data[i])

def int2b(data, i):
   return (int1b(data, i) << 8) + int1b(data, i+1)

def int4b(data, i):
   return (int2b(data, i) << 16) + int2b(data, i+2)

With such functions you can easily extract values from the data, and they can also be translated rather easily to C.
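
For illustration, here is how such helpers might be used to walk through a buffer, ported to Python 3 (where indexing `bytes` returns an int, so `ord` is dropped). The record layout, a 2-byte count followed by a 4-byte value, is invented for this example and is not the asker's format.

```python
def int1b(data, i):
    return data[i]  # Python 3: indexing bytes gives an int


def int2b(data, i):
    return (int1b(data, i) << 8) + int1b(data, i + 1)


def int4b(data, i):
    return (int2b(data, i) << 16) + int2b(data, i + 2)


# Hypothetical record: big-endian uint16 followed by big-endian uint32.
record = b"\x00\x49\x00\x01\x00\x00"
count = int2b(record, 0)  # bytes 0..1
value = int4b(record, 2)  # bytes 2..5
print(count, value)       # 73 65536
```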

sth
Working with binary is not my forte. I'd appreciate it if you could elaborate on the bitwise operation you're doing. But I like it so far.
vgm64
@vgm64: `<<` is the left shift operator; `x << 8` shifts the bits of the integer `x` eight positions to the left. So if `x` and `y` each use eight bits, then in the 16-bit value `(x << 8) + y` the more significant eight bits are occupied by `x` and the remaining eight bits are set to `y`.
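
To make that concrete, a small sketch that prints the bit patterns involved. The two byte values are arbitrary choices for illustration.

```python
x = 0x01  # high byte
y = 0x49  # low byte (73, the letter 'I')

combined = (x << 8) + y

print(format(x << 8, "016b"))    # 0000000100000000
print(format(y, "016b"))         # 0000000001001001
print(format(combined, "016b"))  # 0000000101001001
print(combined)                  # 329
```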
sth
@vgm64: You're dealing with raw binary data anyway. Take your time to learn bit operations.
Tadeusz A. Kadłubowski
+2  A: 
import struct
result, = struct.unpack('>H', some_value)
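
A slightly fuller sketch, assuming Python 3: the format prefix selects the byte order (`>` big-endian, `<` little-endian), and `struct.unpack_from` reads several fields at once starting at an offset. The field layout and the helper name `unpack_record` are hypothetical.

```python
import struct

# Hypothetical layout: uint16, uint32, then a 32-bit float.
LAYOUT = "HIf"


def unpack_record(buf, offset=0, big_endian=True):
    """Unpack one record from `buf` at `offset` in the requested byte order."""
    prefix = ">" if big_endian else "<"
    return struct.unpack_from(prefix + LAYOUT, buf, offset)


buf = struct.pack(">HIf", 73, 65536, 1.5)
print(unpack_record(buf))              # (73, 65536, 1.5)
print(struct.calcsize(">" + LAYOUT))   # 10
```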
Oren
+1 but see Ignacio Vazquez-Abrams's answer.
MatrixFrog
+2  A: 

Your int(''.join([str(ord(i)) for i in some_value])) works ONLY when all bytes except the last byte are zero. Examples:
'\x01I' should be 1 * 256 + 73 == 329; you get 173
'\x01\x02' should be 1 * 256 + 2 == 258; you get 12
'\x01\x00' should be 1 * 256 + 0 == 256; you get 10
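
The failure is easy to reproduce. A sketch in Python 3 (iterating `bytes` yields ints, so `ord` is not needed); `digit_concat` is a made-up name for the expression from the question.

```python
def digit_concat(data):
    """The question's approach: glue the decimal form of each byte together."""
    return int("".join(str(byte) for byte in data))


for raw in (b"\x00I", b"\x01I", b"\x01\x02", b"\x01\x00"):
    expected = int.from_bytes(raw, "big")
    print(repr(raw), "expected", expected, "got", digit_concat(raw))
```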

It also relies on an assumption that integers are stored in big-endian fashion; have you verified this assumption? Are you sure that '\x00I' represents the integer 73, and not the integer 73 * 256 + 0 == 18688 (or something else)? Please let us help you verify this assumption by telling us what brand and model of computer and what operating system were used to create the data.
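
Both readings of the same two bytes, as a quick Python 3 check using the built-in `int.from_bytes`:

```python
raw = b"\x00I"

as_big = int.from_bytes(raw, "big")        # most significant byte first
as_little = int.from_bytes(raw, "little")  # least significant byte first

print(as_big)     # 73
print(as_little)  # 18688
```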

How are negative integers represented?

Do you need to deal with floating-point numbers?

Is the requirement to write it in C++ immutable? What does "(ROOT, specifically)" mean?

If the only dictate is common sense, the preferred order would be:

  1. Write it in Python using the struct module.

  2. Write it in C++ but use C++ library routines (especially if floating-point is involved). Don't re-invent the wheel.

  3. Roll your own conversion routines in C++. You could snarf a copy of the C source for the Python struct module.

Update

Comments after the file format details were posted:

  1. The endianness marker is evidently optional, except at the start of a file. This is dodgy; it relies on the fact that if it is not there, the 3rd and 4th bytes of the block are the 1st 2 bytes of the header string, and neither '\x03\x04' nor '\x02\x01' can validly start a header string. The smart thing to do would be to read SIX bytes -- if first 4 are the endian marker, the next two are the header length, and your next read is for the header string; otherwise seek backwards 4 bytes then read the header string.

  2. The above is in the nuisance category. The negative sizes are a real worry, in that they specify a MAXIMUM length, and there is no mention of how the ACTUAL length is determined. It says "The actual size of the entry is then given line by line". How? There is no documentation of what a "line of data" looks like. The description mentions "lines" many times; are these lines terminated by carriage return and/or line feed? If so, how does one tell the difference between say a line feed byte and the first byte of say a uint16 that belongs to the current "line" of data? If no linefeed or whatever, how does one know when the current line of data is finished? Is there a uintNN size in front of every variable or slice thereof?

  3. Then it says that (2) above (negative size) also applies to the header string. The mind boggles. Do you have any examples (in documentation of the file layout, or in actual files) of "negative size" of (a) header string (b) data "line"?

  4. Is this "decided format" publicly available e.g. documentation on the web? Does the format have a searchable name? Are you sure you are the first person in the world to want to read that format?

  5. Reading that file format, even with a full specification, is no trivial exercise, even for a binary-format-experienced person who's also experienced with Python (which BTW doesn't have a float128). How many person-hours have you been allocated for the task? What are the penalties for (a) delay (b) failure?

  6. Your original question involved fixing your interesting way of trying to parse a uint16 -- doing much more is way outside the scope/intention of what SO questions are all about.
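
The six-byte read from point 1 could be sketched roughly as follows in Python 3. This is one interpretation, not the format's specification: it assumes the header length is a uint16 immediately before the header string, that the marker (when present) precedes that length, and that `\x01\x02\x03\x04` means big-endian while `\x04\x03\x02\x01` means little-endian. The function name and the sample blocks are invented, and negative sizes (point 2) are not handled.

```python
import io
import struct

MARKERS = {
    b"\x01\x02\x03\x04": ">",  # assumed: big-endian
    b"\x04\x03\x02\x01": "<",  # assumed: little-endian
}


def read_header(stream, current_order):
    """Return (byte_order, header_bytes) for the block at the stream position."""
    six = stream.read(6)
    if len(six) < 6:
        raise EOFError("truncated block")
    order = MARKERS.get(six[:4])
    if order is not None:
        # Marker present: bytes 4..5 hold the header length.
        (length,) = struct.unpack(order + "H", six[4:6])
    else:
        # No marker: bytes 0..1 were the length; rewind over the 4 header bytes.
        order = current_order
        (length,) = struct.unpack(order + "H", six[:2])
        stream.seek(-4, io.SEEK_CUR)
    return order, stream.read(length)


with_marker = io.BytesIO(b"\x04\x03\x02\x01" + b"\x05\x00" + b"hello")
without_marker = io.BytesIO(b"\x00\x05" + b"hello")
print(read_header(with_marker, ">"))     # ('<', b'hello')
print(read_header(without_marker, ">"))  # ('>', b'hello')
```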

John Machin
Thanks for the answer John. An endian test is done and if need by, the value is byte swapped before this algorithm. I don't think there are negative ints, but if there are I don't know their representation. There are floating point numbers. ROOT is a CERN framework used for analysis in High Energy Physics, written in C++.
vgm64
I'm analyzing data that came from a distant DAQ system, so I can't tell you much about the format or computer, which is part of the challenge. The reader I'm making mixes in a bit of knowledge of what the binary output contains, but needs to be able to handle a variable binary structure at the same time. Basically, this thing is a pain and I am underqualified.
vgm64
@vgm64: Please explain "an endian test is done and if need [be]". Is done by whom/what? Does the DAQ put some kind of byte-order marker in the file(s)? If you are testing the byte order of the converting computer, don't; it's an unnecessary complication. Please answer the questions about endianness literally. We need to know what you know, not what you are doing. Please answer the must-use-C++ question. Is this a one-off job on a collection of data? If so, can't you use Python to convert the files to a better structure to be read by the presumably-existing C++ software?
John Machin
@vgm64 continued: If there are negative two-byte integers, it is very likely that -1 will appear as 65535, -2 as 65534, etc. After `num = byte0 * 256 + byte1` you would need `if num >= 32768: num -= 65536`. What do you know about the floating-point numbers other than that there are some? E.g. do they occupy 3, 4, 8, 10, 16 or some other number of bytes?
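
That adjustment as a small Python 3 sketch, assuming two's complement and big-endian storage (both unverified for this file). The function name `int16_from_bytes` is made up; `struct`'s `'>h'` format is shown only as a cross-check.

```python
import struct


def int16_from_bytes(byte0, byte1):
    """Combine two bytes (big-endian) and reinterpret as a signed 16-bit value."""
    num = byte0 * 256 + byte1
    if num >= 32768:
        num -= 65536
    return num


print(int16_from_bytes(0xFF, 0xFF))  # -1
print(int16_from_bytes(0xFF, 0xFE))  # -2
print(int16_from_bytes(0x00, 0x49))  # 73
print(struct.unpack(">h", b"\xff\xfe")[0])  # -2
```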
John Machin
@John Machin: I've included some information about the format and about the endian test. The first four bytes read are `\x04\x03\x02\x01` with the file I'm working with now. Yes, I must use C++, because I am using ROOT, which is C++. This is a decided format; creating an intermediate file with "better structure" isn't doable (if I could read it to write a new format, well, then I've read it and processed it, which is what I need help with in the first place). A bit of information about the floating-point numbers is included in the question too. Thanks.
vgm64
I'll just say that I agree with point 6, and the first sentence of point 5. I'm going to go on and continue to bash my head against my code for a while longer, but won't pursue this any longer on SO. I'll make progress eventually. Aren't you glad you don't deal with code (or binary output) from physicists? Thanks a lot for your help. I really like seeing people on SO that are wholly interested in helping others.
vgm64