I have a binary file that was created on a unix machine. It's just a bunch of records written one after another. The record is defined something like this:

struct RECORD {
  UINT32 foo;
  UINT32 bar;
  CHAR fooword[11];
  CHAR barword[11];
  UINT16 baz;
};

I am trying to figure out how I would read and interpret this data on a Windows machine. I have something like this:

fstream f;
f.open("file.bin", ios::in | ios::binary);

RECORD r;

f.read((char*)&r, sizeof(RECORD));

cout << "fooword = " << r.fooword << endl;

I get a bunch of data, but it's not the data I expect. I suspect that my problem has to do with the endian difference between the machines, so I've come to ask about that.

I understand that multi-byte values will be stored little-endian on Windows and big-endian in a Unix environment, and I get that. For two bytes, 0x1234 on Windows will be 0x3412 on a Unix system.

Does endianness affect the byte order of the struct as a whole, or of each individual member of the struct? What approaches would I take to convert a struct created on a unix system to one that has the same data on a windows system? Any links that are more in depth than the byte order of a couple bytes would be great, too!

+2  A: 

It affects each member independently, not the whole struct. Also, it does not affect things like arrays. For instance, it just means the bytes within each int are stored in reverse order.

PS. That said, there could be a machine with weird endianness. What I just said applies to the most commonly used machines (x86, ARM, PowerPC, SPARC).
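To see this for yourself, here is a minimal sketch (the helper names bytes_in_memory and host_is_little_endian are made up for illustration) that exposes the bytes of a 32-bit value in the order they sit in memory:

```cpp
#include <array>
#include <cstdint>
#include <cstring>

// Hypothetical helper: the bytes of a 32-bit value in memory order.
inline std::array<unsigned char, 4> bytes_in_memory(std::uint32_t value) {
    std::array<unsigned char, 4> out{};
    std::memcpy(out.data(), &value, out.size());
    return out;
}

// Hypothetical helper: true when the least significant byte is stored first.
inline bool host_is_little_endian() {
    return bytes_in_memory(1u)[0] == 1;
}
```

On x86, bytes_in_memory(0x12345678) gives 78 56 34 12; on a big-endian machine it gives 12 34 56 78. A char array such as fooword is byte-for-byte identical on both.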

Mehrdad Afshari
"Also, it does not affect things like arrays.": But it affects members of arrays if they are of numeric data types or characters with size > 1 byte!
rstevens
@rstevens: Yes, absolutely. I mean it does not affect the order of elements in an array. Each member is obviously treated like a single variable.
Mehrdad Afshari
+4  A: 

Actually, endianness is a property of the underlying hardware, not the OS.

The best solution is to convert to a standard when writing the data -- Google for "network byte order" and you should find the methods to do this.

Edit: here's the link: http://www.gnu.org/software/hello/manual/libc/Byte-Order.html
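As a rough sketch of what those functions look like in use, assuming the file really is in network (big-endian) byte order: ntohl and ntohs convert from network order to whatever the host uses. The wrappers load_be32 and load_be16 are names invented for this example.

```cpp
#include <cstdint>
#include <cstring>
#if defined(_WIN32)
#include <winsock2.h>   // declares ntohl/ntohs on Windows; link with ws2_32
#else
#include <arpa/inet.h>  // declares ntohl/ntohs on POSIX systems
#endif

// Hypothetical wrapper: read a big-endian 32-bit value from raw bytes.
inline std::uint32_t load_be32(const unsigned char* p) {
    std::uint32_t raw;
    std::memcpy(&raw, p, sizeof raw);  // copy first, p may be unaligned
    return ntohl(raw);                 // no-op on big-endian hosts
}

// Hypothetical wrapper: read a big-endian 16-bit value from raw bytes.
inline std::uint16_t load_be16(const unsigned char* p) {
    std::uint16_t raw;
    std::memcpy(&raw, p, sizeof raw);
    return ntohs(raw);
}
```

The same source then works unchanged on both little-endian and big-endian hosts.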

kdgregory
I don't get to decide how to write the data, that process has been in place for 10 years, and it's not changing.
scottm
In which case you need to discover the exact mechanism that was used, and write your own routines to convert (or find them online). Note, however, that while the writer "is not changing," it had better never move to another architecture or it will change, like it or not.
kdgregory
+1  A: 

You have to correct the endianness of each member larger than one byte, individually. Strings (fooword and barword) do not need to be converted, as they can be seen as sequences of bytes.

However, you must take care of another problem: alignment of the members in your struct. Basically, you must check that sizeof(RECORD) is the same in both the Unix and the Windows code. Compilers usually provide pragmas to define the alignment you want (for example, #pragma pack).
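One way to make that check automatic is a compile-time assertion. This is only a sketch: it assumes the file stores the fields back to back with no padding (4 + 4 + 11 + 11 + 2 = 32 bytes), and PackedRecord is a stand-in name using the standard fixed-width types rather than the original typedefs.

```cpp
#include <cstddef>
#include <cstdint>

#pragma pack(push, 1)   // supported by MSVC, GCC and Clang
struct PackedRecord {
    std::uint32_t foo;
    std::uint32_t bar;
    char fooword[11];
    char barword[11];
    std::uint16_t baz;
};
#pragma pack(pop)

// Fail the build if the in-memory layout does not match the assumed file layout.
static_assert(sizeof(PackedRecord) == 32, "unexpected record size");
static_assert(offsetof(PackedRecord, fooword) == 8, "unexpected fooword offset");
static_assert(offsetof(PackedRecord, baz) == 30, "unexpected baz offset");
```

If either side of the exchange is built with different packing, the build fails instead of silently misreading the file.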

Jem
+4  A: 

As well as the endianness, you need to be aware of padding differences between the two platforms. Particularly if you have odd-length char arrays and 16-bit values, you may well find different numbers of pad bytes between some elements.

Edit: if the structure was written out with no packing, then it should be fairly straightforward. Something like this (untested) code should do the job:

// Functions to swap the endian of 16 and 32 bit values

inline void SwapEndian(UINT16 &val)
{
 val = (val<<8) | (val>>8);
}

inline void SwapEndian(UINT32 &val)
{
 val = (val<<24) | ((val<<8) & 0x00ff0000) |
    ((val>>8) & 0x0000ff00) | (val>>24);
}

Then, once you've loaded the struct, just swap each element:

SwapEndian(r.foo);
SwapEndian(r.bar);
SwapEndian(r.baz);
James Sutherland
I have #pragma pack(push, 1) specified.
scottm
@Scotty, that's not going to help you if the data you are reading already has slack bytes IN it. FWIW, this really shouldn't happen unless the developer of the program was writing out full structs, which is just bad. Structs should always be written out field by field - for situations exactly like this.
Duck
@Duck, I have the source of the definition of the structure (but not for reading or writing them) and it also has pack = 1.
scottm
+1  A: 

You also have to consider alignment differences between the two compilers. Each compiler is allowed to insert padding between members in a structure as best suits the architecture. So you really need to know:

  • How the UNIX program writes to the file.
  • If it is a binary copy of the object, the exact layout of the structure.
  • If it is a binary copy, the endianness of the source architecture.

This is why most programs I have seen that need to be platform neutral serialize the data as a text stream that can be easily read by the standard iostreams.
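For illustration only, a text round trip for a record like the one in the question might look like this. TextRecord, write_text and read_text are hypothetical names, and the sketch assumes the two words never contain whitespace.

```cpp
#include <cstdint>
#include <istream>
#include <ostream>
#include <sstream>
#include <string>

// Hypothetical in-memory form of the record; the words are plain strings here.
struct TextRecord {
    std::uint32_t foo;
    std::uint32_t bar;
    std::string fooword;
    std::string barword;
    std::uint16_t baz;
};

// Write one record as a single whitespace-separated line of text.
inline void write_text(std::ostream& out, const TextRecord& r) {
    out << r.foo << ' ' << r.bar << ' ' << r.fooword << ' '
        << r.barword << ' ' << r.baz << '\n';
}

// Read one record back; returns false on end of input or malformed data.
inline bool read_text(std::istream& in, TextRecord& r) {
    return static_cast<bool>(in >> r.foo >> r.bar >> r.fooword >> r.barword >> r.baz);
}
```

Numbers written this way carry no byte order or padding at all, so the same file reads identically on any platform. A std::stringstream is a handy way to try it without touching the disk.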

Martin York
A: 

Something like this should work:

#include <algorithm>
#include <fstream>

using namespace std;

// UINT32, UINT16 and CHAR are assumed to be fixed-size typedefs,
// such as the ones <windows.h> provides.
struct RECORD {
    UINT32 foo;
    UINT32 bar;
    CHAR fooword[11];
    CHAR barword[11];
    UINT16 baz;
};

// Reverse the bytes of an object in place.
void ReverseBytes( void *start, int size )
{
    char *beg = static_cast<char *>( start );
    char *end = beg + size;

    std::reverse( beg, end );
}

int main() {
    fstream f;
    f.open( "file.bin", ios::in | ios::binary );

    // for each entry {
    RECORD r;
    f.read( (char *)&r, sizeof( RECORD ) );
    ReverseBytes( &r.foo, sizeof( UINT32 ) );
    ReverseBytes( &r.bar, sizeof( UINT32 ) );
    ReverseBytes( &r.baz, sizeof( UINT16 ) );
    // }

    return 0;
}
kitchen
+1  A: 

I like to implement a ByteSwap function for each data type that needs swapping, like this:

inline u_int ByteSwap(u_int in)
{
    u_int out;
    char *indata = (char *)&in;
    char *outdata = (char *)&out;
    outdata[0] = indata[3] ;
    outdata[3] = indata[0] ;

    outdata[1] = indata[2] ;
    outdata[2] = indata[1] ;
    return out;
}

inline u_short ByteSwap(u_short in)
{
    u_short out;
    char *indata = (char *)&in;
    char *outdata = (char *)&out;
    outdata[0] = indata[1] ;
    outdata[1] = indata[0] ;
    return out;
}

Then I add a function to the structure that needs swapping, like this:

struct RECORD {
  UINT32 foo;
  UINT32 bar;
  CHAR fooword[11];
  CHAR barword[11];
  UINT16 baz;
  void SwapBytes()
  {
    foo = ByteSwap(foo);
    bar = ByteSwap(bar);
    baz = ByteSwap(baz);
  }
};

Then you can modify your code that reads (or writes) the structure like this:

fstream f;
f.open("file.bin", ios::in | ios::binary);

RECORD r;

f.read((char*)&r, sizeof(RECORD));
r.SwapBytes();

cout << "fooword = " << r.fooword << endl;

To support different platforms you just need to have a platform specific implementation of each ByteSwap overload.
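As a sketch of what such platform-specific implementations could look like (ByteSwap32 and ByteSwap16 are names chosen for this example, not part of the code above), MSVC and GCC/Clang both ship byte-swap intrinsics, with plain shifts as a fallback elsewhere:

```cpp
#include <cstdint>
#if defined(_MSC_VER)
#include <stdlib.h>  // _byteswap_ulong, _byteswap_ushort
#endif

inline std::uint32_t ByteSwap32(std::uint32_t v) {
#if defined(_MSC_VER)
    return _byteswap_ulong(v);
#elif defined(__GNUC__) || defined(__clang__)
    return __builtin_bswap32(v);
#else
    // Portable fallback: move each byte to its mirrored position.
    return (v << 24) | ((v << 8) & 0x00ff0000u) |
           ((v >> 8) & 0x0000ff00u) | (v >> 24);
#endif
}

inline std::uint16_t ByteSwap16(std::uint16_t v) {
#if defined(_MSC_VER)
    return _byteswap_ushort(v);
#elif defined(__GNUC__) || defined(__clang__)
    return __builtin_bswap16(v);
#else
    return static_cast<std::uint16_t>((v << 8) | (v >> 8));
#endif
}
```

On a platform whose byte order already matches the file, the same two functions would simply return their argument unchanged.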

kevin42
A: 

Don't read directly into a struct from a file! The packing might be different, so you have to fiddle with pragma pack or similar compiler-specific constructs. Too unreliable. A lot of programmers get away with this since their code isn't compiled on a wide range of architectures and systems, but that doesn't mean it's an OK thing to do!

A good alternative approach is to read the header, or whatever, into a buffer and parse from there, to avoid the I/O overhead of atomic operations like reading an unsigned 32-bit integer!

char buffer[32];
char* temp = buffer;  

f.read(buffer, 32);  

RECORD rec;
rec.foo = parse_uint32(temp); temp += 4;
rec.bar = parse_uint32(temp); temp += 4;
memcpy(&rec.fooword, temp, 11); temp += 11;
memcpy(&rec.barword, temp, 11); temp += 11;
rec.baz = parse_uint16(temp); temp += 2;

The declaration of parse_uint32 would look like this:

uint32 parse_uint32(char* buffer)
{
  uint32 x;
  // ...
  return x;
}

This is a very simple abstraction; it doesn't cost anything extra in practice to update the pointer as well:

uint32 parse_uint32(char*& buffer)
{
  uint32 x;
  // ...
  buffer += 4;
  return x;
}

The latter form allows cleaner code for parsing the buffer; the pointer is automatically updated when you parse from the input.

Likewise, memcpy could have a helper, something like:

void parse_copy(void* dest, char*& buffer, size_t size)
{
  memcpy(dest, buffer, size);
  buffer += size;
}
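Putting the helpers together, a filled-in version could look roughly like this. It is a sketch under two assumptions that are not confirmed here: the file is big-endian and each record is 32 bytes with no padding. Record and parse_record are names introduced for the example, and the buffer is treated as unsigned char to keep the shifts free of sign extension.

```cpp
#include <cstddef>
#include <cstdint>
#include <cstring>

struct Record {
    std::uint32_t foo;
    std::uint32_t bar;
    char fooword[11];
    char barword[11];
    std::uint16_t baz;
};

// Read a big-endian 32-bit value and advance the pointer past it.
inline std::uint32_t parse_uint32(const unsigned char*& p) {
    std::uint32_t x = (std::uint32_t(p[0]) << 24) | (std::uint32_t(p[1]) << 16) |
                      (std::uint32_t(p[2]) << 8) | std::uint32_t(p[3]);
    p += 4;
    return x;
}

// Read a big-endian 16-bit value and advance the pointer past it.
inline std::uint16_t parse_uint16(const unsigned char*& p) {
    std::uint16_t x = static_cast<std::uint16_t>((p[0] << 8) | p[1]);
    p += 2;
    return x;
}

// Copy raw bytes (no byte order involved) and advance the pointer.
inline void parse_copy(void* dest, const unsigned char*& p, std::size_t size) {
    std::memcpy(dest, p, size);
    p += size;
}

// Decode one record from a buffer holding at least 32 bytes.
inline Record parse_record(const unsigned char* p) {
    Record r;
    r.foo = parse_uint32(p);
    r.bar = parse_uint32(p);
    parse_copy(r.fooword, p, sizeof r.fooword);
    parse_copy(r.barword, p, sizeof r.barword);
    r.baz = parse_uint16(p);
    return r;
}
```

Because every field is decoded explicitly, the result does not depend on the host's byte order or on how the compiler lays out Record.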

The beauty of this kind of arrangement is that you can have namespaces "little_endian" and "big_endian", then you can do this in your code:

using namespace little_endian;
// do your parsing for little_endian input stream here..

It is easy to switch endianness for the same code, though that is a rarely needed feature.. file formats usually have a fixed endianness anyway.
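A bare-bones sketch of the two namespaces, with one function each to keep it short. The name read_uint32 and the alias file_order are made up for this example, and picking big_endian for the alias is just an assumption about the file.

```cpp
#include <cstdint>

namespace big_endian {
// Most significant byte comes first in the buffer.
inline std::uint32_t read_uint32(const unsigned char* p) {
    return (std::uint32_t(p[0]) << 24) | (std::uint32_t(p[1]) << 16) |
           (std::uint32_t(p[2]) << 8) | std::uint32_t(p[3]);
}
}  // namespace big_endian

namespace little_endian {
// Least significant byte comes first in the buffer.
inline std::uint32_t read_uint32(const unsigned char* p) {
    return std::uint32_t(p[0]) | (std::uint32_t(p[1]) << 8) |
           (std::uint32_t(p[2]) << 16) | (std::uint32_t(p[3]) << 24);
}
}  // namespace little_endian

// Choose the file's byte order in one place; big-endian is assumed here.
namespace file_order = big_endian;
```

Parsing code then calls file_order::read_uint32(ptr), and switching the format's byte order is a one-line change.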

DO NOT abstract this into a class with virtual methods; it would just add overhead, but feel free to if so inclined:

little_endian_reader reader(data, size);
uint32 x = reader.read_uint32();
uint32 y = reader.read_uint32();

The reader object would obviously just be a thin wrapper around a pointer. The size parameter would be for error checking, if any. Not really mandatory for the interface per se.

Notice how the choice of endianness here was made at COMPILATION TIME (since we create a little_endian_reader object), so we incur the virtual method overhead for no particularly good reason, so I wouldn't go with this approach. ;-)

At this stage there is no real reason to keep the "fileformat struct" around as-is; you can organize the data to your liking and not necessarily read it into any specific struct at all. After all, it's just data. When you read files like images, you don't really need the header around: you should have your image container, which is the same for all file types, so the code to read a specific format should just read the file, interpret and reformat the data, and store the payload. =)

I mean, does this look complicated?

uint32 xsize = buffer.read<uint32>();
uint32 ysize = buffer.read<uint32>();
float aspect = buffer.read<float>();
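One possible shape for such a buffer object, as a sketch only: BigEndianReader is a hypothetical class, it assumes big-endian input, and it handles unsigned integer types only (a float overload would need an extra step to reinterpret the 32 bits).

```cpp
#include <cstddef>
#include <cstdint>
#include <stdexcept>
#include <type_traits>

// Hypothetical thin wrapper around a pointer, with bounds checking.
class BigEndianReader {
public:
    BigEndianReader(const unsigned char* data, std::size_t size)
        : cur_(data), end_(data + size) {}

    // Read one big-endian unsigned integer of type T and advance.
    template <typename T>
    T read() {
        static_assert(std::is_unsigned<T>::value, "unsigned integer types only");
        if (remaining() < sizeof(T))
            throw std::out_of_range("read past end of buffer");
        T value = 0;
        for (std::size_t i = 0; i < sizeof(T); ++i)
            value = static_cast<T>((value << 8) | cur_[i]);
        cur_ += sizeof(T);
        return value;
    }

    // Number of bytes not yet consumed.
    std::size_t remaining() const {
        return static_cast<std::size_t>(end_ - cur_);
    }

private:
    const unsigned char* cur_;
    const unsigned char* end_;
};
```

There are no virtual calls here, so with inlining a read<std::uint32_t>() can reduce to a handful of instructions.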

The code can look that nice, and be really low-overhead! If the endianness is the same for the file and the architecture the code is compiled for, the inner loop can look like this:

uint32 value = *reinterpret_cast<uint32*>(ptr); ptr += 4;
return value;

That might be illegal on some architectures, so that optimization might be a Bad Idea; in that case use the slower but more robust approach:

uint32 value = ptr[0] | (static_cast<uint32>(ptr[1]) << 8) | ...; ptr += 4;
return value;
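When the file and the host share a byte order, a middle ground is to copy the bytes with memcpy instead of casting the pointer. This sketch (load_native_u32 is a made-up name) avoids the alignment problem, and compilers commonly turn the fixed-size memcpy into a plain load where that is legal:

```cpp
#include <cstdint>
#include <cstring>

// Read a 32-bit value stored in the host's own byte order from a possibly
// unaligned address, then advance the pointer past it.
inline std::uint32_t load_native_u32(const unsigned char*& ptr) {
    std::uint32_t value;
    std::memcpy(&value, ptr, sizeof value);
    ptr += sizeof value;
    return value;
}
```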

On x86 that can compile into a bswap or a mov, which is reasonably low-overhead if the method is inlined; the compiler would insert a "move" node into the intermediate code, nothing else, which is fairly efficient. If alignment is a problem, the full read-shift-or sequence might get generated, ouch, but that is still not too shabby. A compare-and-branch could allow the optimization: test the address's LSBs and see whether the fast or the slow version of the parsing can be used. But this would mean a penalty for the test on every read. Might not be worth the effort.

Oh, right, we are reading HEADERS and stuff; I don't think that is a bottleneck in too many applications. If some codec is doing some really TIGHT inner loop, again, reading into a temporary buffer and decoding from there is well advised. Same principle: no one reads a byte at a time from a file when processing a large volume of data. Well, actually, I have seen that kind of code very often, and the usual reply to "why do you do it?" is that file systems do block reads and the bytes come from memory anyway. True, but they go through a deep call stack, which is high overhead for getting a few bytes!

Still, write the parser code once and use zillion times -> epic win.

Reading directly into struct from a file: DON'T DO IT FOLKS!