views: 995
answers: 7

I'm working on a file format that should be written and read on several different operating systems and computers. Some of those computers will be x86 machines, others x86-64. Other processors may exist, but I'm not concerned about them yet.

This file format should contain several numbers that would be read like this:

struct LongAsChars{
    char c1, c2, c3, c4;
};

long readLong(FILE* file){
    int b1 = fgetc(file);
    int b2 = fgetc(file);
    int b3 = fgetc(file);
    int b4 = fgetc(file);
    if(b1<0||b2<0||b3<0||b4<0){
        //throwError
    }

    LongAsChars lng;
    lng.c1 = (char) b1;
    lng.c2 = (char) b2;
    lng.c3 = (char) b3;
    lng.c4 = (char) b4;

    long* value = (long*) &lng;

    return *value;
}

and written as:

void writeLong(long x, FILE* f){
    long* xptr = &x;
    LongAsChars* lng = (LongAsChars*) xptr;
    fputc(lng->c1, f);
    fputc(lng->c2, f);
    fputc(lng->c3, f);
    fputc(lng->c4, f);
}

Although this seems to be working on my computer, I'm concerned that it may not work on others, or that the file format may end up being different across computers (32-bit vs. 64-bit computers, for example). Am I doing something wrong? How should I implement my code to use a constant number of bytes per number?

Should I just use fread (which would possibly make my code faster, too) instead?

+7  A: 

Use the types in stdint.h to ensure you get the same number of bytes in and out.

Then you're just left with dealing with endianness issues, which your code probably doesn't handle.

Serializing the long through an aliased char* leaves you with different byte orders in the written file on platforms with different endianness.

You should decompose the value into bytes something like so (unsigned char avoids sign-extension surprises):

unsigned char c1 = (val >>  0) & 0xff;
unsigned char c2 = (val >>  8) & 0xff;
unsigned char c3 = (val >> 16) & 0xff;
unsigned char c4 = (val >> 24) & 0xff;

And recompose them using something like this (the casts keep the shifts from overflowing when int is narrower than long):

val = ((unsigned long) c4 << 24) |
      ((unsigned long) c3 << 16) |
      ((unsigned long) c2 <<  8) |
      ((unsigned long) c1 <<  0);
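
Putting the two halves together, a minimal sketch of complete read/write routines (the function names are illustrative; assumes uint32_t from stdint.h and 8-bit bytes):

#include <stdint.h>
#include <stdio.h>

// write a 32-bit value in little-endian order, independent of host byte order
void writeUint32(uint32_t val, FILE* f){
    fputc((int)((val >>  0) & 0xff), f);
    fputc((int)((val >>  8) & 0xff), f);
    fputc((int)((val >> 16) & 0xff), f);
    fputc((int)((val >> 24) & 0xff), f);
}

// read it back; returns 0 on success, -1 on EOF/error
int readUint32(uint32_t* out, FILE* f){
    uint32_t val = 0;
    for (int i = 0; i < 4; ++i){
        int b = fgetc(f);
        if (b == EOF) return -1;
        val |= (uint32_t) b << (8 * i);
    }
    *out = val;
    return 0;
}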
Michael Burr
I think a union works much better.
GMan
@GMan - don't you have the same problem with a union (unless you conditionally compile a different definition of the union based on the platform's endianness)?
Michael Burr
The reference to stdint was very useful and it'll help a lot!
luiscubal
Use unsigned chars or sign extension will bite you.
George Phillips
@George - are you sure? However, now that you mention it, I think the recompose example will have a problem if sizeof(int) < sizeof(long). I'll fix that in a bit...
Michael Burr
+1  A: 

You might also run into issues with endianness. Why not just use something like NetCDF or HDF, which take care of any portability issues that may arise?
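
For illustration, a minimal HDF5 sketch of storing one 32-bit integer with a fixed on-disk byte order (the file and dataset names here are made up; the library converts to the native in-memory type on read):

#include <hdf5.h>
#include <stdint.h>

void writeValue(int32_t v){
    hid_t file = H5Fcreate("data.h5", H5F_ACC_TRUNC, H5P_DEFAULT, H5P_DEFAULT);
    hsize_t dims[1] = {1};
    hid_t space = H5Screate_simple(1, dims, NULL);
    // the on-disk type is fixed (32-bit little-endian) no matter the host
    hid_t dset = H5Dcreate2(file, "value", H5T_STD_I32LE, space,
                            H5P_DEFAULT, H5P_DEFAULT, H5P_DEFAULT);
    H5Dwrite(dset, H5T_NATIVE_INT32, H5S_ALL, H5S_ALL, H5P_DEFAULT, &v);
    H5Dclose(dset);
    H5Sclose(space);
    H5Fclose(file);
}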

Pete
+1  A: 

Rather than using structures with characters in them, consider a more mathematical approach:

long l  = fgetc(file) << 24;
     l |= fgetc(file) << 16;
     l |= fgetc(file) <<  8;
     l |= fgetc(file) <<  0;

This is a little more direct and clear about what you are trying to accomplish. It can also be implemented in a loop to handle larger numbers.
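
For example, a loop version that reads an n-byte big-endian value might look like this (a sketch; the helper name is made up):

#include <stdio.h>
#include <stdint.h>

// read an n-byte big-endian unsigned value (n <= 8)
uint64_t readBigEndian(FILE* file, int n){
    uint64_t value = 0;
    for (int i = 0; i < n; ++i){
        int b = fgetc(file);
        if (b == EOF){
            //throwError
        }
        value = (value << 8) | (uint64_t) b;
    }
    return value;
}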

Chris Arguin
This reads the file in big-endian format, which is maybe a good thing, but it would still be faster to read a whole `long` and then `bswap` it in memory.
ephemient
@ephemient: assuming you need to bswap it (what if you are big-endian?). Also assuming bswap works (what if your long is 64 bits? Or you are on some forsaken middle-endian machine?)
Chris Arguin
Well, I was thinking "`bswap` if necessary", but that's obviously not what I wrote, and I try not to think about middle-endian machines (have they existed in the last two decades?) What about `s/bswap/ntohl/`? As far as I can tell, common implementations of it drop the high 32 bits if given a 64-bit value, which is the right thing to do.
ephemient
+2  A: 

Well you can use a union, for one:

union LongAsChars{
    long l;
    struct { char c1, c2, c3, c4; } c; // wrapped in a struct so the chars don't all alias the first byte
};

And it's more traditional to use an array, I think:

union LongAsChars{
    long l;
    char c[4];
};

Which makes your routine something like this (no compiler on-hand to test):

long readLong(FILE* file){

    LongAsChars lng;

    for (unsigned i = 0; i < 4; ++i)
    {
        int b = fgetc(file); // keep the int result so EOF is distinguishable from a 0xff byte
        if (b == EOF)
        {
            //throwError
        }

        lng.c[i] = (char) b;
    }

    return lng.l;
}

void writeLong(long x, FILE* f){

    LongAsChars lng;
    lng.l = x;

    for (unsigned i = 0; i < 4; ++i)
    {
        fputc(lng.c[i], f);
    }
}

The only issues you'll get with standard types deal with endianness.

Also, unless I'm missing something, yes, just read and write the long value directly, no need to chop things up, which at best just makes things confusing:

long readLong(FILE* file){

    long x;

    fread(&x, sizeof(long), 1, file);

    return x;
}

void writeLong(long x, FILE* file){

    fwrite(&x, sizeof(long), 1, file);
}
GMan
The last (simple) code goes wrong if the file is written on a platform where sizeof(long) == 4, such as 64-bit Windows, but read on a platform where sizeof(long) == 8, such as 64-bit Linux.
Steve Jessop
A: 

You don't want to use long int. That can be different sizes on different platforms, so is a non-starter for a platform-independent format. You have to decide what range of values needs to be stored in the file. 32 bits is probably easiest.

You say you aren't worried about other platforms yet. I'll take that to mean you want to retain the possibility of supporting them, in which case you should define the byte-order of your file format. x86 is little-endian, so you might think that's the best. But big-endian is the "standard" interchange order if anything is, since it's used in networking.

If you go for big-endian ("network byte order"):

#include <assert.h>
#include <limits.h>     // CHAR_BIT
#include <stdint.h>     // uint32_t
#include <arpa/inet.h>  // htonl/ntohl (winsock2.h on Windows)

// can't be bothered to support really crazy platforms: it is in
// any case difficult even to exchange files with 9-bit machines,
// so we'll cross that bridge if we come to it.
assert(CHAR_BIT == 8);
assert(sizeof(uint32_t) == 4);

{
    // write value
    uint32_t value = 23;
    const uint32_t networkOrderValue = htonl(value);
    fwrite(&networkOrderValue, sizeof(uint32_t), 1, file);
}

{
    // read value
    uint32_t networkOrderValue;
    fread(&networkOrderValue, sizeof(uint32_t), 1, file);
    uint32_t value = ntohl(networkOrderValue);
}

Actually, you don't even need to declare two variables; it's just a bit confusing to replace "value" with its network-order equivalent in the same variable.

It works because "network byte order" is defined to be whatever arrangement of bits results in an interchangeable (big-endian) order in memory. No need to mess with unions because any stored object in C can be treated as a sequence of char. No need to special-case for endianness because that's what ntohl/htonl are for.

If this is too slow, you can start thinking about fiendishly optimised platform-specific byte-swapping, with SIMD or whatever. Or using little-endian, on the assumption that most of your platforms will be little-endian and so it's faster "on average" across them. In that case you'll need to write or find "host to little-endian" and "little-endian to host" functions, which of course on x86 just do nothing.
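
A sketch of what such helpers might look like (hypothetical names; written with shifts so they are correct on any host, and compilers typically reduce them to a plain store/load on little-endian machines):

#include <stdint.h>

// store a value into a byte buffer in little-endian order
static void store_le32(uint32_t value, unsigned char out[4])
{
    out[0] = (unsigned char)(value >>  0);
    out[1] = (unsigned char)(value >>  8);
    out[2] = (unsigned char)(value >> 16);
    out[3] = (unsigned char)(value >> 24);
}

// load a little-endian value from a byte buffer
static uint32_t load_le32(const unsigned char in[4])
{
    return ((uint32_t)in[0] <<  0) |
           ((uint32_t)in[1] <<  8) |
           ((uint32_t)in[2] << 16) |
           ((uint32_t)in[3] << 24);
}

You'd then fwrite/fread the 4-byte buffer rather than the uint32_t itself.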

Steve Jessop
A: 

I believe the most cross-architecture approach is to use the uintXX_t types defined in stdint.h. See the man page here. For example, an int32_t will give you a 32-bit integer on both x86 and x86-64. I use these by default in all of my code now and have had no trouble, as they are fairly standard across all *NIX.
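
For instance (the PRI* macros from inttypes.h supply the matching printf formats):

#include <stdint.h>
#include <inttypes.h>
#include <stdio.h>

int main(void)
{
    int32_t  a = -42;  // exactly 32 bits on every platform that provides it
    uint64_t b = 995;  // exactly 64 bits
    printf("a = %" PRId32 ", b = %" PRIu64 "\n", a, b);
    printf("sizeof a = %zu, sizeof b = %zu\n", sizeof a, sizeof b);
    return 0;
}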

James
A: 

Assuming sizeof(uint32_t) == 4, there are 4! = 24 possible byte orders, of which little-endian and big-endian are the most prominent examples, but others have been used as well (e.g. PDP-endian).

Here are functions for reading and writing 32 bit unsigned integers from a stream, heeding an arbitrary byte order which is specified by the integer whose representation is the byte sequence 0,1,2,3: endian.h, endian.c

The header defines these prototypes

_Bool read_uint32(uint32_t * value, FILE * file, uint32_t order);
_Bool write_uint32(uint32_t value, FILE * file, uint32_t order);

and these constants

LITTLE_ENDIAN
BIG_ENDIAN
PDP_ENDIAN
HOST_ORDER
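
The linked files aren't reproduced here, but one plausible sketch of the write side, assuming each constant packs the byte positions as described (e.g. LITTLE_ENDIAN as 0x03020100, BIG_ENDIAN as 0x00010203):

#include <stdint.h>
#include <stdio.h>

// byte i of 'order' names which byte of 'value' is written at file position i
_Bool write_uint32(uint32_t value, FILE * file, uint32_t order)
{
    for (unsigned i = 0; i < 4; ++i)
    {
        unsigned idx = (order >> (8 * i)) & 0xffu;
        if (fputc((int)((value >> (8 * idx)) & 0xffu), file) == EOF)
            return 0;
    }
    return 1;
}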
Christoph