I have a legacy data structure that's 672 bytes long. These structs are stored in a file, sequentially, and I need to read them in.

While I can read them in one-by-one, it would be nice to do this:

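// I know in advance how many structs to read in
vector<MyStruct> bunchOfStructs;
bunchOfStructs.resize(numberOfStructs);

ifstream ifs;
ifs.open("file.dat", ios::binary);  // binary mode: no newline translation
if (ifs) {
    // read() takes a char*, so the struct pointer has to be cast
    ifs.read(reinterpret_cast<char*>(&bunchOfStructs[0]),
             sizeof(MyStruct) * numberOfStructs);
}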

This works, but I think it only works because the data structure size happens to be a multiple of my compiler's struct alignment, so no padding gets added. I suspect it'll break on another compiler or platform.

The alternative would be to use a for loop to read in each struct one-at-a-time.

The question --> When do I have to be concerned about data alignment? Does dynamically allocated memory in a vector use padding or does STL guarantee that the elements are contiguous?

+2  A: 

For your existing file, your best bet is to figure out its file format and read each field in individually, reading and discarding any alignment bytes.

It's best not to make any assumptions about struct alignment.

To save new data to a file, you could use something like Boost.Serialization.

Brian R. Bondy
That sounds like the safe way. Slow and tedious, but safe. :-) I do know that there is no padding in the on-disk format.
Nate
+1  A: 

In your case, you need to be concerned about alignment whenever it might change the layout of your structure. There are two options to make your code more portable.

First, most compilers have extended attributes or preprocessor directives that let you pack the structure into the minimum space. This can misalign some of the fields within the structure, which might reduce performance, but it fixes the field offsets so they are the same on any machine you build for (as long as the field types have the same sizes). Check your compiler's documentation for #pragma pack(). In GCC you can use __attribute__((__packed__)).
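
As a rough sketch of the packing approach: the record below is hypothetical (PackedRecord and its field names are made up for illustration, not the asker's real layout). It uses the #pragma pack(push, 1) form, which GCC, Clang, and MSVC accept, and static_assert so that a layout mismatch becomes a compile error instead of a silent misread.

```cpp
#include <cstddef>   // offsetof
#include <cstdint>   // fixed-width integer types

// Hypothetical record, for illustration only.
#pragma pack(push, 1)          // no padding between members from here on
struct PackedRecord {
    std::uint8_t  flags;       // offset 0
    std::uint32_t id;          // offset 1 (would typically be 4 without packing)
    std::uint16_t value;       // offset 5
};
#pragma pack(pop)              // restore the previous packing setting

// Fail the build if the compiler did not lay the struct out as expected.
static_assert(sizeof(PackedRecord) == 7, "PackedRecord must be 7 bytes");
static_assert(offsetof(PackedRecord, id) == 1, "id must directly follow flags");
static_assert(offsetof(PackedRecord, value) == 5, "value must directly follow id");
```

Packing only pins down the offsets; it does nothing about byte order, so multi-byte fields still need attention if the file came from a different architecture.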

Second, you can add explicit padding to your structure. This option keeps the performance properties of the original structure while making its layout unambiguous. For example:

#include <stdint.h>  /* standard fixed-width types */

struct s {
    uint8_t  field1;
    uint8_t  pad0[3];
    uint16_t field2;
    uint8_t  pad1[2];
    uint32_t field3;
};
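
If you go the explicit-padding route, it may be worth having the compiler confirm that your pad fields match the layout it actually produces. A minimal sketch, mirroring the struct above under the hypothetical name PaddedRecord and assuming the usual alignments (2 bytes for 16-bit integers, 4 bytes for 32-bit integers):

```cpp
#include <cstddef>   // offsetof
#include <cstdint>   // fixed-width integer types

// Same fields as the example struct; the name PaddedRecord is illustrative.
struct PaddedRecord {
    std::uint8_t  field1;
    std::uint8_t  pad0[3];   // explicit padding up to offset 4
    std::uint16_t field2;
    std::uint8_t  pad1[2];   // explicit padding up to offset 8
    std::uint32_t field3;
};

// If the compiler inserted any padding of its own, these stop the build.
static_assert(offsetof(PaddedRecord, field2) == 4, "field2 expected at offset 4");
static_assert(offsetof(PaddedRecord, field3) == 8, "field3 expected at offset 8");
static_assert(sizeof(PaddedRecord) == 12, "PaddedRecord expected to be 12 bytes");
```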
Carl Norum
+1  A: 

More than alignment, you should worry about endianness. The STL guarantees that the storage in a vector is laid out the same as an array, but the multi-byte integer fields in the structure itself will be stored in a different byte order on, say, little-endian x86 than on a big-endian machine such as SPARC or classic PowerPC.
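
One common way around byte-order differences is to assemble each integer from individual bytes rather than reading it straight into an int. The sketch below assumes the file stores its integers little-endian (that is an assumption, the question doesn't say); load_le32 and load_le16 are illustrative helper names, not library functions.

```cpp
#include <cstdint>   // fixed-width integer types

// Decode a 32-bit unsigned integer stored little-endian at p.
// The result is the same on little- and big-endian hosts, because the
// bytes are combined arithmetically instead of being reinterpreted.
inline std::uint32_t load_le32(const unsigned char* p) {
    return  static_cast<std::uint32_t>(p[0])
         | (static_cast<std::uint32_t>(p[1]) << 8)
         | (static_cast<std::uint32_t>(p[2]) << 16)
         | (static_cast<std::uint32_t>(p[3]) << 24);
}

// Same idea for a 16-bit field.
inline std::uint16_t load_le16(const unsigned char* p) {
    return static_cast<std::uint16_t>(p[0] | (p[1] << 8));
}
```

For a big-endian file the shifts would simply be applied in the opposite order.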

As for the alignment thing, Google for #pragma pack(1).

Potatoswatter
+3  A: 

The standard requires you to be able to create an array of a struct type. When you do so, the array is required to be contiguous. That means, whatever size is allocated for the struct, it has to be one that allows you to create an array of them. To ensure that, the compiler can allocate extra space inside the structure, but cannot require any extra space between the structs.
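
To make that concrete, here is a small sketch with a hypothetical struct (Sample is a made-up name) whose members typically force the compiler to pad. The padding shows up in sizeof(Sample), not as gaps between elements, and the helper checks that a vector's elements sit exactly sizeof(Sample) apart.

```cpp
#include <cstddef>   // std::size_t
#include <vector>

// Hypothetical struct; most compilers pad after `tag` to align `value`.
struct Sample {
    char tag;
    int  value;
};

// An array of N elements occupies exactly N * sizeof(element) bytes:
// any padding is inside sizeof(Sample), never between elements.
static_assert(sizeof(Sample[4]) == 4 * sizeof(Sample),
              "arrays have no gaps between elements");

// Returns true if element i starts exactly i * sizeof(Sample) bytes
// after the start of the vector's storage, for every i.
inline bool is_contiguous(const std::vector<Sample>& v) {
    const char* base = reinterpret_cast<const char*>(v.data());
    for (std::size_t i = 0; i < v.size(); ++i) {
        const char* elem = reinterpret_cast<const char*>(&v[i]);
        if (elem != base + i * sizeof(Sample)) return false;
    }
    return true;
}
```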

The space for the data in a vector is (normally) allocated with ::operator new (via an Allocator class), and ::operator new is required to allocate space that's properly aligned to store any type.

You could supply your own Allocator and/or overload ::operator new -- but if you do, your version is still required to meet the same requirements, so it won't change anything in this respect.

In other words, exactly what you want is required to work as long as the data in the file was created in essentially the same way you're trying to read it back in. If it was created on another machine or with a different compiler (or even the same compiler with different flags) you have a fair number of potential problems -- you might get differences in endianness, padding in the struct, and so on.

Edit: Given that you don't know whether the structs have been written out in the format expected by the compiler, you not only need to read the structs one at a time -- you really need to read the items in the structs one at a time, then put each into a temporary struct, and finally add that filled-in struct to your collection.

Fortunately, you can overload operator>> to automate most of this. It won't make the read any faster, but it keeps the calling code cleaner:

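#include <algorithm>
#include <cassert>
#include <fstream>
#include <iterator>
#include <vector>

struct whatever {
    int x, y, z;
    char stuff[672-3*sizeof(int)];

    // Reads one record field by field, so in-memory padding never matters.
    // Assumes the file's ints have the host's size and byte order.
    friend std::istream &operator>>(std::istream &is, whatever &w) {
       is.read(reinterpret_cast<char *>(&w.x), sizeof(w.x));
       is.read(reinterpret_cast<char *>(&w.y), sizeof(w.y));
       is.read(reinterpret_cast<char *>(&w.z), sizeof(w.z));
       return is.read(w.stuff, sizeof(w.stuff));
    }
};

int main(int argc, char **argv) {
    std::vector<whatever> data;

    assert(argc>1);

    // Binary mode, so the stream does no newline translation
    std::ifstream infile(argv[1], std::ios::binary);

    std::copy(std::istream_iterator<whatever>(infile),
              std::istream_iterator<whatever>(),
              std::back_inserter(data));
    return 0;
}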
Jerry Coffin
Perfect. I do know that there is no padding between the structs on disk, and no padding *inside* the structs on disk. But I suppose I have no portable way of knowing if the compiler is going to add padding inside the structs in memory. So it looks like I need to read things in one-at-a-time to be safe.
Nate
A: 

If you're writing OO code that requires the knowledge of the inner workings of a class, you're doing it wrong. You should assume nothing about the inner workings of the class; you should only assume that the methods and properties work the same on any platform/compiler.

You would probably be better off implementing a class that emulates the functionality of the vector (perhaps by subclassing the vector). Acting as a proxy, it could load only those structures that the caller actually accesses. It could also deal with any endian issues at the same time, so the calling code should work on any platform or compiler.

Nathan