I'm reading binary data from a file, specifically from a zip file. (For details on the zip format structure, see http://en.wikipedia.org/wiki/ZIP_%28file_format%29)

I've created a struct that stores the data:

typedef struct {
                                            /*Start Size            Description                                 */
    int signature;                          /*   0  4   Local file header signature = 0x04034b50                */
    short int version;                      /*   4  2   Version needed to extract (minimum)                     */
    short int bit_flag;                     /*   6  2   General purpose bit flag                                */
    short int compression_method;           /*   8  2   Compression method                                      */
    short int time;                         /*  10  2   File last modification time                             */
    short int date;                         /*  12  2   File last modification date                             */
    int crc;                                /*  14  4   CRC-32                                                  */
    int compressed_size;                    /*  18  4   Compressed size                                         */
    int uncompressed_size;                  /*  22  4   Uncompressed size                                       */
    short int name_length;                  /*  26  2   File name length (n)                                    */
    short int extra_field_length;           /*  28  2   Extra field length (m)                                  */
    char *name;                             /*  30  n   File name                                               */
    char *extra_field;                      /*30+n  m   Extra field                                             */

} ZIP_local_file_header;

The size returned by sizeof(ZIP_local_file_header) is 40, but summing sizeof for each individual field gives a total of 38.

Consider the following struct:

typedef struct {
    short int x;
    int y;
} FOO;

sizeof(FOO) returns 8 because memory seems to be allocated in 4-byte units. Storing x reserves 4 bytes, even though it really needs only 2. Another short int would fill the remaining 2 bytes of that allocation, but since the next member is an int, a further 4 bytes are allocated and the 2 leftover bytes are wasted.

To read data from the file, we can use the fread function:

ZIP_local_file_header p;
fread(&p,sizeof(ZIP_local_file_header),1,file);

But because there are unused bytes in the middle of the struct, the data isn't read correctly.

What can I do to store the data in ZIP_local_file_header sequentially and efficiently, without wasting any bytes?

+2  A: 

The solution is compiler-specific, but for instance in GCC, you can force it to pack the structure more tightly by appending __attribute__((packed)) to the definition. See http://gcc.gnu.org/onlinedocs/gcc-3.2.3/gcc/Type-Attributes.html.
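As a rough sketch of what that could look like for the fixed part of the header (GCC and Clang only). The type name ZIP_packed_header and the helper zip_assert_packed_layout are made up for illustration, and the fixed-width types from <stdint.h> are an assumption that replaces the int/short int of the question:

```c
#include <assert.h>
#include <stddef.h>  /* offsetof */
#include <stdint.h>  /* uint16_t, uint32_t */

/* Fixed 30-byte part of the local file header. The variable-length
   name and extra field are left out on purpose.
   __attribute__((packed)) is a GCC/Clang extension, not standard C. */
typedef struct {
    uint32_t signature;
    uint16_t version;
    uint16_t bit_flag;
    uint16_t compression_method;
    uint16_t time;
    uint16_t date;
    uint32_t crc;
    uint32_t compressed_size;
    uint32_t uncompressed_size;
    uint16_t name_length;
    uint16_t extra_field_length;
} __attribute__((packed)) ZIP_packed_header;

/* Hypothetical helper: aborts if the compiler did not produce the
   offsets listed in the question's comment column. */
static void zip_assert_packed_layout(void)
{
    assert(sizeof(ZIP_packed_header) == 30);
    assert(offsetof(ZIP_packed_header, crc) == 14);
    assert(offsetof(ZIP_packed_header, compressed_size) == 18);
    assert(offsetof(ZIP_packed_header, name_length) == 26);
}
```

Packing removes the padding but does nothing about byte order, and accessing misaligned members can be slower or unsupported on some architectures.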

Oli Charlesworth
A: 

Also, name and extra_field most likely will not contain any meaningful data, at least not between runs of the program, since they are pointers.

Amigable Clark Kant
I know, but my problem is that I have 5 `short int` members and the memory allocated is 8 bytes, but only 6 are used.
Ricardo
+3  A: 

C structs are just a way of grouping related pieces of data together; they do not specify a particular layout in memory (just as the width of an int isn't defined either). Byte order (little-endian or big-endian) is also not defined, and depends on the processor.

Different compilers, or the same compiler on different architectures or operating systems, may all lay out structs differently.

As the file format you want to read is defined in terms of which bytes go where, a struct, although it looks very convenient and tempting, isn't the right solution. You need to treat the file as a char[] and pull out the bytes you need and shift them in order to make numbers composed of multiple bytes, etc.
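A minimal sketch of that approach, assuming the 30 fixed header bytes have already been read into a buffer. The names ZIP_header_fields, read_le16, read_le32 and zip_parse_local_header are hypothetical; the offsets are the ones listed in the question:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* In-memory form of the header; its layout does not have to match the file. */
typedef struct {
    uint32_t signature;
    uint16_t version, bit_flag, compression_method, time, date;
    uint32_t crc, compressed_size, uncompressed_size;
    uint16_t name_length, extra_field_length;
} ZIP_header_fields;

/* Assemble a little-endian 16-bit value from two bytes. */
static uint16_t read_le16(const unsigned char *p)
{
    return (uint16_t)(p[0] | (p[1] << 8));
}

/* Assemble a little-endian 32-bit value from four bytes. */
static uint32_t read_le32(const unsigned char *p)
{
    return (uint32_t)p[0] | ((uint32_t)p[1] << 8)
         | ((uint32_t)p[2] << 16) | ((uint32_t)p[3] << 24);
}

/* Decode the 30-byte fixed header from buf.
   Returns 0 on success, -1 if the signature does not match. */
static int zip_parse_local_header(const unsigned char *buf, ZIP_header_fields *h)
{
    assert(buf != NULL && h != NULL);
    h->signature = read_le32(buf + 0);
    if (h->signature != 0x04034b50u)
        return -1;
    h->version            = read_le16(buf + 4);
    h->bit_flag           = read_le16(buf + 6);
    h->compression_method = read_le16(buf + 8);
    h->time               = read_le16(buf + 10);
    h->date               = read_le16(buf + 12);
    h->crc                = read_le32(buf + 14);
    h->compressed_size    = read_le32(buf + 18);
    h->uncompressed_size  = read_le32(buf + 22);
    h->name_length        = read_le16(buf + 26);
    h->extra_field_length = read_le16(buf + 28);
    return 0;
}
```

Because every value is built from individual bytes, the result does not depend on the host's padding rules or byte order.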

Adrian Smith
+1 for proposing the portable solution.
Amardeep
That is the solution I have now, but it makes the reading code more complex and dependent on the structure.
Ricardo
@Adrian: struct members will be laid out in the order they are declared. From 6.7.2.1, para 13: "Within a structure object, the non-bit-field members and the units in which bit-fields reside have addresses that *increase in the order in which they are declared*. A pointer to a structure object, suitably converted, points to its initial member (or if that member is a bit-field, then to the unit in which it resides), and vice versa. There may be unnamed padding within a structure object, but not at its beginning." Emphasis mine.
John Bode
Ah OK, I didn't realize that, thanks!
Adrian Smith
+1  A: 

In order to meet the alignment requirements of the underlying platform, structs may have "padding" bytes between members so that each member starts at a properly aligned address.
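To see the padding concretely, offsetof from <stddef.h> reports where each member starts. A small sketch using the FOO struct from the question; the two helper names are made up for illustration:

```c
#include <assert.h>
#include <stddef.h>  /* offsetof, size_t */

typedef struct {
    short int x;
    int y;
} FOO;

/* Hypothetical helper: total bytes in FOO not occupied by any member. */
static size_t foo_padding_bytes(void)
{
    size_t members = sizeof(short int) + sizeof(int);
    assert(sizeof(FOO) >= members);
    return sizeof(FOO) - members;
}

/* Hypothetical helper: bytes skipped between the end of x and the start of y. */
static size_t foo_gap_before_y(void)
{
    return offsetof(FOO, y) - sizeof(short int);
}
```

On a typical platform with a 2-byte short and a 4-byte, 4-byte-aligned int, both helpers would report 2, but the exact values are implementation-defined.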

There are several ways around this: one is to read each element of the header separately using the appropriately-sized member:

fread(&p.signature, sizeof p.signature, 1, file);
fread(&p.version, sizeof p.version, 1, file);
...

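Another is to use bit fields in your struct definition; these can be packed more tightly than ordinary members, although their exact layout is implementation-defined and some compilers will not let a field straddle a storage-unit boundary, so check sizeof before relying on it. A further downside is that bit fields must be int, signed int, or unsigned int or, as of C99, _Bool; you may have to cast the raw data to the target type to interpret it correctly: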

typedef struct {                 
    unsigned int signature          : 32;
    unsigned int version            : 16;                
    unsigned int bit_flag           : 16;
    unsigned int compression_method : 16;              
    unsigned int time               : 16;
    unsigned int date               : 16;
    unsigned int crc                : 32;
    unsigned int compressed_size    : 32;                 
    unsigned int uncompressed_size  : 32;
    unsigned int name_length        : 16;    
    unsigned int extra_field_length : 16;
} ZIP_local_file_header;

You may also have to do some byte-swapping in each member if the file's byte order differs from your system's. ZIP files store multi-byte values in little-endian order, so this applies when your system is big-endian.
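A byte swap of that kind could be sketched as follows. All function names here are hypothetical, and the runtime endianness probe is just one common way of detecting host byte order:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Reverse the two bytes of a 16-bit value. */
static uint16_t swap16(uint16_t v)
{
    return (uint16_t)((v >> 8) | (v << 8));
}

/* Reverse the four bytes of a 32-bit value. */
static uint32_t swap32(uint32_t v)
{
    return ((v >> 24) & 0x000000FFu) | ((v >> 8) & 0x0000FF00u)
         | ((v << 8)  & 0x00FF0000u) | (v << 24);
}

/* Returns 1 if the first byte in memory is the most significant one. */
static int host_is_big_endian(void)
{
    const uint16_t probe = 0x0102;
    return *(const unsigned char *)&probe == 0x01;
}

/* Convert a 32-bit value that was read raw from a little-endian file. */
static uint32_t le32_to_host(uint32_t v)
{
    return host_is_big_endian() ? swap32(v) : v;
}

/* Convert n 16-bit values in place; swaps only on big-endian hosts. */
static void le16_array_to_host(uint16_t *v, size_t n)
{
    size_t i;
    assert(v != NULL || n == 0);
    if (!host_is_big_endian())
        return;
    for (i = 0; i < n; i++)
        v[i] = swap16(v[i]);
}
```

On a little-endian host the conversion functions return their input unchanged, so the same reading code can be used on both kinds of system.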

Note that name and extra field aren't part of the struct definition; when you read from the file, you're not going to be reading pointer values for the name and extra field, you're going to be reading the actual contents of the name and extra field. Since you don't know the sizes of those fields until you read the rest of the header, you should defer reading them until after you've read the structure above. Something like

ZIP_local_file_header p;
char *name = NULL;
char *extra = NULL;
...
fread(&p, sizeof p, 1, file);
/* allocate one extra byte for the terminating 0 */
if ((name = malloc(p.name_length + 1)) != NULL)
{
    fread(name, p.name_length, 1, file);
    name[p.name_length] = 0;
}
if ((extra = malloc(p.extra_field_length + 1)) != NULL)
{
    fread(extra, p.extra_field_length, 1, file);
    extra[p.extra_field_length] = 0;
}
John Bode
Very good explanation. But if I pass a pointer to the structure to a function and use the address of a field, I get these errors: zip.c:42:2: error: cannot take address of bit-field ‘signature’ and zip.c:42:2: error: ‘sizeof’ applied to a bit-field
Ricardo
@Ricardo - you should either pass pointers to struct members as defined in your *original* struct type **or** use bitfields and pass the address of the entire struct. You cannot take the address of a bit field.
John Bode
+1  A: 

It's been a while since I worked with zip-compressed files, but I do remember the practice of adding my own padding to hit the 4-byte alignment rules of PowerPC arch.

At a minimum, you need to define each element of your struct to be the size of the piece of data you want to read in. Don't just use 'int', as its size is defined by the platform and compiler and may vary.

Do something like this in a header:

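typedef unsigned long   unsigned32;  /* assumes a 4-byte long; on most 64-bit Unix systems long is 8 bytes, so use unsigned int there */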
typedef unsigned short  unsigned16;
typedef unsigned char   unsigned8;
typedef unsigned char   byte;

Then instead of just int, use an unsigned32 where you have a known 4-byte value, and unsigned16 for any known 2-byte values.

This will help you see where you can add padding bytes to hit 4-byte alignment, or where you can group two 2-byte elements to make up a 4-byte alignment.

Ideally you can use a minimum of padding bytes (which can be used to add additional data later as you expand the program) or none at all if you can align everything to 4-byte boundaries with variable-length data at the end.
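With a C99 or later compiler, the fixed-size typedefs described above can be built on <stdint.h> instead of guessing which basic type has the right width. A small sketch that keeps the same typedef names; the size-check typedefs and the fixed_widths_ok helper are additions for illustration only:

```c
#include <assert.h>
#include <limits.h>  /* CHAR_BIT */
#include <stdint.h>  /* uint8_t, uint16_t, uint32_t */

typedef uint32_t unsigned32;
typedef uint16_t unsigned16;
typedef uint8_t  unsigned8;
typedef uint8_t  byte;

/* Compile-time checks: a negative array size is a compile error, so
   the build fails if a typedef ever has the wrong width. */
typedef char unsigned32_must_be_4_bytes[(sizeof(unsigned32) == 4) ? 1 : -1];
typedef char unsigned16_must_be_2_bytes[(sizeof(unsigned16) == 2) ? 1 : -1];

/* Hypothetical runtime check, mainly useful when the typedefs are
   written by hand for a compiler without <stdint.h>. */
static int fixed_widths_ok(void)
{
    assert(CHAR_BIT == 8);
    return sizeof(unsigned32) == 4
        && sizeof(unsigned16) == 2
        && sizeof(unsigned8) == 1;
}
```

The compile-time checks cost nothing at run time and would also catch a hand-written typedef such as unsigned long on a platform where long is 8 bytes.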

ExitToShell