Are there any libraries or guides for how to read and parse binary data in C?

I am looking at some functionality that will receive TCP packets on a network socket and then parse that binary data according to a specification, turning the information into a more useable form by the code.

Are there any libraries out there that do this, or even a primer on performing this type of thing?

+4  A: 

You don't really need to parse binary data in C; just cast a pointer to whatever you think the data should be.

struct SomeDataFormat
{
    ....
};

struct SomeDataFormat* pParsedData = (struct SomeDataFormat*) pBuffer;

Just be wary of endian issues, type sizes, reading off the end of buffers, etc.
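A minimal sketch of those precautions, assuming a hypothetical 8-byte big-endian header (`WireHeader` and `parse_header` are illustrative names, not part of the answer above). It checks the length before touching the buffer, uses fixed-width types, and copies the bytes with `memcpy` rather than dereferencing a cast pointer, so the buffer need not be aligned:

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>
#include <arpa/inet.h>  /* ntohs, ntohl (POSIX) */

/* Hypothetical wire header: every field is naturally aligned,
 * so no padding is expected between members. */
struct WireHeader {
    uint16_t type;
    uint16_t length;
    uint32_t sequence;
};

/* Fail the build if the compiler inserts padding after all (C11). */
_Static_assert(sizeof(struct WireHeader) == 8, "unexpected padding in WireHeader");

/* Returns 0 on success, -1 if the buffer is too short to hold a header. */
int parse_header(const unsigned char *buf, size_t len, struct WireHeader *out)
{
    if (len < sizeof *out)
        return -1;                        /* never read off the end of the buffer */
    memcpy(out, buf, sizeof *out);        /* copy, so buf need not be aligned */
    out->type     = ntohs(out->type);     /* wire is assumed big-endian */
    out->length   = ntohs(out->length);
    out->sequence = ntohl(out->sequence);
    return 0;
}
```

The field values still need validating afterwards; this only covers size, alignment and byte order.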

Gwaredd
Or different compilers, etc. That's *really* fragile code, IMO.
Jon Skeet
Agreed. I think all the many etc's are why he wants a library to do it.
Sam Hoice
yeah -.- while that approach is reasonable as long as you are on the same machine, doing that in network programming should really be avoided.
Johannes Schaub - litb
The data structure should be defined as packed on both ends, and after taking care of endian issues you are perfectly safe.
Ilya
At least some of the time, the structures you need are already defined---and in a way that is locally correct----in a set of API headers somewhere. This is common for OS services, etc.
dmckee
If you must write it yourself, look into which #pragmas your compiler provides to ensure that you get the packing and endianness right.
dmckee
Extremely dangerous method when you read packets from the network: it makes it very easy for an attacker to send you severely invalid data.
bortzmeyer
I've used this on multiple projects and never had any problems. You need to watch for packing and alignment issues - easy fix with a pragma. Endian and invalid/dangerous data you need to check for whatever method you use. Serialisation is (a) slower (b) over engineering imo (in most cases)
Gwaredd
@Gwaredd - Did you use the same method even for network programming?
codingfreak
Please do not just cast structs, it is VERY fragile, see Casey Barker's response for the proper way of doing it.
kalleh
+3  A: 

You might be interested in Google Protocol Buffers, which is basically a serialization framework. It's primarily for C++/Java/Python (those are the languages supported by Google) but there are ongoing efforts to port it to other languages, including C. (I haven't used the C port at all, but I'm responsible for one of the C# ports.)

Jon Skeet
There are many ways to serialize data (Protocol Buffers is nice but it is just one of them; there is also XML, JSON, ASN.1+BER, etc). They work only if you control the specification of the protocol. If that is not the case, your method does not work.
bortzmeyer
Absolutely. If you're not in control of the protocol, you basically have to do it manually.
Jon Skeet
+1  A: 

I don't really understand what kind of library you are looking for. A generic library that takes any binary input and parses it into an unknown format? I'm not sure such a library can exist in any language. I think you need to elaborate on your question a little.

Edit:
OK, so after reading Jon's answer it seems there is a library, or rather a code generation tool. But as many have stated, if you just cast the data to the appropriate data structure with appropriate care (i.e. using packed structures and taking care of endian issues), you are fine. Using such a tool with C is overkill.

Ilya
+8  A: 

The standard way to do this in C/C++ is really casting to structs, as 'gwaredd' suggested.

It is not as unsafe as one would think. You first cast to the struct that you expected, as in his/her example, then you test that struct for validity. You have to test for max/min values, termination sequences, etc.
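As a rough illustration of that validity test, here is a sketch for a hypothetical message made only of single-byte members (so padding and byte order do not come into it). The `Greeting` layout and the 1..3 version range are assumptions for the example:

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical message layout, for illustration only. */
struct Greeting {
    unsigned char version;     /* valid values assumed to be 1..3 */
    unsigned char name_len;    /* bytes used in name, excluding the NUL */
    char          name[16];    /* must be NUL-terminated */
};

/* Returns 1 if the message passes every check, 0 otherwise. */
int greeting_is_valid(const struct Greeting *g)
{
    if (g->version < 1 || g->version > 3)
        return 0;                                  /* min/max check */
    if (g->name_len >= sizeof g->name)
        return 0;                                  /* must leave room for the NUL */
    if (memchr(g->name, '\0', sizeof g->name) == NULL)
        return 0;                                  /* terminator must be inside the array */
    if (strlen(g->name) != g->name_len)
        return 0;                                  /* declared and actual length agree */
    return 1;
}
```

The `memchr` check comes before `strlen` on purpose, so `strlen` never runs past the end of the array.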

Check out http://mixter.void.ru/rawip.html for a quick example using raw sockets.

Whatever platform you are on, you must read Unix Network Programming, Volume 1: The Sockets Networking API. Buy it, borrow it, steal it ( the victim will understand, it's like stealing food or something... ), but do read it.

After reading the Stevens, most of this will make a lot more sense.

kervin
I'm skeptical of the method "cast then check". If you don't check, you risk getting invalid data. And if you check, what's the point of casting? Checking will be as slow as traditional parsing.
bortzmeyer
+2  A: 

Parsing/formatting binary structures is one of the very few things that is easier to do in C than in higher-level/managed languages. You simply define a struct that corresponds to the format you want to handle and the struct is the parser/formatter. This works because a struct in C represents a precise memory layout (which is, of course, already binary). See also kervin's and gwaredd's replies.

Matt Campbell
+1  A: 

Basically, the suggestions about casting to a struct work, but please be aware that numbers can be represented differently on different architectures.

To deal with endian issues network byte order was introduced - common practice is to convert numbers from host byte order to network byte order before sending the data and to convert back to host order on receipt. See functions htonl, htons, ntohl and ntohs.
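A small sketch of that convention, assuming a POSIX system where `htonl` and `ntohl` come from `<arpa/inet.h>`. The helper names `put_u32_net` and `get_u32_net` are made up for the example:

```c
#include <stdint.h>
#include <string.h>
#include <arpa/inet.h>  /* htonl, ntohl (POSIX) */

/* Store a 32-bit value into buf in network (big-endian) byte order. */
void put_u32_net(unsigned char *buf, uint32_t host_value)
{
    uint32_t net = htonl(host_value);   /* host order -> network order */
    memcpy(buf, &net, sizeof net);
}

/* Load a 32-bit value from buf and convert it back to host order. */
uint32_t get_u32_net(const unsigned char *buf)
{
    uint32_t net;
    memcpy(&net, buf, sizeof net);
    return ntohl(net);                  /* network order -> host order */
}
```

Whatever the host's byte order, `put_u32_net(buf, 0x01020304)` leaves the bytes 01 02 03 04 in `buf`, and `get_u32_net(buf)` gives back the original value.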

And really consider kervin's advice - read UNP. You won't regret it!

qrdl
+4  A: 

Let me restate your question to see if I understood properly. You are looking for software that will take a formal description of a packet and then will produce a "decoder" to parse such packets?

If so, the reference in that field is PADS. A good article introducing it is PADS: A Domain-Specific Language for Processing Ad Hoc Data. PADS is very complete but unfortunately under a non-free licence.

There are possible alternatives (I did not mention non-C solutions). Apparently, none can be regarded as completely production-ready:

If you read French, I summarized these issues in Génération de décodeurs de formats binaires.

bortzmeyer
@bortzmeyer These are all news to me. Thanks for the info!
Bklyn
+3  A: 

In my experience, the best way is to first write a set of primitives, to read/write a single value of some type from a binary buffer. This gives you high visibility, and a very simple way to handle any endianness-issues: just make the functions do it right.

Then, you can for instance define structs for each of your protocol messages, and write pack/unpack (some people call them serialize/deserialize) functions for each.

As a base case, a primitive to extract a single 8-bit integer could look like this (assuming an 8-bit char on the host machine, you could add a layer of custom types to ensure that too, if needed):

const void * read_uint8(const void *buffer, unsigned char *value)
{
  /* A void pointer cannot be dereferenced or incremented, so go through vptr. */
  const unsigned char *vptr = buffer;
  *value = *vptr++;
  return vptr;
}

Here, I chose to return the value by reference, and return an updated pointer. This is a matter of taste, you could of course return the value and update the pointer by reference. It is a crucial part of the design that the read-function updates the pointer, to make these chainable.

Now, we can write a similar function to read a 16-bit unsigned quantity:

const void * read_uint16(const void *buffer, unsigned short *value)
{
  unsigned char lo, hi;

  buffer = read_uint8(buffer, &hi);
  buffer = read_uint8(buffer, &lo);
  *value = (hi << 8) | lo;
  return buffer;
}

Here I assumed incoming data is big-endian, this is common in networking protocols (mainly for historical reasons). You could of course get clever and do some pointer arithmetic and remove the need for a temporary, but I find this way makes it clearer and easier to understand. Having maximal transparency in this kind of primitive can be a good thing when debugging.
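The write side can follow the same pattern. This is only a sketch; the names `write_uint8` and `write_uint16` are my guess at what the matching primitives might be called, and it makes the same 8-bit char and big-endian assumptions:

```c
/* Write one byte and return the advanced pointer (assumes 8-bit char). */
void *write_uint8(void *buffer, unsigned char value)
{
    unsigned char *p = buffer;
    *p++ = value;
    return p;
}

/* Write a 16-bit value, most significant byte first (big-endian). */
void *write_uint16(void *buffer, unsigned short value)
{
    buffer = write_uint8(buffer, (unsigned char)((value >> 8) & 0xFF));
    buffer = write_uint8(buffer, (unsigned char)(value & 0xFF));
    return buffer;
}
```

Because each call returns the advanced pointer, calls chain naturally: `write_uint16(write_uint16(buf, a), b)` writes four consecutive bytes.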

The next step would be to start defining your protocol-specific messages, and write read/write primitives to match. At that level, think about code generation; if your protocol is described in some general, machine-readable format, you can generate the read/write functions from that, which saves a lot of grief. This is harder if the protocol format is clever enough, but often doable and highly recommended.
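If an external generator is more than you need, one in-language approximation is an X-macro: the field list is written once and both the struct and the reader are expanded from it. This is only a sketch of the idea, not a specific tool; the `Login` message and its field sizes are invented for the example, and every field is stored as a `uint32_t` to keep it short:

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical message description: X(field name, size in bytes on the wire). */
#define LOGIN_FIELDS(X) \
    X(version, 1)       \
    X(flags,   2)       \
    X(user_id, 4)

/* Struct generated from the description. */
#define DECLARE_FIELD(name, size) uint32_t name;
struct Login { LOGIN_FIELDS(DECLARE_FIELD) };

/* Read `size` bytes (at most 4), big-endian, and return the value. */
static uint32_t read_be(const unsigned char *p, size_t size)
{
    uint32_t v = 0;
    while (size--)
        v = (v << 8) | *p++;
    return v;
}

/* Reader generated from the same description; returns bytes consumed.
 * The caller must make sure buf holds at least 7 bytes. */
#define READ_FIELD(name, size) out->name = read_be(p, size); p += size;
size_t login_read(const unsigned char *buf, struct Login *out)
{
    const unsigned char *p = buf;
    LOGIN_FIELDS(READ_FIELD)
    return (size_t)(p - buf);
}
```

Adding a field then means adding one line to `LOGIN_FIELDS`; a matching writer could be expanded from the same list.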

unwind
+6  A: 

I have to disagree with many of the responses here. I strongly suggest you avoid the temptation to cast a struct onto the incoming data. It seems compelling and might even work on your current target, but if the code is ever ported to another target/environment/compiler, you'll run into trouble. A few reasons:

Endianness: The architecture you're using right now might be big-endian, but your next target might be little-endian. Or vice-versa. You can overcome this with macros (ntoh and hton, for example), but it's extra work and you have to make sure you call those macros every time you reference the field.

Alignment: The architecture you're using might be capable of loading a multi-byte word at an odd-addressed offset, but many architectures cannot. If a 4-byte word straddles a 4-byte alignment boundary, the load may pull garbage. Even if the protocol itself doesn't have misaligned words, sometimes the byte stream itself is misaligned. (For example, although the IP header definition puts all 4-byte words on 4-byte boundaries, often the ethernet header pushes the IP header itself onto a 2-byte boundary.)

Padding: Your compiler might choose to pack your struct tightly with no padding, or it might insert padding to deal with the target's alignment constraints. I've seen this change between two versions of the same compiler. You could use #pragmas to force the issue, but #pragmas are, of course, compiler-specific.

Bit Ordering: The ordering of bits inside C bitfields is compiler-specific. Plus, the bits are hard to "get at" for your runtime code. Every time you reference a bitfield inside a struct, the compiler has to use a set of mask/shift operations. Of course, you're going to have to do that masking/shifting at some point, but best not to do it at every reference if speed is a concern. (If space is the overriding concern, then use bitfields, but tread carefully.)
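The padding point is easy to check for yourself. A sketch, assuming a hypothetical 5-byte wire format (a 1-byte tag immediately followed by a 4-byte value); `Naive` and `naive_matches_wire` are illustrative names:

```c
#include <stddef.h>   /* offsetof */
#include <stdint.h>

/* Hypothetical wire format: 1-byte tag, then a 4-byte value, 5 bytes total. */
struct Naive {
    uint8_t  tag;
    uint32_t value;
};

enum { WIRE_SIZE = 5, WIRE_VALUE_OFFSET = 1 };

/* Returns 1 if struct Naive happens to match the wire layout
 * on this compiler and target, 0 otherwise. */
int naive_matches_wire(void)
{
    return sizeof(struct Naive) == WIRE_SIZE
        && offsetof(struct Naive, value) == WIRE_VALUE_OFFSET;
}
```

On most mainstream ABIs the compiler inserts three padding bytes after `tag`, so this returns 0 and a cast onto the raw bytes would read `value` from the wrong place.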

All this is not to say "don't use structs." My favorite approach is to declare a friendly native-endian struct of all the relevant protocol data without any bitfields and without concern for these issues, then write a set of symmetric pack/parse routines that use the struct as a go-between.

typedef struct _MyProtocolData
{
    Bool myBitA;  // Using a "Bool" type wastes a lot of space, but it's fast.
    Bool myBitB;
    Word32 myWord;  // You have a list of base types like Word32, right?
} MyProtocolData;

Void myProtocolParse(const Byte *pProtocol, MyProtocolData *pData)
{
    // Somewhere, your code has to pick out the bits.  Best to just do it one place.
    // The parentheses matter: >> binds tighter than &, so mask first, then shift.
    pData->myBitA = (*(pProtocol + MY_BITS_OFFSET) & MY_BIT_A_MASK) >> MY_BIT_A_SHIFT;
    pData->myBitB = (*(pProtocol + MY_BITS_OFFSET) & MY_BIT_B_MASK) >> MY_BIT_B_SHIFT;

    // Endianness and Alignment issues go away when you fetch byte-at-a-time.
    // Here, I'm assuming the protocol is big-endian.
    // You could also write a library of "word fetchers" for different sizes and endiannesses.
    // The casts keep the shifts unsigned; a bare Byte would be promoted to int first.
    pData->myWord  = (Word32)*(pProtocol + MY_WORD_OFFSET + 0) << 24;
    pData->myWord += (Word32)*(pProtocol + MY_WORD_OFFSET + 1) << 16;
    pData->myWord += (Word32)*(pProtocol + MY_WORD_OFFSET + 2) << 8;
    pData->myWord += (Word32)*(pProtocol + MY_WORD_OFFSET + 3);

    // You could return something useful, like the end of the protocol or an error code.
}

Void myProtocolPack(const MyProtocolData *pData, Byte *pProtocol)
{
    // Exercise for the reader!  :)
}

Now, the rest of your code just manipulates data inside the friendly, fast struct objects and only calls the pack/parse when you have to interface with a byte stream. There's no need for ntoh or hton, and no bitfields to slow down your code.
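For what it's worth, here is one way the pack side might look, together with its inverse so the pair can be tested. The layout is my assumption, not part of the answer above: byte 0 carries bit A in bit 7 and bit B in bit 6, and bytes 1 to 4 hold the word big-endian. Standard `<stdint.h>` types stand in for `Byte` and `Word32`:

```c
#include <stdint.h>

/* Hypothetical layout, not from the original answer:
 *   byte 0     : bit 7 = bitA, bit 6 = bitB, other bits zero
 *   bytes 1..4 : 32-bit word, big-endian */
enum { BITS_OFFSET = 0, WORD_OFFSET = 1, PACKED_SIZE = 5 };
enum { BIT_A_SHIFT = 7, BIT_B_SHIFT = 6 };

typedef struct {
    int      bitA;   /* 0 or 1 */
    int      bitB;   /* 0 or 1 */
    uint32_t word;
} ProtocolData;

/* Serialize *data into out, which must have room for PACKED_SIZE bytes. */
void protocol_pack(const ProtocolData *data, uint8_t *out)
{
    out[BITS_OFFSET] = (uint8_t)(((data->bitA ? 1u : 0u) << BIT_A_SHIFT) |
                                 ((data->bitB ? 1u : 0u) << BIT_B_SHIFT));
    out[WORD_OFFSET + 0] = (uint8_t)(data->word >> 24);
    out[WORD_OFFSET + 1] = (uint8_t)(data->word >> 16);
    out[WORD_OFFSET + 2] = (uint8_t)(data->word >> 8);
    out[WORD_OFFSET + 3] = (uint8_t)(data->word);
}

/* Inverse of protocol_pack: in must hold PACKED_SIZE bytes. */
void protocol_parse(const uint8_t *in, ProtocolData *data)
{
    data->bitA = (in[BITS_OFFSET] >> BIT_A_SHIFT) & 1;
    data->bitB = (in[BITS_OFFSET] >> BIT_B_SHIFT) & 1;
    data->word = ((uint32_t)in[WORD_OFFSET + 0] << 24) |
                 ((uint32_t)in[WORD_OFFSET + 1] << 16) |
                 ((uint32_t)in[WORD_OFFSET + 2] << 8)  |
                  (uint32_t)in[WORD_OFFSET + 3];
}
```

Keeping the two routines side by side makes a round-trip test trivial: pack a struct, parse the bytes, and compare.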

Casey Barker
Does this code work even for passing a structure via sockets?
codingfreak
It's expressly good for sockets -- especially when you don't want to make assertions about the endianness/bus width/alignment of the processes on either end of the socket.
Casey Barker