views:

268

answers:

6

I'm reading from a byte stream that contains a series of variable length descriptors which I'm representing as various structs/classes in my code. Each descriptor has a fixed length header in common with all the other descriptors, which are used to identify its type.

Is there an appropriate model or pattern I can use to best parse and represent each descriptor, and then perform an appropriate action depending on it's type?

+8  A: 

I've written lots of these types of parser.

I recommend that you read the fixed length header, and then dispatch to the correct constructor to your structures using a simple switch-case, passing the fixed header and stream to that constructor so that it can consume the variable part of the stream.

Will
what exactly is wrong with this answer? this is the best one yet!
Matt Joiner
+1 for simplicity and ease of implementation.
Matt Davison
Agreed, +1, this is simple and concise.
Alex Martelli
+1  A: 

This sounds like it might be a job for the Factory Method or perhaps Abstract Factory. Based on the header you choose which factory method to call, and that returns an object of the relevant type.

Whether this is better than simply adding constructors to a switch statement depends on the complexity and the uniformity of the objects you're creating.

Kylotan
+2  A: 

This is a common problem in file parsing. Commonly, you read the known part of the descriptor (which luckily is fixed-length in this case, but isn't always), and branch it there. Generally I use a strategy pattern here, since I generally expect the system to be broadly flexible - but a straight switch or factory may work as well.

The other question is: do you control and trust the downstream code? Meaning: the factory / strategy implementation? If you do, then you can just give them the stream and the number of bytes you expect them to consume (perhaps putting some debug assertions in place, to verify that they do read exactly the right amount).

If you can't trust the factory/strategy implementation (perhaps you allow the user-code to use custom deserializers), then I would construct a wrapper on top of the stream (example: SubStream from protobuf-net), that only allows the expected number of bytes to be consumed (reporting EOF afterwards), and doesn't allow seek/etc operations outside of this block. I would also have runtime checks (even in release builds) that enough data has been consumed - but in this case I would probably just read past any unread data - i.e. if we expected the downstream code to consume 20 bytes, but it only read 12, then skip the next 8 and read our next descriptor.

To expand on that; one strategy design here might have something like:

interface ISerializer {
    object Deserialize(Stream source, int bytes);
    void Serialize(Stream destination, object value);
}

You might build a dictionary (or just a list if the number is small) of such serializers per expected markers, and resolve your serializer, then invoke the Deserialize method. If you don't recognise the marker, then (one of):

  • skip the given number of bytes
  • throw an error
  • store the extra bytes in a buffer somewhere (allowing for round-trip of unexpected data)

As a side-note to the above - this approach (strategy) is useful if the system is determined at runtime, either via reflection or via a runtime DSL (etc). If the system is entirely predictable at compile-time (because it doesn't change, or because you are using code-generation), then a straight switch approach may be more appropriate - and you probably don't need any extra interfaces, since you can inject the appropriate code directly.

Marc Gravell
A: 

I would suggest:


fifo = Fifo.new

while(fd is readable) {
  read everything off the fd and stick it into fifo
  if (the front of the fifo is has a valid header and 
      the fifo is big enough for payload) {

      dispatch constructor, remove bytes from fifo
  }
}

With this method:

  • you can do some error checking for bad payloads, and potentially throw bad data away
  • data is not waiting on the fd's read buffer (can be an issue for large payloads)
Dave
a good suggestion for input streams that won't allow random access like sockets and such
Matt Joiner
+1  A: 

One key thing to remember, if you're reading from the stream and do not detect a valid header/message, throw away only the first byte before trying again. Many times I've seen a whole packet or message get thrown away instead, which can result in valid data being lost.

qid
a useful suggestion
Matt Joiner
A: 

If you'd like it to be nice OO, you can use the visitor pattern in an object hierarchy. How I've done it was like this (for identifying packets captured off the network, pretty much the same thing you might need):

  • huge object hierarchy, with one parent class

  • each class has a static contructor that registers with its parent, so the parent knows about its direct children (this was c++, I think this step is not needed in languages with good reflection support)

  • each class had a static constructor method that got the remaining part of the bytestream and based on that, it decided if it is his responsibility to handle that data or not

  • When a packet came in, I've simply passed it to static constructor method of the main parent class (called Packet), which in turn checked all of its children if it's their responsibility to handle that packet, and this went recursively, until one class at the bottom of the hierarchy returned the instantiated class back.

  • Each of the static "constructor" methods cut its own header from the bytestream and passed down only the payload to its children.

The upside of this approach is that you can add new types anywhere in the object hierarchy WITHOUT needing to see/change ANY other class. It worked remarkably nice and well for packets; it went like this:

  • Packet
  • EthernetPacket
  • IPPacket
  • UDPPacket, TCPPacket, ICMPPacket
  • ...

I hope you can see the idea.

Agoston Horvath