I'm looking into ways of formally specifying the format of various binary streams and using a tool to check streams for compliance with the specification. Something like XSD plus any of the validation tools for XML. Or like an extremely complicated grep expression working on the binary level (preferably not; that would be really hard to read).

Does anybody know of a specification/tool that would be useful?

[Rationale: We receive many third-party-generated binary files on a daily basis, and they are often produced by bad tools that emit invalid files. We want to give those third parties a tool they can use as a validator, and we don't want to write a specific tool for each format.]

A: 

I think a good example is the specification of Java's .class files: http://java.sun.com/docs/books/jvms/second_edition/html/ClassFile.doc.html
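
To give a feel for what checking against that kind of specification looks like, here is a minimal Python sketch that validates only the first few fixed fields of a class file (magic, minor_version, major_version, constant_pool_count). The function name `check_class_header` is made up for illustration, and a real validator would have to go on and walk the constant pool and everything after it.

```python
import struct

CLASS_MAGIC = 0xCAFEBABE  # magic number required at the start of every class file


def check_class_header(data: bytes) -> list:
    """Return a list of problems found in the first 10 bytes of a class file."""
    if len(data) < 10:
        return ["file too short for a class file header"]
    # Big-endian: u4 magic, u2 minor_version, u2 major_version, u2 constant_pool_count
    magic, minor, major, cp_count = struct.unpack(">IHHH", data[:10])
    problems = []
    if magic != CLASS_MAGIC:
        problems.append(f"bad magic 0x{magic:08X}")
    # constant_pool_count is the number of pool entries plus one, so 0 is invalid
    if cp_count == 0:
        problems.append("constant_pool_count must be at least 1")
    return problems
```

An empty list means the header looks plausible; it says nothing about the rest of the file.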

Itay
A: 

Abstract Syntax Notation One: ASN.1. See also the NCBI Toolbox: http://www.ncbi.nlm.nih.gov/Sitemap/Summary/asn1.html

Pierre
Interesting, but maybe a little too heavyweight for us. Plus I always hated the specs that are not freely available.
gabr
+1  A: 

This is an interesting question, but I would be very surprised if such a specification language exists. This is because the meta-structure possibilities of binary files are effectively infinite. Compare this with XML, where the meta-structure (tags contain other tags, an attribute name can appear only once per element, etc.) is strictly specified. And even with that structure, writing schemas for XML is hard! The only way I can see of dealing with the infinite possibilities of binary file formats is to use something that itself allows infinite variability: a Turing-complete programming language.

This is of course not to say that for your specific problem domain a useful specification language and a processor for it could not be produced. I just think you'll have a hard time finding a pre-built one. I hope answers here prove me wrong!
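
As a rough illustration of why I think you end up writing code, here is a minimal Python sketch of a validator for a made-up format (nothing here comes from the question): a 5-byte header with magic `BS`, a version byte and a record count, then length-prefixed records, then a one-byte checksum equal to the sum of all preceding bytes modulo 256. The record count, the per-record lengths and the checksum all depend on data read earlier, which is exactly the kind of rule that is awkward to state in a purely declarative schema.

```python
import struct


def validate_stream(data: bytes) -> list:
    """Validate a hypothetical format: header, length-prefixed records, checksum."""
    if len(data) < 6:
        return ["stream shorter than the 5-byte header plus checksum"]
    errors = []
    # Header: 2-byte magic, 1-byte version, 2-byte big-endian record count
    magic, version, count = struct.unpack(">2sBH", data[:5])
    if magic != b"BS":
        errors.append("bad magic")
    if version != 1:
        errors.append(f"unsupported version {version}")
    body_end = len(data) - 1  # the last byte is the checksum
    pos = 5
    for index in range(count):
        if pos + 3 > body_end:
            errors.append(f"record {index}: truncated record header")
            return errors
        # Record header: 1-byte type, 2-byte big-endian payload length
        _rtype, length = struct.unpack(">BH", data[pos:pos + 3])
        pos += 3
        if pos + length > body_end:
            errors.append(f"record {index}: payload runs past end of stream")
            return errors
        pos += length
    if pos != body_end:
        errors.append("trailing bytes after last record")
    if sum(data[:-1]) % 256 != data[-1]:
        errors.append("checksum mismatch")
    return errors
```

The function returns an empty list for a compliant stream and a list of human-readable problems otherwise.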

anon
Yeah, your thoughts exactly echo mine. But I promised my boss to research the problem, so ...
gabr
+3  A: 

Give Preon a try:

  • annotation driven
  • conditional parts
  • expression language

Each annotated class is a Codec description that is capable of generating both an Encoder and a Decoder.
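
To illustrate the general idea of one description yielding both an encoder and a decoder, here is a minimal sketch in Python. This is not Preon's API (Preon is a Java library driven by annotations); the `Codec` class and the `PACKET_FIELDS` layout are hypothetical, and the sketch covers only fixed-size fields, with no conditional parts or expressions.

```python
import struct

# Hypothetical declarative description: (field name, struct format code).
PACKET_FIELDS = [("kind", "B"), ("flags", "B"), ("length", "H"), ("sequence", "I")]


class Codec:
    """Builds both an encoder and a decoder from a single field description."""

    def __init__(self, fields, byte_order=">"):
        self.names = [name for name, _ in fields]
        # One compiled struct layout is shared by encode() and decode()
        self._struct = struct.Struct(byte_order + "".join(code for _, code in fields))

    def encode(self, values: dict) -> bytes:
        return self._struct.pack(*(values[name] for name in self.names))

    def decode(self, data: bytes) -> dict:
        unpacked = self._struct.unpack(data[:self._struct.size])
        return dict(zip(self.names, unpacked))


codec = Codec(PACKET_FIELDS)
raw = codec.encode({"kind": 1, "flags": 0x80, "length": 16, "sequence": 7})
print(raw.hex())          # big-endian bytes for the four fields
print(codec.decode(raw))  # round-trips back to the original values
```

Because the layout lives in one place, the encoder and decoder cannot drift apart, which is the main appeal of this style.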

dfa
Thanks, but as we're not using Java, this seems to be of little use. If I'm reading it correctly, I'd have to reimplement each protocol as a set of Java classes and then use Preon to generate a decoder, which I could then use for testing. I'm looking for something more formal and closer to the binary level (i.e. I'd like to start from the bits up, not vice versa). That way, I could use the same tool to test our stream generators.
gabr
Also check out the Google Protocol Buffers DSL: http://code.google.com/apis/protocolbuffers/docs/overview.html
dfa
This could potentially be useful. Please add it as a separate answer so I can vote it up :)
gabr
+1  A: 

Also check out Google Protocol Buffers:

  • Java/Python/C++ APIs
  • nice DSL
dfa
+3  A: 

If you think the documentation of Java's .class files is a good example of a specification, reconsider looking at Preon. Preon captures it entirely and generates documentation like this.

There are actually a couple of other initiatives for capturing the 'syntax' of binary encoded files. ASN.1 is useful, but it doesn't give you a lot of mileage if you intend to capture, say, Java class files. The same holds for BSDL, Flavor, BFlavor and a couple of other initiatives. The problem is that there are a million ways to encode binary data and lots of binary compression techniques, and I think that means there will never be something that captures it entirely, unless the language itself is extensible.

Google Protocol Buffers basically has the same problem. It defines something like CORBA's CDR, and it's good as long as you don't need something more advanced. Google Protocol Buffers is not going to allow you to capture Java's class file format.

Wilfred Springer