I want to come up with a binary format for passing data between application instances in the form of POFs (Plain Old Files ;)).

Prerequisites:

  1. should be cross-platform
  2. information to be persisted includes a single POJO & arbitrary byte[]s (files actually; the POJO stores their names in a String[])
  3. only sequential access is required
  4. should be a way to check data consistency
  5. should be small and fast
  6. should prevent an average user with archiver + notepad from modifying the data

Currently I'm using DeflaterOutputStream + OutputStreamWriter together with InflaterInputStream + InputStreamReader to save/restore objects serialized with XStream, one object per file. The readers/writers use UTF-8. Now I need to extend this to support the requirements above. My idea for the format:

{serialized to XML object}
{delimiter}
{String file name}{delimiter}{byte[] file data}
{delimiter}
{another String file name}{delimiter}{another byte[] file data}
...
{delimiter}
{delimiter}
{MD5 hash for the entire file}
  1. Does this look sane?
  2. What would you use for a delimiter and how would you determine it?
  3. The right way to calculate MD5 in this case?
  4. What would you suggest to read on the subject?

TIA.

+2  A: 

Would serialization of the model (if you are into MVC) not be another way? I'd prefer to use things in the language (or standard libraries) rather than roll my own if possible. The only issue I can see with that is that the file size may be larger than you want.

TofuBeer
Edited to add cross-platform.
alex
When you say "cross platform" do you mean cross language? Java Serialization is cross platform as long as you stick with Java.
TofuBeer
+1  A: 

You could use a zip (rar / 7z / tar.gz / ...) library. Many exist, most are well tested, and it'll likely save you some time.

Possibly not as much fun though.

Barend
No fun at all :)
alex
Java has its own format for compressed files called jar. ;)
Peter Lawrey
Yeah, it has zip too. By the way, if you want to see a horror movie, look at the way the 7z SDK is implemented :D
alex
+2  A: 

1) Does this look sane?

It looks fairly sane. However, if you are going to invent your own format rather than just using Java serialization then you should have a good reason. Do you have any good reasons (they do exist in some cases)? One of the standard reasons for using XStream is to make the result human readable, which a binary format immediately loses. Do you have a good reason for a binary format rather than a human readable one? See this question for why human readable is good (and bad).

Wouldn't it be easier just to put everything in a signed jar? There are already standard Java libraries and tools to do this, and you get compression and verification provided.

2) What would you use for a delimiter and how determine it?

Rather than a delimiter I'd explicitly store the length of each block before the block. It's just as easy, and prevents you having to escape the delimiter if it comes up on its own.
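As a rough illustration of the length-prefix idea, here is a minimal sketch. The class and method names (`LengthPrefixedBlocks`, `writeBlock`, `readBlock`) are made up for this example, and the 4-byte length field is an assumption that limits each block to about 2 GB.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.nio.charset.StandardCharsets;

// Hypothetical helper: every block is a 4-byte big-endian length followed by the payload.
class LengthPrefixedBlocks {

    // Writes the payload length first, then the payload itself.
    static void writeBlock(DataOutputStream out, byte[] payload) throws IOException {
        out.writeInt(payload.length);
        out.write(payload);
    }

    // Reads one block written by writeBlock; no delimiter scanning or escaping is needed.
    static byte[] readBlock(DataInputStream in) throws IOException {
        int length = in.readInt();
        if (length < 0) {
            throw new IOException("Negative block length: " + length);
        }
        byte[] payload = new byte[length];
        in.readFully(payload); // throws EOFException if the file is truncated
        return payload;
    }

    public static void main(String[] args) throws IOException {
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        try (DataOutputStream out = new DataOutputStream(buffer)) {
            writeBlock(out, "<pojo/>".getBytes(StandardCharsets.UTF_8));   // serialized object
            writeBlock(out, "notes.txt".getBytes(StandardCharsets.UTF_8)); // file name
            writeBlock(out, new byte[] {0, 1, 2, 3});                      // file data
        }
        try (DataInputStream in = new DataInputStream(
                new ByteArrayInputStream(buffer.toByteArray()))) {
            System.out.println(new String(readBlock(in), StandardCharsets.UTF_8));
            System.out.println(new String(readBlock(in), StandardCharsets.UTF_8));
            System.out.println(readBlock(in).length + " data bytes");
        }
    }
}
```

Because the reader always knows how many bytes to consume next, the file data can contain any byte sequence, and a block can be skipped without inspecting its contents.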

3) The right way to calculate MD5 in this case?

There is example code here which looks sensible.
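The linked example is not reproduced here. As one possible approach (an assumption, not necessarily what the linked code does), the hash can be computed over everything that precedes it while the file is being written, then appended as a fixed 16-byte trailer. The class name `Md5Trailer` and its methods are hypothetical.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.security.DigestOutputStream;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.Arrays;

// Hypothetical helper: file layout is [content][16-byte MD5 of content].
class Md5Trailer {
    static final int MD5_LENGTH = 16;

    // Hashes the content as it is written, then appends the digest (which is not hashed itself).
    static byte[] withTrailer(byte[] content) throws IOException, NoSuchAlgorithmException {
        MessageDigest md5 = MessageDigest.getInstance("MD5");
        ByteArrayOutputStream file = new ByteArrayOutputStream();
        DigestOutputStream hashing = new DigestOutputStream(file, md5);
        hashing.write(content);   // updates the digest and writes through to 'file'
        hashing.flush();
        file.write(md5.digest()); // bypasses the hashing stream on purpose
        return file.toByteArray();
    }

    // Recomputes the MD5 of everything before the trailer and compares it with the stored value.
    static boolean verify(byte[] fileBytes) throws NoSuchAlgorithmException {
        if (fileBytes.length < MD5_LENGTH) {
            return false;
        }
        int contentLength = fileBytes.length - MD5_LENGTH;
        MessageDigest md5 = MessageDigest.getInstance("MD5");
        md5.update(fileBytes, 0, contentLength);
        byte[] stored = Arrays.copyOfRange(fileBytes, contentLength, fileBytes.length);
        return MessageDigest.isEqual(md5.digest(), stored);
    }
}
```

In a real writer the `DigestOutputStream` would wrap the file stream directly, so the hash is computed in a single sequential pass without buffering the whole file.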

4) What would you suggest to read on the subject?

On the subject of serialization? I'd read about Java serialization, JSON, and XStream serialization so I understood the pros and cons of each, especially the benefits of human-readable files. I'd also look at a classic file format, for example one from Microsoft, to understand the design decisions made back in the days when every byte mattered, and how these formats have been extended. For example: the WAV file format.

Nick Fortescue
1) The reasoning: a) it should be cross-language (hence inflation + XML); it doesn't really matter whether it's human-readable or not, size does matter though; b) jar signing won't work as I can't use an external tool
alex
c) there should be a cheap (in terms of memory/CPU cycles) way to remove the byte[]s, leaving the XML intact
alex
You can sign programmatically (though not trivially): http://www.onjava.com/pub/a/onjava/2001/04/12/signing_jar.html
Nick Fortescue
A: 

Bencode could be the way to go.

Here's an excellent implementation by Daniel Spiewak.

Unfortunately, the bencode spec doesn't support UTF-8, which is a showstopper for me.

Might come to this later but currently xml seems like a better choice (with blobs serialized as a Map).

alex
A: 

Perhaps you could explain how this is better than using an existing file format such as JAR.

Most standard file formats of this type just use a CRC, as it's faster to calculate. MD5 is more appropriate if you want to prevent deliberate modification.
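For comparison, a CRC is available in the standard library through `java.util.zip.CRC32`. The wrapper class `CrcCheck` below is only an illustrative name; the sketch shows the checksum over an in-memory buffer.

```java
import java.nio.charset.StandardCharsets;
import java.util.zip.CRC32;

// Hypothetical helper showing a CRC-32 over a byte array.
class CrcCheck {

    // Returns the CRC-32 of the data as an unsigned 32-bit value held in a long.
    static long crcOf(byte[] data) {
        CRC32 crc = new CRC32();
        crc.update(data, 0, data.length);
        return crc.getValue();
    }

    public static void main(String[] args) {
        byte[] data = "123456789".getBytes(StandardCharsets.US_ASCII);
        // 0xCBF43926 is the standard CRC-32 check value for the ASCII string "123456789".
        System.out.println(Long.toHexString(crcOf(data))); // prints "cbf43926"
    }
}
```

For streamed output, `java.util.zip.CheckedOutputStream` can maintain the same checksum while the data is being written.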

Peter Lawrey
Sure: 1. Can be easily modified (edited the first post on this requirement). 2. Slightly worse compression. 3. Irrational personal preference :)
alex
+1  A: 

It looks INsane.

  • why invent a new file format?
  • why try to prevent only stupid users from changing the file?
  • why use a binary format (hard to compress)?
  • why use a format that cannot be parsed while being received? (The receiver has to receive the entire file before being able to act on it.)
  • XML is already a serialization format that is compressible. So you are serializing a serialized format.
Pat
After some fussing and fighting I have to agree. Gone xml.
alex
+1  A: 

Let's see, this should be pretty straightforward.

Prerequisites:

0. should be cross-platform

1. information to be persisted includes a single POJO & arbitrary byte[]s (files actually; the POJO stores their names in a String[])

2. only sequential access is required

3. should be a way to check data consistency

4. should be small and fast

5. should prevent an average user with archiver + notepad from modifying the data

Well, guess what: you pretty much have it already. It's built into the platform: Object Serialization

If you need to reduce the amount of data sent over the wire and provide custom serialization (for instance, you can send only 1, 2, 3 for a given object, without the attribute names or anything similar, and read the values back in the same sequence), you can use this somewhat "hidden feature"
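The feature referred to is not named in the text. One standard mechanism that matches the description is `java.io.Externalizable`, so the sketch below assumes that is roughly what is meant. The class `CompactPoint` and its fields are invented for illustration.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.Externalizable;
import java.io.IOException;
import java.io.ObjectInput;
import java.io.ObjectInputStream;
import java.io.ObjectOutput;
import java.io.ObjectOutputStream;

// Hypothetical class: only the three int values are written, in a fixed order.
class CompactPoint implements Externalizable {
    private int x, y, z;

    // Externalizable requires a public no-argument constructor.
    public CompactPoint() {}

    CompactPoint(int x, int y, int z) {
        this.x = x;
        this.y = y;
        this.z = z;
    }

    @Override
    public void writeExternal(ObjectOutput out) throws IOException {
        out.writeInt(x);
        out.writeInt(y);
        out.writeInt(z);
    }

    @Override
    public void readExternal(ObjectInput in) throws IOException {
        // Values must be read back in exactly the order they were written.
        x = in.readInt();
        y = in.readInt();
        z = in.readInt();
    }

    @Override
    public String toString() {
        return x + "," + y + "," + z;
    }

    static byte[] toBytes(CompactPoint point) throws IOException {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (ObjectOutputStream out = new ObjectOutputStream(bytes)) {
            out.writeObject(point);
        }
        return bytes.toByteArray();
    }

    static CompactPoint fromBytes(byte[] bytes) throws IOException, ClassNotFoundException {
        try (ObjectInputStream in = new ObjectInputStream(new ByteArrayInputStream(bytes))) {
            return (CompactPoint) in.readObject();
        }
    }
}
```

The stream still carries the class name and a small header, but the per-field metadata of default serialization is replaced by whatever `writeExternal` emits.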

If you really need it as plain text, you can also encode it, for example with Base64, at the cost of roughly a third more bytes.

For instance this bean:

import java.io.*;
public class SimpleBean implements Serializable  { 
    private String website = "http://stackoverflow.com";
    public String toString() { 
        return website;
    }
}

Could be represented like this:

rO0ABXNyAApTaW1wbGVCZWFuPB4W2ZRCqRICAAFMAAd3ZWJzaXRldAASTGphdmEvbGFuZy9TdHJpbmc7eHB0ABhodHRwOi8vc3RhY2tvdmVyZmxvdy5jb20=

See this answer
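A string like the one above can be produced by serializing the object to a byte array and Base64-encoding the result. The following is a minimal sketch; the helper class `Base64Serialization` is an invented name, and `java.util.Base64` assumes Java 8 or later.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.io.Serializable;
import java.util.Base64;

// Hypothetical helper: Java serialization wrapped in Base64 text.
class Base64Serialization {

    // Serializes the object with ObjectOutputStream and returns the bytes as Base64 text.
    static String encode(Serializable object) throws IOException {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (ObjectOutputStream out = new ObjectOutputStream(bytes)) {
            out.writeObject(object);
        }
        return Base64.getEncoder().encodeToString(bytes.toByteArray());
    }

    // Reverses encode(): Base64 text back to bytes, then back to an object.
    static Object decode(String text) throws IOException, ClassNotFoundException {
        byte[] bytes = Base64.getDecoder().decode(text);
        try (ObjectInputStream in = new ObjectInputStream(new ByteArrayInputStream(bytes))) {
            return in.readObject();
        }
    }

    public static void main(String[] args) throws Exception {
        String text = encode("http://stackoverflow.com");
        System.out.println(text);
        System.out.println(decode(text));
    }
}
```

Every Java serialization stream begins with the bytes AC ED 00 05, which is why such Base64 strings always start with `rO0AB`.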

Additionally, if you need a sound protocol, you can also check out Protobuf (Protocol Buffers), Google's data interchange format.

OscarRyz
+1  A: 

I agree that it doesn't really sound like you need a new format, or a binary one. If you truly want a binary format, why not consider one of these first:

  • Binary XML (Fast Infoset, Bnux)
  • Hessian
  • Google Protocol Buffers

But besides that, many textual formats should work just fine (or perhaps better) too: they are easier to debug, have extensive tool support, and compress to about the same size as binary (binary compresses poorly, and information theory suggests that for the same effective information the same compressed size is achieved; this has been true in my testing).

So perhaps also consider:

So it kind of sounds like you just want to build something of your own. Nothing wrong with that, as a hobby, but if so you need to consider it as such. It likely is not a requirement for the system you are building.

StaxMan