views: 124
answers: 3
For various reasons I have a custom serialization where I am dumping some fairly simple objects to a data file. There are maybe 5-10 classes, and the object graphs that result are acyclic and pretty simple (each serialized object has 1 or 2 references to another that are serialized). For example:

class Foo
{
    final private long id;
    public Foo(long id, /* other stuff */) { ... }
}

class Bar
{
    final private long id;
    final private Foo foo;
    public Bar(long id, Foo foo, /* other stuff */) { ... }
}

class Baz
{
    final private long id;
    final private List<Bar> barList;
    public Baz(long id, List<Bar> barList, /* other stuff */) { ... }
}

The id field exists only for serialization: when writing to a file, I keep a record of which IDs have been serialized so far; for each object, I check whether its child objects have been serialized and write any that haven't; and finally I write the object itself, i.e. its data fields plus the IDs corresponding to its child objects.
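To make that concrete, here is a minimal sketch of that bookkeeping (the GraphWriter name and the getId()/getFoo() accessors are assumed here for illustration; they are not declared in the classes above):

import java.io.DataOutput;
import java.io.IOException;
import java.util.HashSet;
import java.util.Set;

class GraphWriter {
    // ids already written to the current file
    private final Set<Long> writtenIds = new HashSet<Long>();

    void write(Bar bar, DataOutput out) throws IOException {
        write(bar.getFoo(), out);                // children go out first
        if (writtenIds.add(bar.getId())) {       // add() returns false if already written
            out.writeLong(bar.getId());          // then the object's own record:
            out.writeLong(bar.getFoo().getId()); // data fields plus the child's id
        }
    }

    void write(Foo foo, DataOutput out) throws IOException {
        if (writtenIds.add(foo.getId())) {
            out.writeLong(foo.getId());
            // ... foo's other data fields ...
        }
    }
}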

What's puzzling me is how to assign IDs. I thought about it, and it seems like there are three cases for assigning an ID:

  • dynamically-created objects -- id is assigned from a counter that increments
  • reading objects from disk -- id is assigned from the number stored in the disk file
  • singleton objects -- the object is created before any dynamically-created object, to represent a singleton that is always present.

How can I handle these properly? I feel like I'm reinventing the wheel and there must be a well-established technique for handling all the cases.


clarification: just as some tangential information, the file format I am looking at is approximately the following (glossing over a few details which should not be relevant). It's optimized to handle a fairly large amount of dense binary data (tens/hundreds of MB) with the ability to intersperse structured data in it. The dense binary data makes up 99.9% of the file size.

The file consists of a series of error-corrected blocks which serve as containers. Each block can be thought of as containing a byte array which consists of a series of packets. It is possible to read the packets one at a time in succession (i.e. it's possible to tell where the end of each packet is, and the next one starts immediately afterwards).

So the file can be thought of as a series of packets stored on top of an error-correcting layer. The vast majority of these packets are opaque binary data that has nothing to do with this question. A small minority of these packets, however, are items containing serialized structured data, forming a sort of "archipelago" consisting of data "islands" which may be linked by object reference relationships.

So I might have a file where packet 2971 contains a serialized Foo, and packet 12083 contains a serialized Bar that refers to the Foo in packet 2971. (with packets 0-2970 and 2972-12082 being opaque data packets)

All these packets are immutable (and therefore, given the constraints of Java object construction, they form an acyclic object graph), so I don't have to deal with mutability issues. They all implement a common Item interface. What I would like to do is write an arbitrary Item object to the file. If the Item contains references to other Items, I need to write those to the file too, but only if they haven't already been written. Otherwise I will have duplicates that I will need to somehow coalesce when I read them back.
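On the read side, one way to do that coalescing would be to intern items by id as they are decoded. A rough sketch (the ItemInterner name and the Item.getId() method are assumptions, not existing code):

import java.util.HashMap;
import java.util.Map;

class ItemInterner {
    private final Map<Long, Item> byId = new HashMap<Long, Item>();

    // Returns the canonical instance for this id; if an equivalent item was
    // already read from an earlier packet, the newly read duplicate is dropped.
    Item intern(Item item) {
        Item existing = byId.get(item.getId());
        if (existing != null) return existing;
        byId.put(item.getId(), item);
        return item;
    }
}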

+1  A: 

Are the foos registered with a FooRegistry? You could try this approach (assume Bar and Baz also have registries to acquire the references via the id).

This probably has lots of syntax errors, usage errors, etc. But I feel the approach is a good one.

public class Foo {

    private final long id;

    // dynamically-created object: the registry hands out a fresh id
    public Foo(/* other stuff */) {
        // construct
        this.id = FooRegistry.register(this);
    }

    // object read from disk: the id comes from the file
    public Foo(long id /* , other stuff */) {
        // construct
        this.id = id;
        FooRegistry.register(this, id);
    }

    public long getId() { return id; }
}

public class FooRegistry {

    private static final Map<Long, Foo> foos = new HashMap<Long, Foo>();
    private static long currentFooCount = 0;

    // assign the next unused id
    static long register(Foo foo) {
        while (foos.get(currentFooCount) != null) currentFooCount++;
        foos.put(currentFooCount, foo);
        return currentFooCount;
    }

    // register under an id that was read from the file
    static void register(Foo foo, long id) {
        if (foos.get(id) != null) throw new IllegalStateException("id already in use"); // invalid
        foos.put(id, foo);
    }
}

public class Bar {

    // id and foo fields as in the question
    void writeToStream(PrintWriter out) {
        out.print("<BAR><id>" + id + "</id><foo>" + foo.getId() + "</foo></BAR>");
    }
}

public class Baz {

    // id and barList fields as in the question
    void writeToStream(PrintWriter out) {
        out.print("<BAZ><id>" + id + "</id>");
        for (Bar bar : barList) out.print("<bar>" + bar.getId() + "</bar>");
        out.print("</BAZ>");
    }
}

glowcoder
+2  A: 

Do you really need to do this? Internally, the ObjectOutputStream tracks which objects have been serialized already. Subsequent writes of the same object only store an internal reference (similar to writing out just the id) rather than writing out the whole object again.

See Serialization Cache for more details.
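For example, a small demo of this behavior (assuming Foo implements Serializable; the HandleDemo class is just for illustration):

import java.io.*;

public class HandleDemo {
    public static void main(String[] args) throws Exception {
        ByteArrayOutputStream buf = new ByteArrayOutputStream();
        ObjectOutputStream oos = new ObjectOutputStream(buf);
        Foo foo = new Foo(42L /* , other stuff */);
        oos.writeObject(foo);   // full object data is written
        oos.writeObject(foo);   // only a back-reference (handle) is written
        oos.close();

        ObjectInputStream ois = new ObjectInputStream(
                new ByteArrayInputStream(buf.toByteArray()));
        Foo a = (Foo) ois.readObject();
        Foo b = (Foo) ois.readObject();
        System.out.println(a == b);  // true -- the same instance comes back twice
        ois.close();
    }
}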

If the IDs correspond to some externally defined identity, such as an entity ID, then keeping them makes sense. But the question states that the IDs are generated purely to track which objects have been serialized.

You can handle singletons via the readResolve method. A simple approach is to compare the freshly deserialized instance with your singleton instances, and if there is a match, return the singleton instance rather than the deserialized instance. E.g.

   private Object readResolve() {
      return (this.equals(SINGLETON)) ? SINGLETON : this;
      // or simply
      // return SINGLETON;
   }

EDIT: In response to the comments, the stream is mostly binary data (stored in an optimized format) with complex objects interspersed in that data. This can be handled by using a stream format that supports substreams, e.g. zip, or simple block chunking. E.g. the stream can be a sequence of blocks:

offset 0   - block type
offset 4   - block length N
offset 8   - N bytes of data
...
offset N+8 - start of next block

You can then have blocks for binary data, blocks for serialized data, blocks for XStream-serialized data, etc. Since each block knows its size, you can create a substream to read up to that length from that point in the file. This allows you to freely mix data without parsing concerns.

To implement a substream, have your main stream parse the blocks, e.g.

   DataInputStream main = new DataInputStream(input);
   int blockType = main.readInt();
   int blockLength = main.readInt();
   // next N bytes are the data
   LimitInputStream data = new LimitInputStream(main, blockLength);

   if (blockType==BINARY) {
      handleBinaryBlock(new DataInputStream(data));
   }
   else if (blockType==OBJECTSTREAM) {
      deserialize(new ObjectInputStream(data));
   }
   else
      ...

A sketch of LimitInputStream looks like this:

import java.io.EOFException;
import java.io.FilterInputStream;
import java.io.IOException;
import java.io.InputStream;

public class LimitInputStream extends FilterInputStream
{
   private int bytesRead;
   private final int limit;

   /** Reads up to limit bytes from in */
   public LimitInputStream(InputStream in, int limit) {
      super(in);
      this.limit = limit;
   }

   public int read(byte[] data, int offs, int len) throws IOException {
      if (len==0) return 0; // read() contract mandates this
      if (bytesRead==limit)
         return -1;
      int toRead = Math.min(limit-bytesRead, len);
      int actuallyRead = super.read(data, offs, toRead);
      if (actuallyRead==-1)
          throw new EOFException("stream ended before the block was fully read");
      bytesRead += actuallyRead;
      return actuallyRead;
   }

   // similarly for the other read() methods

   // don't propagate close() to the underlying stream
   public void close() { }
}
mdma
+1 for making the point.... Do I really need to do this? I'd love to use some facility built into the JRE, but there are so many differences between ObjectOutputStream and what I'm doing that I don't know how to link the two together. My serialization is closer to XML serialization.
Jason S
Have you tried XStream - http://xstream.codehaus.org. It's serialization but based on XML. Very pluggable. It also uses a serialization cache - references to already serialized objects are written out as references in XML, either referring to an automatically generated id, or using XPath to refer to the original element that defined the object. Well worth a look.
mdma
I actually did take a look a few minutes before posting a comment. My problem in this particular case, is that I need to intersperse a few complex objects among a large set of binary-encoded raw data bytes that need to be stored in an optimized way since they use 99.9% of the file's space and I'm expecting files in the 10-100MB range. So I can't use XML... all I have are a bunch of disconnected islands among a larger data stream.
Jason S
XStream allows you to completely replace the actual file format, so you could use FastInfoset or some other binary standard. I'm assuming that your file format allows you to get hold of the data islands, and treat this as "substreams" of the main stream. Then you could store whatever you want in there, XML, FastInfoSet, protocol buffers etc. Just because the rest of your file is optimized binary, doesn't mean that all of it has to be. You can use chunking to split the data islands from the remainder of the stream. I'll elaborate more in my answer.
mdma
dumb question... how do you implement a substream?
Jason S
(e.g. each block does know its own length, I'm doing that)
Jason S
Not a dumb question - I've updated my answer.
mdma
OK, I see what you're getting at. Maybe "islands" was a bad term; really what I have is a data "archipelago". I will update my question to clarify.
Jason S
Accepted... I ended up keeping the data as "islands" and using Google gson to encode each one in a JSON notation. I have the possibility of duplicating some of the objects in the data file, but they're such a small part of the file size that it doesn't matter for file size, and if I care about object graph equivalence, I can coalesce multiple copies of equivalent objects upon reading them out from the file.
Jason S
This sounds good. I was going to propose extending ObjectOutputStream so that it writes the data packets that belong to an object after streaming the object. This will then preserve the object graph, with no duplicates, while allowing each object to write out the data that belongs to it.
mdma
A: 

I feel like I'm reinventing the wheel and there must be a well-established technique for handling all the cases.

Yes, it looks like default object serialization would do; otherwise you're likely optimizing prematurely.

You can change the format of the serialized data (as XMLEncoder does) to a more convenient one.
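For instance, a minimal sketch with XMLEncoder (note that it works on JavaBeans, so it assumes a no-arg constructor and getter/setter properties rather than the final fields shown in the question; the XmlDump name is just illustrative):

import java.beans.XMLEncoder;
import java.io.BufferedOutputStream;
import java.io.FileNotFoundException;
import java.io.FileOutputStream;

public class XmlDump {
    public static void dump(Object bean, String fileName) throws FileNotFoundException {
        XMLEncoder enc = new XMLEncoder(
                new BufferedOutputStream(new FileOutputStream(fileName)));
        enc.writeObject(bean);   // bean properties are written out as XML
        enc.close();
    }
}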

But if you insist, I think a singleton with a dynamic counter should do; just don't put the id in the public interface of the constructor:

class Foo {
    private final int id;
    public Foo( int id, /*other*/ ) { // drop the int id
    }
 }

So the class could act as a "sequence", and a long would probably be more appropriate to avoid problems with Integer.MAX_VALUE.

Using an AtomicLong from the java.util.concurrent.atomic package (to avoid having two threads assign the same id, and to avoid excessive synchronization) would help too.

import java.util.concurrent.atomic.AtomicLong;

class Sequencer {
    private static final AtomicLong sequenceNumber = new AtomicLong(0);
    public static long next() {
         return sequenceNumber.getAndIncrement();
    }
}

Now in each class you have

 class Foo {
      private final long id;
      public Foo( String name, String data, etc ) {
          this.id = Sequencer.next();
      }
 }

And that's it.

(Note: I don't remember whether deserializing the object invokes the constructor, but you get the idea.)

OscarRyz
??? this is confusing... you have Sequencer as a class with non-static methods, but you are invoking Sequencer.next() as though next is a static method. Also, I appreciate the help but I know how to do what you are saying to instantiate a counter; my question is more along the lines of how to manage *either* a counter-based assignment *or* read-back from the file *or* a static singleton. I can't use just one approach for constructors
Jason S
my bad I updated with the `static` for the sequencer...
OscarRyz