This question relates to rcreswick's question on Serializing Jena OntModel Changes. I have Jena models on two (or more) machines that need to remain synchronized over sockets. The main issue that I need to address is that the models may contain anonymous nodes (bnodes), which can originate in any of the models.
Question: Am I on the right track here, or is there a better, more robust approach that I'm failing to consider?
I can think of 4 approaches to this problem:
- Serialize the complete model: This is prohibitively expensive for synchronizing small updates. Also, since changes can occur on either machine, I can't just replace machine B's model with the serialized model from machine A. I need to merge them.
- Serialize a partial model: Use a dedicated model for serialization that only contains the changes that need to be sent over the socket. This approach requires special vocabulary to represent statements that were removed from the model. Presumably, when I serialize the model from machine A to machine B, anonymous node IDs will be unique within machine A but may overlap with IDs for anonymous nodes created on machine B. Therefore, I'll have to rename anonymous nodes and keep a mapping from machine A's anonymous node IDs to machine B's IDs in order to handle future changes correctly.
- Serialize individual statements: This approach requires no special vocabulary, but may not be as robust. Are there issues other than anonymous nodes that I just haven't encountered yet?
- Generate globally unique bnode ids (NEW): We can generate globally unique IDs for anonymous nodes by prefixing the ID with a unique machine ID. Unfortunately, I haven't figured out how to tell Jena to use my ID generator instead of its own. This would allow us to serialize individual statements without remapping bnode IDs.
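To make the last approach concrete, here is a minimal sketch of the ID scheme itself in plain Java. It does not touch Jena at all, and it does not answer how to make Jena use such a generator. The class name `BNodeIdGenerator`, the `-` separator, and the machine IDs are hypothetical, chosen only for illustration.

```java
import java.util.concurrent.atomic.AtomicLong;

// Hypothetical helper: produces bnode IDs of the form "<machineId>-<counter>".
// Assumes every machine is configured with a distinct machineId.
final class BNodeIdGenerator {
    private final String machineId;
    private final AtomicLong counter = new AtomicLong();

    BNodeIdGenerator(String machineId) {
        this.machineId = machineId;
    }

    // Thread-safe: each call returns an ID not previously returned by this instance.
    String nextId() {
        return machineId + "-" + counter.getAndIncrement();
    }

    public static void main(String[] args) {
        BNodeIdGenerator a = new BNodeIdGenerator("machineA");
        BNodeIdGenerator b = new BNodeIdGenerator("machineB");
        System.out.println(a.nextId()); // machineA-0
        System.out.println(a.nextId()); // machineA-1
        System.out.println(b.nextId()); // machineB-0
    }
}
```

Note that the counter restarts at zero whenever the process restarts, so this sketch only gives unique IDs within a single run. Persisting the counter, or using a fresh random value such as a UUID as the machine ID on each start, would be one way around that.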
Here's an example to ground this discussion a bit more. Suppose I have a list on machine A represented as:
_:a rdf:first myns:tom
_:a rdf:rest rdf:nil
I serialize this model from machine A to machine B. Now, because machine B may already have an (unrelated) anonymous node with id 'a', I remap id 'a' to a new id 'b':
_:b rdf:first myns:tom
_:b rdf:rest rdf:nil
Now the list changes on machine A:
_:a rdf:first myns:tom
_:a rdf:rest _:b
_:b rdf:first myns:dick
_:b rdf:rest rdf:nil
Since machine B has never encountered machine A's id 'b' before, it adds a new mapping from machine A's id 'b' to a new id 'c':
_:b rdf:first myns:tom
_:b rdf:rest _:c
_:c rdf:first myns:dick
_:c rdf:rest rdf:nil
The problem is further complicated with more than two machines. If there is a third machine C, for example, it may have its own anonymous node 'a' that is different from machine A's anonymous node 'a'. Thus, machine B really does need to keep a separate map for each of the other machines, from that machine's anonymous node IDs to its local IDs, not just a single map from remote IDs in general to local IDs. When processing incoming changes, it must take into account where the changes came from to map the IDs correctly.
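The per-machine bookkeeping described above could look roughly like the following sketch, written in plain Java without any Jena types. The class `BNodeMapper`, the method `toLocal`, and the `localN` naming are all hypothetical. In a real implementation the local ID would presumably be a freshly created anonymous node in the local model rather than a generated string.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical helper kept by the receiving machine (machine B in the example).
// Maps (source machine, remote bnode ID) to a local bnode ID.
final class BNodeMapper {
    // Outer key: source machine ID. Inner map: remote bnode ID -> local bnode ID.
    private final Map<String, Map<String, String>> remoteToLocal = new HashMap<>();
    private int nextLocal = 0;

    // Returns the existing local ID for this (source, remote ID) pair,
    // or allocates a new local ID the first time the pair is seen.
    String toLocal(String sourceMachine, String remoteId) {
        return remoteToLocal
                .computeIfAbsent(sourceMachine, k -> new HashMap<>())
                .computeIfAbsent(remoteId, k -> "local" + nextLocal++);
    }

    public static void main(String[] args) {
        BNodeMapper mapper = new BNodeMapper();
        System.out.println(mapper.toLocal("A", "a")); // local0
        System.out.println(mapper.toLocal("A", "b")); // local1
        System.out.println(mapper.toLocal("C", "a")); // local2 (not the same node as A's 'a')
        System.out.println(mapper.toLocal("A", "a")); // local0 (mapping is remembered)
    }
}
```

This sketch is not thread-safe and never forgets a mapping, so the maps grow for as long as the machines stay connected.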