This question relates to rcreswick's question on Serializing Jena OntModel Changes. I have Jena models on two (or more) machines that need to remain synchronized over sockets. The main issue that I need to address is that the models may contain anonymous nodes (bnodes), which can originate in any of the models.

Question: Am I on the right track here, or is there a better, more robust approach that I'm failing to consider?

I can think of four approaches to this problem:

  1. Serialize the complete model: This is prohibitively expensive for synchronizing small updates. Also, since changes can occur on either machine, I can't just replace machine B's model with the serialized model from machine A. I need to merge them.
  2. Serialize a partial model: Use a dedicated model for serialization that contains only the changes that need to be sent over the socket. This approach requires special vocabulary to represent statements that were removed from the model. Presumably, when I serialize the model from machine A to machine B, anonymous node IDs will be unique to machine A but may overlap with IDs for anonymous nodes created on machine B. Therefore, I'll have to rename anonymous nodes and keep a mapping from machine A's anonymous node IDs to machine B's IDs in order to handle future changes correctly.
  3. Serialize individual statements: This approach requires no special vocabulary, but may not be as robust. Are there issues other than anonymous nodes that I just haven't encountered yet?
  4. Generate globally unique bnode ids (NEW): We can generate globally unique IDs for anonymous nodes by prefixing the ID with a unique machine ID. Unfortunately, I haven't figured out how to tell Jena to use my ID generator instead of its own. This would allow us to serialize individual statements without remapping bnode IDs.
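
As a rough illustration of approach 4, the sketch below mints labels by prefixing a per-machine counter with a machine ID. The class name `BnodeIdGenerator` and the label format are hypothetical, and it assumes every machine is configured with a distinct ID. How to make Jena use such labels is the open part of the question and is not shown here.

```java
import java.util.concurrent.atomic.AtomicLong;

// Hypothetical helper (not part of Jena): mints blank-node labels that are
// unique across machines, assuming each machine has a distinct machineId.
final class BnodeIdGenerator {
    private final String machineId;
    private final AtomicLong counter = new AtomicLong();

    BnodeIdGenerator(String machineId) {
        this.machineId = machineId;
    }

    // Returns labels such as "machineA-0", "machineA-1", ...
    String next() {
        return machineId + "-" + counter.getAndIncrement();
    }
}
```

Note that the counter restarts at zero when the process restarts, so a real implementation would need to persist it or mix in something like a UUID.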

Here's an example to ground this discussion a bit more. Suppose I have a list on machine A represented as:


    _:a rdf:first myns:tom
    _:a rdf:rest rdf:nil

I serialize this model from machine A to machine B. Now, because machine B may already have an (unrelated) anonymous node with id 'a', I remap id 'a' to a new id 'b':


    _:b rdf:first myns:tom
    _:b rdf:rest rdf:nil

Now the list changes on machine A:


    _:a rdf:first myns:tom
    _:a rdf:rest _:b
    _:b rdf:first myns:dick
    _:b rdf:rest rdf:nil

Since machine B has never encountered machine A's id 'b' before, it adds a new mapping from machine A's id 'b' to a new id 'c':


    _:b rdf:first myns:tom
    _:b rdf:rest _:c
    _:c rdf:first myns:dick
    _:c rdf:rest rdf:nil

The problem is further complicated with more than two machines. If there is a third machine C, for example, it may have its own anonymous node 'a' that is different from machine A's anonymous node 'a'. Thus, machine B really does need to keep a separate map from each of the other machines' anonymous node IDs to its local IDs, not a single map from remote IDs in general to local IDs. When processing incoming changes, it must take into account where the changes came from to map the IDs correctly.
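
The per-source bookkeeping described here could look roughly like the following. This is a minimal sketch in plain Java; `BnodeRemapper`, `toLocal`, and the `localN` label format are made-up names for illustration, not Jena API.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch: one remote-to-local ID map per source machine.
final class BnodeRemapper {
    // source machine -> (remote bnode id -> local bnode id)
    private final Map<String, Map<String, String>> bySource = new HashMap<>();
    private int nextLocal = 0;

    // Returns the local id for a remote bnode, minting a fresh one the
    // first time this (source, remote id) pair is seen.
    String toLocal(String sourceMachine, String remoteId) {
        return bySource
                .computeIfAbsent(sourceMachine, k -> new HashMap<>())
                .computeIfAbsent(remoteId, k -> "local" + nextLocal++);
    }
}
```

With this in place, `toLocal("A", "a")` and `toLocal("C", "a")` yield different local IDs, while repeated calls with the same pair return the same one, so later updates to a list node land on the node that was created earlier.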

+1  A: 

Are you allowed to add your own triples to the model? If so, I would introduce a statement for every bnode, giving each an alternate public id in the form of a URN. You can now start matching bnodes between the two models.
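
One way to picture this, as a sketch only: keep a registry that mints a `urn:uuid:` identifier the first time a bnode is seen and lets you look the bnode up again by that URN. The class and method names below are hypothetical. In the model itself, the pairing would be stored as an extra triple using a property of your own choosing.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.UUID;

// Hypothetical registry pairing each local bnode label with a public URN.
final class BnodePublicIds {
    private final Map<String, String> urnByLabel = new HashMap<>();
    private final Map<String, String> labelByUrn = new HashMap<>();

    // Returns the URN for a local bnode label, minting one on first use.
    String publicIdFor(String localLabel) {
        return urnByLabel.computeIfAbsent(localLabel, label -> {
            String urn = "urn:uuid:" + UUID.randomUUID();
            labelByUrn.put(urn, label);
            return urn;
        });
    }

    // Resolves a URN received from a peer to the local label, or null if unknown.
    String localLabelFor(String urn) {
        return labelByUrn.get(urn);
    }
}
```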

Blank nodes or not, though, two-way sync will only get you so far. If you are trying to detect equivalent changes made concurrently on both models, matching node IDs alone will not be enough.

Here's an example. Let's say you are starting a new lawn care company. In order to drum up some business, you and your partner go to a local outdoor event and try to book some discounted trial appointments. The two of you, each armed with a laptop, mingle and record anyone interested. Each record has:

address and zip
phone number
appointment dateTime

Let's say each record is stored as a resource in your model. It is possible for you to meet the husband, and your partner to meet the wife, of the same household. Whether or not you coincidentally book the same appointment dateTime, the system would be hard-pressed to de-duplicate the entry. Whether you use a bnode for each record or a UUID-based URI, it would not de-dup. The only hope is to use, say, the phone number in some canonical form to synthesize a deterministic URI for the record.
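
A minimal sketch of that idea, assuming the canonical form is simply "digits only" and using a placeholder base URI (both are assumptions for illustration):

```java
// Hypothetical helper: derives the same URI for the same phone number,
// however it was typed. The base URI is a placeholder.
final class RecordUris {
    private static final String BASE = "http://example.org/record/";

    static String forPhone(String rawPhone) {
        // Canonical form assumed here: strip everything except digits.
        String digits = rawPhone.replaceAll("\\D", "");
        if (digits.isEmpty()) {
            throw new IllegalArgumentException("no digits in phone number: " + rawPhone);
        }
        return BASE + digits;
    }
}
```

Both laptops would then produce the same URI for "(555) 123-4567" and "555.123.4567", so the two records merge on sync. A real canonical form would also have to deal with country codes and extensions, which this sketch ignores.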

Dilum Ranatunga
Thanks for your response! We're dealing with this isomorphism issue by confining each machine to a specific portion of the ontology. Relating back to your example, person A might work on the client list while person B is responsible for managing inventory. Our domain also has well-defined unique IDs for the relevant entities. The problem is that person A could add a list node to the client list that happens to overlap with a bnode id in the list of fertilizer suppliers.
Aaron Novstrup