views:

178

answers:

2

I'm looking through possible representations for what can be considered a finite depth graph in XML format for data exchange purposes. The problematic point is how to reference nodes in edge tags. Two strategies I see are a) using unique identifiers or b) using paths.

Unique IDs:

<graph id="g0">
  <node id="n0"/>
  <node id="n1"/>
  <edge from="n1" to="n0"/>
</graph>
<graph id="g1">
  <node id="n2"/>
</graph>
<edge from="n2" to="n1"/>

Paths:

<graph id="0">
  <node id="0"/>
  <node id="1"/>
  <node id="2"/>
  <edge from="1" to="0"/>
  <edge from="2" to="1"/>
</graph>
<graph id="1">
  <node id="0"/>
</graph>
<edge from="1:0" to="0:2"/>

What is the standard procedure for these kinds of things? From what I've gathered, unique identifier approach seems to be more widespread. My issue with that is when graphs are becoming very large, there's:

  • necessity for a really large hash table that maps objects to their IDs for purposes of reading/writing edges from/to XML files
  • the file itself is larger than the one written using paths because you cannot omit redundant path components if edge is internal to the graph.

Thoughts?

Update 1:

Note that its not one flat graph; its one or more graphs interconnected. They each have locally indexed elements, but flattening them all out and keeping track of edges across them is a bit of a nuisance.

Update 1.1: Noticed that with sub-graphs in GraphML, they do in fact use complex keys that makes it possible to separate local node id out of the global one.

Update 2:

Yes, obviously this is not a well formed XML and missing tags and all sorts of schema declarations.

+3  A: 

There is a schema describing such graph: see GraphML

Example:

<?xml version="1.0" encoding="UTF-8"?>
<graphml xmlns="http://graphml.graphdrawing.org/xmlns"  
    xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
    xsi:schemaLocation="http://graphml.graphdrawing.org/xmlns
     http://graphml.graphdrawing.org/xmlns/1.0/graphml.xsd"&gt;
  <graph id="G" edgedefault="undirected">
    <node id="n0"/>
    <node id="n1"/>
    <node id="n2"/>
    <node id="n3"/>
    <node id="n4"/>
    <node id="n5"/>
    <node id="n6"/>
    <node id="n7"/>
    <node id="n8"/>
    <node id="n9"/>
    <node id="n10"/>
    <edge source="n0" target="n2"/>
    <edge source="n1" target="n2"/>
    <edge source="n2" target="n3"/>
    <edge source="n3" target="n5"/>
    <edge source="n3" target="n4"/>
    <edge source="n4" target="n6"/>
    <edge source="n6" target="n5"/>
    <edge source="n5" target="n7"/>
    <edge source="n6" target="n8"/>
    <edge source="n8" target="n7"/>
    <edge source="n8" target="n9"/>
    <edge source="n8" target="n10"/>
  </graph>
</graphml>
Pierre
A: 

the file itself is larger than the one written using paths because you cannot omit redundant path components if edge is internal to the graph.

This point is premature optimization. XML parsers/writers aren't going to choke on large files, and if storage size is a concern, XML usually compresses very well with ZIP.

necessity for a really large hash table that maps objects to their IDs for purposes of reading/writing edges from/to XML files

That's an implementation concern. You can certainly avoid having a large hash table like this if you write your XML read/write routines into the graph, node and edge classes themselves rather than trying to maintain the mapping in a separate structure. Graphs are pretty easy to serialize and deserialize.

Unique IDs are probably the way to go. If you structure the IDs in a manner similar to the hierarchal way you proposed it will be relatively human readable as well, which is one of XML's goals.

Welbog