
I work a lot with directed graphs resulting from heap dumps of Java programs. One thing that characterizes them is that they contain lots of repeating patterns. I would like to find a way of compressing such patterns whilst still retaining the essential structure of the graph. For instance, consider the following "molecules":

  |         |
  A1        A5
 / \       / \
B2  C4    B6  C8
 \ /       \ /
  D3        D7

The letter represents the type of the object and the number its unique id (or dfnum). Clearly the second molecule is a repeat of the first, just with different ids. From a heap analysis point of view the actual ids are unimportant, so you could replace the molecule at A5 with something that effectively says "another copy of A1". On decompression (for input to heap analysers, for instance) you could just assign arbitrary unique ids.

I could spot such patterns by maintaining a hash of the types of the objects during the DFS of the graph. So the hash for A1 would contain (for instance) "A^B^C^D", and this would match that for A5.
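A minimal sketch of that idea, computing a per-node shape hash in post-order during the DFS. Nothing below is taken from an actual heap-dump API: HeapNode, its fields and the combining scheme are all illustrative assumptions.

    import java.util.*;

    // Hypothetical node type for illustration only.
    class HeapNode {
        String type;                                  // e.g. "A", "B", "C", "D"
        List<HeapNode> children = new ArrayList<>();
        HeapNode(String type) { this.type = type; }
    }

    class ShapeHasher {
        private final Map<HeapNode, Integer> hashes = new IdentityHashMap<>();

        // Post-order DFS: a node's hash combines its own type with its
        // children's hashes, so identical "molecules" hash identically.
        // Caveats for a real dump: the recursion would need to become an
        // explicit stack, and the placeholder used to break cycles makes
        // the hash of cyclic structures depend on where the DFS entered.
        int hash(HeapNode node) {
            Integer cached = hashes.get(node);
            if (cached != null) return cached;
            hashes.put(node, 0);                      // placeholder to break cycles
            int h = node.type.hashCode();
            for (HeapNode child : node.children) {
                h = (h * 31) ^ hash(child);
            }
            hashes.put(node, h);
            return h;
        }
    }

With this scheme A1 and A5 in the figure above hash to the same value; it does not yet capture the "external node" case described next, where the identity of the target has to be folded in rather than just its type.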

The problem I have is with molecules that point to some external molecule. By "external" I mean something that has been visited earlier in the DFS. For instance (sorry about the ugly ASCII graphics):

  |         |
  A1        A5
 / \       / \
B2  C4    B6  C7
 \   \   /   /
  \   \ /   /
   \  | |  /
    \ | | /
     \| |/
       D3

For this situation, when I come to descend from A5 I find that D3 has already been visited. So I would like the hash code for A5 to represent the unique value of D3 rather than just its type, i.e. something like "A^B^C^D3". On compression/decompression we are then differentiating A5 as being a copy of A1, rather than of some other A whose B and C point to some other D.

My question is whether there are any tricks for doing such a calculation, i.e. how to tell that D is "outside" the molecule whose root is A. This also has to be done efficiently, preferably with one or two DFS passes (heap dumps contain tens of millions of objects). You don't know in advance that A is a candidate, so it probably needs to be an algorithm that works during the DFS. Maybe something dominator-tree related?
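One reading of "D is outside the molecule rooted at A" is "A does not dominate D", which lines up with the dominator-tree hunch: in the first figure A1 dominates D3 (so D3 can hash by type), while in the second figure, as drawn, neither A1 nor A5 dominates D3, so both would fold in D3's id and their hashes would still match. Assuming a dominator tree is already available (heap analysers commonly build one for retained-size calculations, and Lengauer–Tarjan computes it in near-linear time), one extra DFS over that tree gives pre/post numbers that answer "is D inside A's molecule?" in constant time. The sketch below illustrates that test only, with a hypothetical DomTreeNode type:

    import java.util.*;

    // Hypothetical dominator-tree node: 'children' are the nodes this node
    // immediately dominates, not the raw heap references.
    class DomTreeNode {
        List<DomTreeNode> children = new ArrayList<>();
    }

    class MoleculeBoundary {
        private final Map<DomTreeNode, Integer> pre  = new IdentityHashMap<>();
        private final Map<DomTreeNode, Integer> post = new IdentityHashMap<>();
        private int clock = 0;

        // One iterative DFS over the dominator tree assigns pre/post numbers.
        void number(DomTreeNode root) {
            Deque<DomTreeNode> stack = new ArrayDeque<>();
            stack.push(root);
            while (!stack.isEmpty()) {
                DomTreeNode n = stack.peek();
                if (!pre.containsKey(n)) {
                    pre.put(n, clock++);
                    for (DomTreeNode c : n.children) stack.push(c);
                } else {
                    stack.pop();
                    post.put(n, clock++);
                }
            }
        }

        // D is inside the molecule rooted at A iff A dominates D, i.e. D's
        // pre/post interval nests inside A's interval in the dominator tree.
        boolean isInside(DomTreeNode a, DomTreeNode d) {
            return pre.get(a) <= pre.get(d) && post.get(d) <= post.get(a);
        }
    }

Note that the inside/outside decision depends on which root you ask about (in the first figure D3 is inside A1's molecule but outside B2's), so per-root hashes cannot simply be composed bottom-up from the children's standalone hashes; the test above only answers the "is it outside?" part of the question.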

Hope that makes sense!

Edit: updated diagrams to clarify the fact that A1 and A5 themselves have parents and are just arbitrary nodes discovered during a DFS walk of the entire graph.

Clarification: for my purposes it's not important that the match is 100% guaranteed. By using hash codes as above I am happy to accept the very small chance that a hash collision will cause the algorithm to classify a molecule incorrectly. Because such collisions will be rare, they are unlikely to affect the overall picture much.

A: 

Without thinking about your problem in too much depth, I have solved my problem of compressing web data for my research using a trie. You should be able to serialize your data onto a trie for the purposes of compression.
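One possible reading of this suggestion (the reading, and every name in the sketch, are assumptions rather than part of the answer) is to serialize the root-to-node type paths produced by the DFS into a trie, so that repeated type sequences are stored only once:

    import java.util.*;

    // Rough illustration only: store the type paths produced by the DFS in a
    // trie so that repeated type sequences share their common prefixes.
    class TrieNode {
        Map<String, TrieNode> next = new HashMap<>();
        int count = 0;            // how many heap paths share this prefix
    }

    class TypePathTrie {
        final TrieNode root = new TrieNode();

        // path is e.g. ["A", "B", "D"] for the walk A1 -> B2 -> D3.
        void insert(List<String> path) {
            TrieNode n = root;
            for (String type : path) {
                n = n.next.computeIfAbsent(type, k -> new TrieNode());
                n.count++;
            }
        }
    }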

Hassan Syed
I don't understand what you mean. Remember we don't know in advance where our candidate molecule begins. In the above diagrams, A1 and A5 themselves have parents and are just nodes we come across during the DFS. The puzzle is to recognize that A5 is just a copy of A1.
Dave Griffiths
A: 

From what I can tell, this is likely to be related to the graph isomorphism problem. The Wikipedia page has a few pointers to current approaches to this problem. However, from a brief skim, most of these seem to be designed to compare two entire graphs rather than to look for isomorphic subgraphs within a larger graph.

With respect to search algorithms, my gut feeling is that depth-first search is not the right approach for this problem. You might think for a bit about what a breadth-first traversal might do. At least in the specific example you describe, that would allow you to look at A1 and A5 first, before committing to a specific "molecule" shape at either one.

Dale Hagglund
That's interesting, thanks for the pointer. I've clarified the question above with respect to hash collisions. I can understand that solving the problem exactly might run into NP-hard territory, but I would be happy with something less than 100% reliable.
Dave Griffiths
From a pure complexity-theory point of view, it was interesting to me that graph isomorphism isn't known to be NP-complete. Regardless of that, however, I agree that you want to look for some sort of heuristic approach. On small graphs, you could visit each node, build up a candidate set of "small" shapes it belongs to, and then select the shapes that get you the most compression. This idea might have trouble scaling to realistic graphs with hundreds of thousands or millions of objects, though.
Dale Hagglund
A: 

This is probably too abstract an answer to actually help, but you could create a subgraph per repeated pattern and then collapse each occurrence of the pattern into a single node (with a pointer to the respective subgraph structure). That node would manage all edges connected to the pattern. Such edges must also remember which node of the pattern they connect to, so that you can offer graph traversals which hide the details of the representation and walk the graph as if it were the original. This becomes complicated if you fail to abstract the internal representation and your algorithms need to understand nested graphs.
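A minimal sketch of the representation being described (every name below is made up for illustration):

    import java.util.*;

    // The shared shape of one repeated pattern, stored once.
    class PatternTemplate {
        List<String> nodeTypes = new ArrayList<>();      // index = position in the pattern
        List<int[]> internalEdges = new ArrayList<>();   // {fromIndex, toIndex}
    }

    // An edge crossing the pattern boundary must remember which internal
    // position of the pattern it really attaches to.
    class BoundaryEdge {
        Object externalNode;      // the node outside the pattern
        int internalIndex;        // position inside the PatternTemplate
        boolean outgoing;         // direction relative to the pattern
    }

    // One occurrence of a pattern, collapsed to a single node. A traversal
    // layer can expand it on the fly so callers see the original graph
    // without knowing about the compression (expansion logic omitted here).
    class CollapsedOccurrence {
        PatternTemplate template;                        // shared structure
        List<BoundaryEdge> boundary = new ArrayList<>();
    }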

As a side note, while graph isomorphism is generally a tough problem, in your case you have a lot of metadata on your graph (object types, field names, etc.), which makes it a labeled graph with very rare, selective labels. Such labels greatly prune the effort required to find isomorphic patterns (up to a small pattern size, of course, otherwise your pattern cache would fill all memory).

Since lots of objects will closely follow their class definitions (if there were no inheritance, objects would be structs and their runtime types would exactly match the definitions), I predict that what you're trying to do has a lot of potential to compress the object graph significantly.

Dimitris Andreou