views:

525

answers:

3

I'm looking for a good data structure to build equivalence classes on nodes of a tree. In an ideal structure, the following operations should be fast (O(1)/O(n) as appropriate) and easy (no paragraphs of mystery code):

  • (A) Walk the tree from the root; on each node --> child transition enumerate all the equivalent versions of the child node
  • (B) Merge two equivalence classes
  • (C) Create new nodes from a list of existing nodes (the children) and other data
  • (D) Find any nodes structurally equivalent to node (i.e. they have the same number of children, corresponding children belong to the same equivalence class, and their "other data" is equal) so that new (or newly modified) nodes may be put in the right equivalence class (via a merge)

So far I've considered (some of these could be used in combination):

  • A parfait, where the children are references to collections of nodes instead of to nodes. (A) is fast, (B) requires walking the tree and updating nodes to point to the merged collection, (C) requires finding the collection containing each child of the new node, (D) requires walking the tree
  • Maintaining a hash of nodes by their characteristics. This makes (D) much faster but (B) slower (since the hash would have to be updated when equivalence classes were merged)
  • String the nodes together into a circular linked list. (A) is fast, (B) would be fast but for the fact that that "merging" part of a circular list with itself actually splits the list (C) would be fast, (D) would require walking the tree
  • Like above, but with an additional "up" pointer in each node, which could be used to find a canonical member of the circular list.

Am I missing a sweet alternative?

+3  A: 

I don't think any one structure is going to solve your problems, but you might take a look at the Disjoint-set data structure. An equivalence class, after all, is the same thing as a partitioning of a set. It should be able to handle some of those operations speedily.

Jay Kominek
The solutions outlined in the link are basically a subset of the ones I listed above (with the minor exception of tree-flattening, which I'd considered an implicit part of the up-pointer case). Is your answer then "no, you aren't missing any sweet alternatives"?
MarkusQ
+1  A: 

You seem to have two forms of equivalence to deal with. Plain equivalence (A), tracked as equivalence classes which are kept up to date and structural equivalence (D), for which you occasionally go build a single equivalence class and then throw it away.

It sounds to me like the problem would be conceptually simpler if you maintain equivalence classes for both plain and structural equivalence. If that introduces too much churn for the structural equivalence, you could maintain equivalence classes for some aspects of structural equivalence. Then you could find a balance where you can afford the maintenance of those equivalence classes but still greatly reduce the number of nodes to examine when building a list of structurally equivalent nodes.

Kevin Beck
The "structural equivalence" is more of an index, to facilitate discovery of new matches (e.g. if I know A:{x = sqrt(z+a+7)} and B:{y = sqrt(z+b+7)} then learn C:{a=b} it facilitates discovering that I can merge A and B). But your suggestion makes sense (e.g. indexing them by top-level operator).
MarkusQ
+1  A: 

Stepping back for a moment I'd suggest against using a tree at all. Last time I had to confront a similar problem, I began with a tree, but later moved onto an array.

Reasons being multiple but number one reason was performance, my classes with up to 100 or so children would actually perform better while manipulating them as array than through the the nodes of a tree, mostly because of hardware locality, and CPU prefetch logic, and CPU pipelining.

So although algorithmically an array structure requires a larger N of operations than a tree, performing these dozens of operations is likely faster than chasing pointers across memory.

Robert Gould
Yeah, the "tree" will probably end up being stored as an array of TAC or some such. But by the very nature of the overall algorithm I think locality is at risk.
MarkusQ