How to transfer a binary tree (not a balanced tree) across two different systems efficiently, retaining its complete structure?

+6  A: 

The obvious way would be to convert your binary tree to an array of nodes, replacing each pointer in the original tree with an index to a node in the array. You can then transmit that array, and on the other end reconstruct a tree with identical structure.
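
A minimal sketch of that idea in C, assuming a hypothetical node type that holds a single `int32_t` value. The names `node`, `flat_node`, `flatten`, `unflatten` and `NO_CHILD` are made up for illustration, and the caller is assumed to know the node count so the array can be sized up front.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

#define NO_CHILD (-1)   /* index value standing in for a NULL pointer */

/* In-memory node (hypothetical layout). */
struct node {
    int32_t value;
    struct node *left, *right;
};

/* Transmitted node: child pointers replaced by array indices. */
struct flat_node {
    int32_t value;
    int32_t left, right;
};

/* Pre-order walk that stores each node in out[] and returns its index.
 * *count is the next free slot; out[] must have room for every node. */
int32_t flatten(const struct node *n, struct flat_node *out, int32_t *count)
{
    assert(out != NULL && count != NULL);
    if (n == NULL)
        return NO_CHILD;
    int32_t self = (*count)++;
    out[self].value = n->value;
    out[self].left  = flatten(n->left, out, count);
    out[self].right = flatten(n->right, out, count);
    return self;
}

/* Rebuild a pointer-based tree from the array, starting at index.
 * Returns NULL for NO_CHILD or if an allocation fails. */
struct node *unflatten(const struct flat_node *in, int32_t index)
{
    if (index == NO_CHILD)
        return NULL;
    struct node *n = malloc(sizeof *n);
    if (n == NULL)
        return NULL;
    n->value = in[index].value;
    n->left  = unflatten(in, in[index].left);
    n->right = unflatten(in, in[index].right);
    return n;
}
```

The array of `flat_node` entries is what gets transmitted. Its fields still need a fixed byte order on the wire if the two systems differ.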

Jerry Coffin
+1. That is what I have used myself in the past.
Dummy00001
A: 

FAQ 36.9 should help.

dirkgently
Although there is a lot to learn at that link, the question is tagged C and *not* C++. It needs to be read with that in mind...
RBerteig
@dirkgently: Uhm... question is tagged `C` and you point to `C++` FAQ? And I have read the link, and it is one of those silly, rather useless (== filled with obvious) articles C++ FAQ posts sometimes.
Dummy00001
@Dummy00001: Can you tell me what is there that you find silly? Leaving apart the C++ bit I find it pretty instructive. YMMV.
dirkgently
@dirkgently: the article was written by Captain Obvious, obviously: "The key to serializing these graphs is to ignore a node's identity and instead to focus only on its contents. A (typically recursive) algorithm dives through the tree and writes the contents as it goes." Essentially "dump all the stuff the way you would have dumped it anyway".
Dummy00001
Obvious to you, but not to the target audience.
Skurmedel
@Dummy00001: A lot of CS/Engineering is based on the obvious. Unfortunately, it still needs to be rubbed in from time to time. At least for me. And I still find it useful.
dirkgently
+7  A: 

The structure given below

    [X]
   /   \
 [L]   [R]
   \
   [P]  


can be translated easily into

(X,(L,-,(P,-,-)),(R,-,-))
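
A rough sketch of how that string could be produced in C, assuming single-character labels. The type `tnode` and the functions `put` and `to_text` are hypothetical names used only for this illustration.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical node type with a single-character label. */
struct tnode {
    char label;
    struct tnode *left, *right;
};

/* Store c at buf[pos] (keeping buf NUL-terminated) if it fits,
 * and return the position that follows it either way. */
size_t put(char *buf, size_t cap, size_t pos, char c)
{
    assert(buf != NULL);
    if (pos + 1 < cap) {
        buf[pos] = c;
        buf[pos + 1] = '\0';
    }
    return pos + 1;
}

/* Write the "(label,left,right)" form of the subtree at n into buf,
 * starting at pos, with '-' for a missing child.
 * Returns the position after the last character the subtree needs. */
size_t to_text(const struct tnode *n, char *buf, size_t cap, size_t pos)
{
    if (n == NULL)
        return put(buf, cap, pos, '-');
    pos = put(buf, cap, pos, '(');
    pos = put(buf, cap, pos, n->label);
    pos = put(buf, cap, pos, ',');
    pos = to_text(n->left, buf, cap, pos);
    pos = put(buf, cap, pos, ',');
    pos = to_text(n->right, buf, cap, pos);
    return put(buf, cap, pos, ')');
}
```

Calling `to_text(root, buf, sizeof buf, 0)` on the tree drawn above fills `buf` with the string shown, provided the buffer is large enough.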

Also, read a post by Eric Lippert.

NOTE: I feel a similar approach should work for arbitrary trees. Any comments?

TheMachineCharmer
for uniformity, `(P,-,-)`
Potatoswatter
Yes. Thanks for pointing that out! :)
TheMachineCharmer
Note for myself : To get up votes mention Eric Lippert!! ;D
TheMachineCharmer
For arbitrary trees, the '-' is not needed. (X,(L,(P)),(R)) would do. Or won't it?
lalli
@lalli Yes, you are right. +1
TheMachineCharmer
+3  A: 

Define serialization functions.

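#include <stdbool.h>
#include <stdio.h>
#include <stdlib.h>

/* my_tree, write_data(), read_data_field() and the single-character markers
   no_child, data_prefix and data_terminator are assumed to be defined
   elsewhere; read_data_field() is assumed to consume data_terminator. */

void serialize( FILE *f, my_tree *node, bool is_root ) {
    if ( node == NULL ) {
        fputc( no_child, f );   /* mark an absent child */
        return;
    }

    /* every node except the root is announced by data_prefix */
    if ( ! is_root ) fputc( data_prefix, f );
    write_data( f, node->data );
    fputc( data_terminator, f );

    /* recurse into both subtrees; a NULL child emits no_child */
    serialize( f, node->left_child, false );
    serialize( f, node->right_child, false );
}

/* node must point to zero-initialized storage; the caller has already
   consumed the data_prefix (if any) that precedes this node */
void deserialize_node( FILE *f, my_tree *node ) {
    node->data = read_data_field( f );

    if ( fgetc( f ) != no_child ) {
         node->left_child = calloc( 1, sizeof( my_tree ) );
         deserialize_node( f, node->left_child );
    }

    if ( fgetc( f ) != no_child ) {
         node->right_child = calloc( 1, sizeof( my_tree ) );
         deserialize_node( f, node->right_child );
    }
}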

Come to think of it, this simple scheme (where data_terminator and no_child must be single characters) allows data_terminator and no_child to be the same character.

Potatoswatter
-1 because this is a C++ answer for a C question
Jens Gustedt
bah! need to pay attention.
Potatoswatter
@Jens: translated.
Potatoswatter
@Potatoswatter: ok, thanks, downvote removed
Jens Gustedt
+1  A: 

The main issue with this is that you have to replace pointers or references from your in-memory representation with something else that can be used to unambiguously represent the node that was pointed to.

     foo
    /   \
 cat     zebra
    \
     dog

One way to do this is to exchange the pointers for keys -- more like an array index than a proper pointer.

1 2 "foo"
_ 3 "cat"
_ _ "zebra"
_ _ "dog"

In this representation the first field is the line number (counting starts at 0, which is the root node) of the left child, the second field is the right child, and the third field is the value. The tree is ordered alphabetically. This seems simple, but can be difficult to actually do.

A similar approach would put the key in each entry rather than rely on position. This method could use the original pointers as the keys and the read-in code would have to build a translation/symbol table to switch between the keys and new pointers.

Another way to go about this is with a lisp-esque tree: (foo (cat () (dog () ())) (zebra () ()))

Formatted for easy viewing:

(foo
   (cat
      ()
      (dog
         ()
         ()
      )
   )
   (zebra
        ()
        ()
   )
)

This can be easily generated by a simple pre-order traversal. It can also be read in with a very simple recursive descent parser. You can also alter this to decrease the sizes of leaf nodes in the serialized format by omitting the nil or () or whatever you chose for NULL pointers.
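
As a rough illustration, a parser for this format might look like the sketch below. It assumes node names are runs of characters containing no whitespace or parentheses (so no escaping), and the names `snode`, `skip_ws` and `parse_tree` are invented for this example.

```c
#include <assert.h>
#include <ctype.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical node type holding a heap-allocated name. */
struct snode {
    char *name;
    struct snode *left, *right;
};

/* Advance *s past any whitespace. */
static void skip_ws(const char **s)
{
    while (isspace((unsigned char)**s))
        (*s)++;
}

/* Parse one "(name left right)" or "()" form at *s and advance *s past it.
 * Sets *ok to 0 on malformed input or allocation failure; a partially
 * built tree may still be returned and must be freed by the caller. */
struct snode *parse_tree(const char **s, int *ok)
{
    assert(s != NULL && *s != NULL && ok != NULL);
    skip_ws(s);
    if (**s != '(') { *ok = 0; return NULL; }
    (*s)++;
    skip_ws(s);
    if (**s == ')') { (*s)++; return NULL; }   /* "()" is the empty tree */

    /* the name runs up to the next space or parenthesis */
    const char *start = *s;
    while (**s && !isspace((unsigned char)**s) && **s != '(' && **s != ')')
        (*s)++;
    size_t len = (size_t)(*s - start);

    struct snode *n = calloc(1, sizeof *n);
    if (n == NULL || len == 0) { free(n); *ok = 0; return NULL; }
    n->name = malloc(len + 1);
    if (n->name == NULL) { free(n); *ok = 0; return NULL; }
    memcpy(n->name, start, len);
    n->name[len] = '\0';

    n->left  = parse_tree(s, ok);   /* first sub-form  */
    n->right = parse_tree(s, ok);   /* second sub-form */

    skip_ws(s);
    if (**s == ')') (*s)++; else *ok = 0;       /* closing parenthesis */
    return n;
}
```

The caller initializes `ok` to 1 and checks it afterwards. Supporting the shortened form with omitted () entries would need extra lookahead that this sketch does not have.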

Another method, which is similar to the first, is to store the whole tree in one chunk of memory that can be dumped to and read back from disk. The pointers in this would be relative to the beginning of this memory chunk, rather than absolute pointers. This would be a fast way for two programs on the same type of machine (using the same CPU memory width) to share trees (or other graphs), but is likely to be difficult to implement.

The lisp-esque version of this is super easy to implement, but does not easily extend to things that aren't trees, where there could be a cyclic reference or more than one parent for a particular node, though it can be done. It also does not easily extend to handle storing more than one structure in a particular file.

The line positional index version works for most types of graphs, but storing more than one structure in a particular file would require altering this format somewhat.

No matter what you choose you will need to make sure you can handle all values that could be present as node data. For instance if the node data could contain a ", ), or \n then it might cause problems in some of the formats I've shown, and those characters would need to be escaped. You could prefix fields with their length or use constant structure layout to account for this, though.
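
One possible shape for the length-prefix idea, sketched under the assumption that a field is written as its decimal length, a colon, and then the raw bytes (for example `5:hello`). The function names `put_field` and `get_field` are hypothetical.

```c
#include <assert.h>
#include <ctype.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>

/* Write "<len>:" followed by len raw bytes of data into out.
 * Returns the number of bytes written, or 0 if out is too small.
 * The output is not NUL-terminated past the header. */
size_t put_field(char *out, size_t cap, const char *data, size_t len)
{
    assert(out != NULL && data != NULL);
    int head = snprintf(out, cap, "%zu:", len);
    if (head < 0 || (size_t)head >= cap || cap - (size_t)head < len)
        return 0;
    memcpy(out + head, data, len);
    return (size_t)head + len;
}

/* Read one field from in (avail bytes long). On success *data points at
 * the payload inside in, *len is its length, and the number of bytes
 * consumed is returned. Returns 0 on malformed or truncated input. */
size_t get_field(const char *in, size_t avail, const char **data, size_t *len)
{
    size_t i = 0, n = 0;
    while (i < avail && isdigit((unsigned char)in[i])) {
        if (n > (SIZE_MAX - 9) / 10)
            return 0;                       /* length would overflow */
        n = n * 10 + (size_t)(in[i] - '0');
        i++;
    }
    if (i == 0 || i >= avail || in[i] != ':')
        return 0;
    i++;                                    /* skip the ':' */
    if (avail - i < n)
        return 0;                           /* payload is truncated */
    *data = in + i;
    *len = n;
    return i + n;
}
```

Because the reader knows the payload length in advance, quotes, parentheses and newlines inside the data need no escaping.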

You will also need to make sure that any binary fields are stored in an endian-consistent manner if you plan on sharing data between different machine types. You will also want this data to have consistent size (use stdint.h types rather than int and long) and a canonical representation for things like floating point numbers.
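
A small sketch of what that could look like, picking big-endian as the on-disk byte order (an arbitrary choice for this example). The helper names are hypothetical, and the `double` helpers assume both machines use the IEEE 754 64-bit format with the same byte order for integers and floating point.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Store v as 4 bytes, most significant first, whatever the host order. */
void put_u32_be(uint8_t out[4], uint32_t v)
{
    out[0] = (uint8_t)(v >> 24);
    out[1] = (uint8_t)(v >> 16);
    out[2] = (uint8_t)(v >> 8);
    out[3] = (uint8_t)v;
}

/* Inverse of put_u32_be. */
uint32_t get_u32_be(const uint8_t in[4])
{
    return ((uint32_t)in[0] << 24) | ((uint32_t)in[1] << 16) |
           ((uint32_t)in[2] << 8)  |  (uint32_t)in[3];
}

/* Store a double as its 64-bit pattern, most significant byte first.
 * Assumption: double is IEEE 754 binary64 on both ends. */
void put_f64_be(uint8_t out[8], double d)
{
    uint64_t bits;
    assert(sizeof bits == sizeof d);
    memcpy(&bits, &d, sizeof bits);         /* reinterpret without aliasing issues */
    for (int i = 0; i < 8; i++)
        out[i] = (uint8_t)(bits >> (56 - 8 * i));
}

/* Inverse of put_f64_be. */
double get_f64_be(const uint8_t in[8])
{
    uint64_t bits = 0;
    double d;
    for (int i = 0; i < 8; i++)
        bits = (bits << 8) | in[i];
    memcpy(&d, &bits, sizeof d);
    return d;
}
```

Writing bytes with shifts rather than dumping the integer's memory is what makes the result independent of the host's endianness.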

nategoose
+1 Thorough answer.
Skurmedel