How to transfer a binary tree (not a balanced tree) across two different systems efficiently, retaining its complete structure?
The obvious way would be to convert your binary tree to an array of nodes, replacing each pointer in the original tree with an index to a node in the array. You can then transmit that array, and on the other end reconstruct a tree with identical structure.
This structure given below
[x]
/ \
[L] [R]
\
[P]
can be translated easily into
(X,(L,-,(P,-,-)),(R,-,-))
Also, read a post by Eric Lippert.
NOTE: I feel, similar thing should work for arbitrary trees. Any comments?
Define serialization functions.
void serialize( FILE *f, my_tree *node, _Bool is_root ) {
if ( node == NULL ) {
fputc( no_child, f );
return;
}
if ( ! is_root ) fputc( data_prefix, f );
write_data( f, node->data );
fputc( data_terminator, f );
write_data( node->left_child );
write_data( node->right_child );
}
void deserialize_node( FILE *f, my_tree *node ) {
node->data = read_data_field( f );
if ( fgetc( f ) != no_child ) {
node->left_child = calloc( 1, sizeof( my_tree ) );
deserialize( f, node->left_child, false );
}
if ( fgetc( f ) != no_child ) {
node->right_child = calloc( 1, sizeof( my_tree ) );
deserialize( f, node->right_child, false );
}
}
Come to think of it, this simple scheme (where data_terminator
and no_child
must be single characters) allows both data_terminator
and no_child
to be equal.
The main issue with this is that you have to replace pointers or references from your in memory representation with something else that can be used to unambiguously represent the node that was pointed to.
foo
/ \
cat zebra
\
dog
One way to do this is to exchange the pointers for keys -- more like an array index than a proper pointer.
1 2 "foo"
3 _ "cat"
_ _ "zebra"
_ _ "dog"
In this representation the first field is the line number (counting starts at 0, which is the root node) of the left child, the second field is the right child, and the third field is the value. The tree is ordered alphabetically. This seems simple, but can be difficult to actually do.
A similar approach would put the key in each entry rather than rely on position. This method could use the original pointers as the keys and the read-in code would have to build a translation/symbol table to switch between the keys and new pointers.
Another way to go about this is with a lisp-esque tree: (foo (cat () (dog () ()) (zebra () () ))
Formatted for easy viewing:
(foo
(cat
()
(dog
()
()
)
)
(zebra
()
()
)
)
This can be easily generated by a simple in order traversal. It can also be read in with a very simple recursive decent parser. You can also alter this to decrease the sizes of leaf nodes in the serialized format by omitting the nil
or ()
or whatever you chose for NULL pointers.
Another method, which is similar to the first, is to store all of tree in one chunk of memory that can be dumped to and read back from disk. The pointers in this would be relative to the beginning of this memory chunk, rather than absolute pointers. This would be a fast way for two programs on the same type of machine (using the same CPU memory width) to share trees (or other graphs), but is likely to be difficult to implement.
The lisp-esqe version of this is super easy to implement, but does not easily extend to things that aren't trees, where there could be a cyclic reference or more than one parent for a particular node, though it can be done. It also does not easily extend to handle storing more than one structure in a particular file.
The line positional index version works for most types of graphs, but storing more than one structure in a particular file would need to alter this format somewhat.
No matter what you choose you will need to make sure you can handle all values that could be present as node data. For instance if the node data could contain a "
, )
, or \n
then it might cause problems in some of the formats I've show, and those characters would need to be escaped. You could prefix fields with their length or use constant structure layout to account for this, though.
You will also need to make sure that any binary fields are stored in an endian consistent manner if you plan on sharing data between different machine types. You will also want this data to have consistent size (use stdint.h types rather than int and long) and a canonical representation for things like floating point numbers.