views:

325

answers:

6

I have a C++ class representing a hierarchically organised data tree which is very large (~Gb, basically as large as I can get away with in memory). It uses an STL list to store information at each node plus iterators to other nodes. Each node has only one parent, but 0-10 children. Abstracted, it looks something like:

struct node {
public:
    node_list_iterator parent;              // iterator to a single parent node
    double node_data_array[X];
    map<int,node_list_iterator> children;   // iterators to child nodes
};

class strategy {
private:
    list<node> tree;        // hierarchically linked list of nodes
    struct some_other_data;
public:
    void build();           // build the tree
    void save();            // save the tree from disk
    void load();            // load the tree from disk
    void use();             // use the tree
};

I would like to implement the load() and save() to disk, and it should be fairly fast, however the obvious problems are:

  1. I don't know the size in advance;

  2. The data contains iterators, which are volatile;

  3. My ignorance of C++ is prodigious.

Could anyone suggest a pure C++ solution please?

+1  A: 

You can use boost.serialization library. This would save entire state of your container, even the iterators.

buratinas
I don't think boost::serialization is a help because the members of the list contain iterators to inside the list. Massive recursion problems ahead :)
Billy ONeal
it's not a problem if appropriate serialization routine is implemented.
buratinas
The fundamental issue with boost::serialization will be the iterator. The rest is already done for the OP, but there isn't any serialization support for iterators in the library, so that would have to be rolled custom. There is periodic chatter on how to do it on the boost lists, but nothing has been done that I can find.
McBeth
+1  A: 

boost.serialization is a solution, or IMHO, you can use SQLite + Visitor pattern to load and save these nodes, but it won't be easy as it sounds.

baris_a
+1  A: 

It seems like you could save the data in the following syntax:

File = Meta-data Node
Node = Node-data ChildCount NodeList
NodeList = sequence (int, Node)

That is to say, when serialized the root node contains all nodes, either directly (children) or indirectly (other descendants). Writing the format is fairly straightforward: just have a recursive write function starting at the root node.

Reading isn't that much harder. std::list<node> iterators are stable. Once you've inserted the root node, its iterator will not change, not even when inserting its children. Hence, when you're reading each node you can already set the parent iterator. This of course leaves you with the child iterators, but those are trivial: each node is a child of its parents. So, after you've read all nodes you'll fix up the child iterators. Start with the second node, the first child (The first node one was the root) and iterate to the last child. Then, for each child C, get its parent and the child to its parent's collection. Now, this means that you have to set the int child IDs aside while reading, but you can do that in a simple std::vector parallel to the std::list<node>. Once you've patched all child IDs in the respective parents, you can discard the vector.

MSalters
So I don't have to store the children information...
supert
Not explicitly. You'd effectively derive that from the position in the file: following its parent node. This is easy to realize when writing, by starting with the root node. And when reading, you recover the child relations by inverting all parent relations.
MSalters
+1  A: 

Boost Serialization has already been suggested, and it's certainly a reasonable possibility.

A great deal depends on how you're going to use the data -- the fact that you're using a multiway tree in memory doesn't mean you necessarily have to store it as a multiway tree on disk. Since you're (apparently) already pushing the limits of what you can store in memory, the obvious question is whether you're just interested in serializing the data so you can re-constitute the same tree when needed, or whether you want something like a database so you can load parts of the information into memory as needed, and update records as needed.

If you want the latter, some of your choices will also depend on how static the structure is. For example, if a particular node has N children, is that number constant or is it subject to change? If it's subject to change, is there a limit on the maximum number of children?

If you do want to be able to traverse the structure on disk, one obvious possibility would be as you write it out, substitute the file offset of the appropriate data in place of the iterator you're using in memory.

Alternatively, since it looks like (at least most of) the data in an individual node has a fixed size, you might create a database-like structure of fixed-size records, and in each record record the record numbers of the parent/children.

Knowing the overall size in advance isn't particularly important (offhand, I can't think of any way I'd use the size even if it was known in advance).

Jerry Coffin
+1  A: 

Actually, I think your best option is to move the entire data structure into database tables. That way you get the benefit of people much smarter then you (or me) having dealt with issues of serialization. It will also prevent you from having to worry about whether the structure can fit into memory.

Winston Ewert
A: 

I've answered something like this on SO before, so I will summarize:
1. Use a database.
2. Substitute file offsets for links (pointers).
3. Store the data without the tree structure, in records, as a database would.
4. Use XML to create the tree structure, using node names instead of links.
5. This would be soooo much easier if you used a database like SqLite or MySQL.

When you spend too much time on the "serialization" and less on the primary purpose of your project, you need to use a database.

Thomas Matthews
supert