views: 511
answers: 9
I'm working with a fairly large tree structure (a Burkhard-Keller tree, > 100 MB in memory) implemented in C++. The pointers to each node's children are stored in a QHash.

Each node x has n children y[1] ... y[n]; the edges to the children are labeled with the edit distance d(x, y[i]), so using a hash to store the nodes is an obvious choice.

#include <QHash>

class Node {
    int value;
    QHash<int, Node*> children;  // key: edit distance d(x, y[i])
    /* ... */
};

I also want to serialize and deserialize it to a file (I currently use a QDataStream). The tree is built just once and doesn't change afterwards.

Building the tree and deserializing it are both rather slow. I'm loading the tree in the obvious way: recursively building each node. I suspect this is suboptimal because of the many nodes that are created separately with the new operator; I've read that new is pretty slow. The initial build is not a big problem, because the tree is rather stable and doesn't have to be rebuilt often. But loading the tree from a file should be as fast as possible.

What's the best way to accomplish this?

It would surely be much better to save the whole tree in a single memory block with adjacent nodes. Serializing and deserializing would then reduce to saving and loading the whole block, which I would have to allocate only once.

But to implement this I would have to re-implement QHash, AFAIK.
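Just to illustrate what I mean (the names and exact layout here are invented, not working code):

// Invented sketch of a pointer-free layout: all nodes in one array,
// children referenced by index, so the block can be written and read
// as a whole.
struct FlatChild {
    qint32 distance;    // edge label d(x, y[i])
    qint32 childIndex;  // index into the node array instead of a Node*
};

struct FlatNode {
    qint32 value;
    qint32 firstChild;  // index into one global FlatChild array
    qint32 childCount;  // this node's children are stored contiguously
};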

What would you do to speed up the deserialization?

Addendum

Thank you for your suggestion to do some profiling. Here are the results:

While rebuilding the tree from a file:

 1 % of the time is consumed by my own new calls
65 % is consumed by loading the QHash objects of each node (this part is
     implemented by the Qt library)
12 % is consumed by inserting the nodes into the existing tree
20 % is everything else

So it's definitely not my new calls that cause the delay, but rather rebuilding the QHash objects at every node. This is basically done with:

 QDataStream in(&infile);
 in >> node.hash;

Do I have to dig into QHash and look at what's going on under the hood there? I think the best solution would be a hash object that can be serialized with a single read and a single write operation, without having to rebuild the internal data structure.
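For reference, what `in >> node.hash` has to do internally is roughly this (my paraphrase, not the literal Qt implementation):

 quint32 n;
 in >> n;                          // number of key/value pairs
 for (quint32 i = 0; i < n; ++i) {
     int key;
     Node *value;
     in >> key >> value;           // uses my own operator>> for Node*
     node.hash.insert(key, value); // hashing + bucket work for every entry
 }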

+1  A: 

I highly recommend the Boost serialization library. It should work with the structures you're using.
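A minimal sketch of what that could look like here, assuming the children map is switched from QHash to std::map (Boost has no built-in support for Qt containers):

#include <fstream>
#include <map>
#include <boost/archive/binary_oarchive.hpp>
#include <boost/serialization/access.hpp>
#include <boost/serialization/map.hpp>

class Node {
    friend class boost::serialization::access;
    int value;
    std::map<int, Node*> children;  // edit distance -> child

    template <class Archive>
    void serialize(Archive &ar, const unsigned int /*version*/) {
        ar & value;
        ar & children;              // Boost follows the Node* pointers
    }
};

void saveTree(const Node &root, const char *path) {
    std::ofstream ofs(path, std::ios::binary);
    boost::archive::binary_oarchive oa(ofs);
    oa << root;
    // loading is symmetric, with boost::archive::binary_iarchive
}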

Maciek
I second this: Boost is a nice solution and automatically handles all the child/parent relationships. It's worth investigating, given that the benchmarks show that QHash (the current child/parent container) is what's eating up most of the time. It's also available on a wide range of platforms. On the other hand, I have no idea how well Boost plays with Qt.
DrYak
+1  A: 

The absolute fastest way of serialising/deserialising is writing a block of contiguous memory to disk, as you say. If you changed your tree structure to create this (probably using a custom allocation routine), it would be very easy.

Unfortunately I'm not that familiar with QHash, but from looking at it, it looks like a hash table rather than a tree. Have I misunderstood you? Are you using it to map duplicate nodes?

I'd use a profiler (I used to use Quantify, now called Rational PurifyPlus, but there are a lot listed here) to find where you're spending time, but I'd guess the cost is either multiple memory allocations rather than a single allocation, or multiple reads rather than a single read. To solve both problems, you know in advance (because you store it) how many nodes you need; then write/read an array of nodes of the correct length, where each pointer is an index into the array rather than a pointer in memory.
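A sketch of that idea (PackedNode is invented; alignment and endianness are ignored for brevity):

struct PackedNode {
    qint32 value;
    qint32 firstChild;  // index into the array, not a pointer
    qint32 childCount;
};

// Loading collapses into a single bulk read:
QVector<PackedNode> nodes;
quint32 count;
in >> count;
nodes.resize(count);
in.readRawData(reinterpret_cast<char*>(nodes.data()),
               int(count * sizeof(PackedNode)));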

Nick Fortescue
Each node of the tree has a key and a hash table to its leaves. Each leaf is referenced by an arbitrary number. To be precise: a node x has n leaves y_1 ... y_n, and each edge from x to y_i is labeled with the edit distance d(x, y_i) (see http://en.wikipedia.org/wiki/BK-tree).
Wolfgang Plaschg
+1 for profiling before optimising...
neuro
A: 

Another solution would be to use your own memory allocator that uses one contiguous memory region. Then you can dump the memory as-is and load it back. It's platform-sensitive (big endian/little endian, 32-bit/64-bit).
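A minimal arena sketch (illustrative only; alignment and the pointer-to-offset translation needed for dumping are omitted):

#include <cassert>
#include <cstddef>
#include <vector>

class Arena {
    std::vector<char> buf_;
    std::size_t used_;
public:
    explicit Arena(std::size_t bytes) : buf_(bytes), used_(0) {}
    void* alloc(std::size_t n) {
        assert(used_ + n <= buf_.size());  // fixed capacity, no growth
        void *p = &buf_[used_];
        used_ += n;
        return p;
    }
    const char* data() const { return &buf_[0]; }  // dump in one write
    std::size_t size() const { return used_; }
};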

Drakosha
-1 for this idea: you mention some problems, but in reality this is also sensitive to the compiler, the optimization level and debug/release builds, not to mention extending the tree in the future and handling migration nicely.
RnR
+1 to offset: with a suitable level of abstraction that's certainly possible, e.g. using iterators and storing offsets instead of pointers. Especially for a "build once, never modify" structure, an arena allocator is extremely efficient. Platform portability *is* a problem though, and it's probably not going to solve the OP's problem.
peterchen
A: 

As you said, allocating objects with new can be slow. That can be improved by allocating an object pool and handing out pre-allocated objects until the pool is exhausted. You might even be able to make this transparent by overloading the new/delete operators of the class in question.
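A hedged sketch of the overload (the pool here is just a static block; exhaustion checks, alignment and thread safety are glossed over):

#include <cstddef>

class Node {
    /* ... */
public:
    // Sketch: carve Nodes out of one pre-allocated block until it runs out.
    static void* operator new(std::size_t size) {
        static char pool[64 * 1024 * 1024];  // example size
        static std::size_t used = 0;
        void *p = pool + used;
        used += size;                        // no exhaustion check here
        return p;
    }
    static void operator delete(void*) {
        // no-op: the whole pool is reclaimed when the program ends
    }
};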

+4  A: 

First of all, profile your application so that you know what actually takes the time. Basing the suspicion on new because you've read somewhere that it can be slow, or on the iteration through the tree, is not enough.

It's possible it's the IO operations; maybe your file format is incorrect or inefficient.

Maybe you just have a defect somewhere?

Or maybe there's a quadratic loop somewhere that you've forgotten about that's causing the problems? :)

Measure what really takes the time in your case and only then tackle the issue; it'll save you a lot of time, and you'll avoid breaking your design/code to fix performance problems that don't exist before you've found the real cause.

RnR
+1. I totally agree. Always profile before optimising. Even if your guess is right, you will know exactly how much you gain from a given optimisation.
neuro
Each node is stored with an overloaded `<<` operator into a QDataStream. This is the recommended way to store Qt objects. No, there's no quadratic loop. I did some profiling, and the results disproved my assumption (see the edited question).
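For reference, the operators look roughly like this (a sketch, not my exact code; both are declared friends of Node):

QDataStream& operator<<(QDataStream &out, Node* const &n) {
    out << n->value << n->children;  // QHash has built-in stream support
    return out;
}

QDataStream& operator>>(QDataStream &in, Node* &n) {
    n = new Node;
    in >> n->value >> n->children;   // rebuilds the hash at every node
    return in;
}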
Wolfgang Plaschg
A: 

Your own memory allocation with overloaded operator new() and delete() is a low-cost option (in development time). However, this only affects the memory allocation time, not the constructor times. Your mileage may vary, but it may be worth a try.

Denes Tarjan
A: 

Some issues with serializing and deserializing a tree:

1. Pointers don't translate very well. An object may not end up at the
   same address when it is created again, and there may not be a way to
   allocate an object at a given location.

2. Serializing in the wrong order may defeat the purpose of the tree.
   For example, if you serialize the tree using pre-order traversal and
   rebuild it by re-inserting in that order, the resulting tree will not
   be as efficient; it can degenerate into something like a linked list.

You may want to serialize the objects instead of the tree structure. This will allow you to change the data structure in the future if necessary.
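For instance (a sketch; BKTree and its insert() are hypothetical stand-ins for your tree class):

QVector<int> values;
in >> values;                // one bulk read of the payload
BKTree tree;                 // hypothetical tree type
for (int i = 0; i < values.size(); ++i)
    tree.insert(values[i]);  // re-derives the edges via the edit distance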

Thomas Matthews
A: 

I'll expand my comment a bit:

Since your profiling suggests that the QHash serialization takes the most time, I believe that replacing QHash with a QList would yield a significant improvement when it comes to deserialization speed.

The QHash serialization just outputs the key/value pairs, but the deserialization constructs a hash data structure!

Even though you said you need the fast child lookup, I would recommend that you try replacing QHash with a QList<QPair<int, Node*> > as a test. If there aren't many children per node (say, fewer than 30), the lookup should still be fast enough even with a QList. If you find that QList is not fast enough, you could still use it just for (de)serialization and convert it to a hash once the tree has been loaded.
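A sketch of the list-based children with a linear-scan lookup:

// inside Node:
QList<QPair<int, Node*> > children;

Node* findChild(int distance) const {
    for (int i = 0; i < children.size(); ++i)  // fine for small fan-out
        if (children.at(i).first == distance)
            return children.at(i).second;
    return 0;  // no child with that edge label
}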

rpg
+1  A: 

Another approach would be to serialize your pointers and restore them when loading. I mean:

Serializing:

nodeList = collectAllNodes()

for n in nodeList:
    write( address(n) )   // the node's current address, used as an ID
    writeNode( n )        // with pointers as-they-are

Deserializing:

//read all nodes into a list
while not eof(f):
    read( prevNodeAddress )
    node = readNode()
    fixMap[prevNodeAddress] = node
    nodeList.append(node)

//fix pointers to their new values
for n in nodeList:
    for child in n.children:
        child.node = fixMap[child.node]

This way, if you don't insert or remove nodes, you can allocate a vector once and use that memory, reducing your allocations to the map (as rpg said, it might be faster with lists or even vectors). See the C++ sketch below.
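A rough C++/Qt version of the idea (a sketch; it assumes access to Node's internals and stores the old addresses as quint64 for a fixed on-disk size):

QHash<quint64, Node*> fixMap;  // old address -> new node

while (!in.atEnd()) {
    quint64 oldAddr;
    quint32 childCount;
    Node *n = new Node;
    in >> oldAddr >> n->value >> childCount;
    for (quint32 i = 0; i < childCount; ++i) {
        int dist;
        quint64 oldChild;
        in >> dist >> oldChild;
        // keep the stale address for now; the second pass fixes it
        n->children.insert(dist, reinterpret_cast<Node*>(quintptr(oldChild)));
    }
    fixMap.insert(oldAddr, n);
}

// second pass: replace the stale child pointers with the new addresses
foreach (Node *n, fixMap) {
    QHash<int, Node*>::iterator it = n->children.begin();
    for (; it != n->children.end(); ++it) {
        quint64 old = quint64(reinterpret_cast<quintptr>(it.value()));
        it.value() = fixMap.value(old);
    }
}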

davidnr
Nice answer! Thank you
Wolfgang Plaschg