views:

145

answers:

3

I have tried again to ask the same question, but I ended up asking a different question by not providing essential information about my problem.

I implement a data structure which is a tree. Each node of this tree has an array/vector/(random access structure) with all its children which can be arbitrary many. The insertion/removal of elements is easy since we always double/divide-by-two the number of the elements in this array.

That is what the O(k) insertion/removal means in this context. We have k elements and we append k more or delete k/2. Rebuilding the whole data structure is fine so far. A dynamic array (or a vector) works.

The operation that raises the question is the following. Sometimes we have to "split" a node having n children, which means we "divide" the children among different nodes. The way we dive is in continuous groups. The most efficient way for this split is, I guess, to have one pointer for each new node at the position where its children are and how many there are (let's say each node takes k children). But then, the number of these children may change and that should not affect its siblings (or even worse, the whole tree) i.e the execution time of an insertion should be O(k) and not O(n). How do we do it?

An easy but inefficient work-around would be that every time we split a node, we replace the "big" array of children with many (as many as the split parts) "small" dynamic arrays.

Each of the following "boxes" is a random access structure.

alt text

+3  A: 

How about a hashmap?

It's got amortized O(1) access, insertion and removal average case; O(n) worst case.

NullUserException
Note that amortized O(1) is the *best* case.
Billy ONeal
Using the simple work-around mentioned at the question we obtain `O(1)` worst-case access time and `O(1)` amortized time insertion/removal (`k` is the size of every "small" array and we always add/remove a bunch of `O(k)` elements). Furthemore what @Billy ONeal said.
myle
+1  A: 

From the description you gave of the tree structure you are implementing, it might be best to create a new data structure to mimic your tree. Especially if you're already tracking pointers between nodes.

If I understand your statement, each node in your tree would contain a vector of children node pointers. When you need to split the node, you could create new nodes with each one receiving a segment of the vector of children pointers, and the newly created nodes would be inserted into the node vector of the parent node.

for example:

N1->N2->(n3,n4,n5,n6,n7,n8) split N2 into two nodes: N1->(N2_1, N2_2) with N2_1->(n3,n4,n5) and N2_2->(n6,n7,n8)

(Sorry, I don't know how to easily draw trees...)

This way, you are only relinking memory rather than copying, and access will generally be log n. Furthermore, this gives a proper representation of the tree structure in code.

edit adding example:

Suppose again, we have N1->N2->(n3,n4,n5,n6,n7,n8). If N1 should need to have new nodes added, the only affect is on the N1 node: N1->(N2,N9)->(n3,n4,n5,n6,n7,n8)

The structure of a node might be like this (very simplified):

class Node {
  vector<Node *> children;
  Node * parent;
};

The larger tree structure would be many of these nodes all linked together much like a binary tree. To add nodes to any one node on a tree would only add items to the children member of that node. Nothing else is affected.

JoshD
You are right. As far as splitting is concerned, only relinking takes place. But the number of children of a particular node can grow due to external reasons. That is where the problem arises. How we do it efficiently without affecting all the nodes of the structure?
myle
As I said, the nodes would have `vectors<Node *>`. In that way, you can add new nodes to any given node without affecting anything else in the whole tree structure. As an example, in my answer node N1 could have five new nodes added to it, but it would only alter the vector of noes in N1. All other data in the tree (N2-n8) aren't affected. Does that work?
JoshD
How do you split efficiently the `vector` and then add new elements in its parts?
myle
If you wanted to split a node into three nodes (for example), you would create three new nodes, then assign the values from the old node into the new nodes. The first node would get the first third of the entries, the second node, the middle third, and so on. The complexity of this split would increase in proportion the the number of values in the vector. So if you split a node with k children in a tree with n nodes, it's O(k).
JoshD
I suppose I've misunderstood the question a bit. It seems you've already considered this option and found it unsuitable. I hope you find what you're searching for. Good luck!
JoshD
Thanks for your insightful comments. We suggest the same work-around here. At every split, delete the "old" `vector<Node *>` and build "smaller new" ones.
myle
@myle: To avoid complete reallocation of the vector, you could use a list and the splice function. That has move semantics, so you don't have to copy or reallocate the contents; it can be moved. That would give somewhat better performance for the split operation.
JoshD
+1  A: 

It looks to me like you're trying to reinvent the B+ Tree...

There is a variation of the B+ Tree which consists in storing the number of sub elements instead of a proper key, this allows efficient random access (log n) which factor you actually control.

For example, a common factor for databases is around 1,000, meaning that at the first level of the tree is the root, the second can contain up to 1,000 children, the third up to 1,000,000 etc...

If you have less than 1,000,000,000 objects, it means at most 3 dereferences, which is quite efficient.

There has been a python module written using this technic, called blist, which aimed at replacing the traditional list class (implemented as a C++ vector) with a B+ Tree to get more efficiency for large items. The performances measurements can be found here.

Matthieu M.
The difference with the `B+ Tree` is that a node may have an arbitrary number of children (instead of `b/2` < children < `b`) which changes (a lot) the procedures of insertion/deletion and the particular characteristics of the data structure.
myle
@myle: Yes I spotted that, but there is a reason for the B+ Tree to do so --> avoid (costly) reallocation. The objects are still moved around though.
Matthieu M.