Hello everybody,

quoting Wikipedia:

It is perfectly acceptable to use a traditional binary tree data structure to implement a binary heap. There is an issue with finding the adjacent element on the last level on the binary heap when adding an element which can be resolved algorithmically...

Any ideas on how such an algorithm might work?

I was not able to find any information about this issue, for most binary heaps are implemented using arrays.

Any help appreciated.

+1  A: 

How about performing a depth-first search, visiting the left child before the right child, to determine the height of the tree? Thereafter, the first leaf you encounter with a shorter depth, or a parent with a missing child, indicates where you should place the new node before "bubbling up".
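
A rough C++ sketch of that two-pass idea might look like this (the Node layout and function names here are illustrative assumptions, not from the answer):

struct Node {
    int data;
    Node* pLeft;
    Node* pRight;
};

// Height of a complete tree = length of its leftmost path.
static int height(const Node* root) {
    int h = 0;
    for (const Node* n = root; n->pLeft != nullptr; n = n->pLeft) ++h;
    return h;
}

// DFS, left child before right: return the first node above the last
// level that is missing a child, i.e. the parent for the next insertion.
// Returns nullptr if this subtree is completely filled.
static Node* firstGap(Node* n, int depth, int h) {
    if (n == nullptr || depth >= h) return nullptr;
    if (n->pLeft == nullptr || n->pRight == nullptr) return n;
    if (Node* p = firstGap(n->pLeft, depth + 1, h)) return p;
    return firstGap(n->pRight, depth + 1, h);
}

// Parent under which the new node should be attached (fill pLeft first).
Node* findInsertionParent(Node* root) {
    int h = height(root);
    if (Node* p = firstGap(root, 0, h)) return p;
    // Tree is perfect: the new node starts a new level as the left
    // child of the first leaf found (the leftmost one).
    Node* n = root;
    while (n->pLeft != nullptr) n = n->pLeft;
    return n;
}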


The depth-first search (DFS) approach above doesn't assume that you know the total number of nodes in the tree. If this information is available, then we can "zoom in" quickly to the desired place, by making use of the properties of complete binary trees:

Let N be the total number of nodes in the tree, and H be the height of the tree.

Some values of (N,H) are (1,0), (2,1), (3,1), (4,2), ..., (7,2), (8,3). The general formula relating the two is H = ceil[log2(N+1)] - 1. Now, given only N, we want to traverse from the root to the position for the new node, in the least number of steps, i.e. without any "backtracking". We first compute the total number of nodes M in a perfect binary tree of height H = ceil[log2(N+1)] - 1, which is M = 2^(H+1) - 1.

If N == M, then our tree is perfect, and the new node should be added in a new level. This means that we can simply perform a DFS (left before right) until we hit the first leaf; the new node becomes the left child of this leaf. End of story.

However, if N < M, then there are still vacancies in the last level of our tree, and the new node should be added to the leftmost vacant spot. The number of nodes that are already at the last level of our tree is just (N - 2^H + 1). This means that the new node takes spot X = (N - 2^H + 2) from the left, at the last level.

Now, to get there from the root, you will need to make the correct turns (L vs R) at each level so that you end up at spot X at the last level. In practice, you would determine the turns with a little computation at each level (a sketch follows the table below). However, I think the following table shows the big picture and the relevant patterns without getting mired in the arithmetic (you may recognize this as a form of arithmetic coding for a uniform distribution):

0 0 0 0 0 X 0 0 <--- represents the last level in our tree, X marks the spot!
          ^
L L L L R R R R <--- at level 0, proceed to the R child
L L R R L L R R <--- at level 1, proceed to the L child
L R L R L R L R <--- at level 2, proceed to the R child 
          ^                      (which is the position of the new node)
          this column tells us
          if we should proceed to the L or R child at each level
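
For instance, the table above corresponds to N = 12 (H = 3, so the last level has 8 spots and the new node takes spot X = 12 - 8 + 2 = 6). In the usual 1-based level-order numbering, the same turns can be read straight off the binary representation of the new node's index N+1: drop the leading 1 bit, then each remaining bit is a turn, 0 = left, 1 = right. A small C++ sketch of this (illustrative names, not from the answer):

#include <cstdio>

// Print the L/R turns from the root to the slot of the new node in a
// complete binary tree that currently has N nodes. In 1-based level-order
// numbering the new node gets index N+1; its binary digits after the
// leading 1 spell the path (0 = left, 1 = right).
void printPathToNewNode(unsigned N) {
    unsigned target = N + 1;                 // index of the new node
    int msb = 0;                             // position of target's highest set bit
    for (unsigned t = target; t > 1; t >>= 1) ++msb;
    for (int level = msb - 1; level >= 0; --level)
        std::putchar(((target >> level) & 1) ? 'R' : 'L');
    std::putchar('\n');
}

int main() {
    printPathToNewNode(12);                  // 13 = 0b1101 -> prints "RLR", matching the table
    return 0;
}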

EDIT: Added a description on how to get to the new node in the shortest number of steps assuming that we know the total number of nodes in the tree.

Zach Scrivena
+3  A: 

Basically, the statement quoted refers to the problem of resolving the location for insertion and deletion of data elements into and from the heap. In order to maintain "the shape property" of a binary heap, the lowest level of the heap must always be filled from left to right, leaving no empty nodes. To preserve the binary heap's expected performance, you must be able to determine, in constant time, both the location for the next insertion and the location of the last node on the lowest level (the node that replaces the root on deletion).

For a binary heap stored in an array (with its implicit, compacted data structure as explained in the Wikipedia entry), this is easy. Just insert the newest data member at the end of the array and then "bubble" it into position (following the heap rules). Or replace the root with the last element in the array "bubbling down" for deletions. For heaps in array storage, the number of elements in the heap is an implicit pointer to where the next data element is to be inserted and where to find the last element to use for deletion.
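
For comparison, here is a minimal sketch of that array version (assuming a 0-based std::vector used as a min-heap, so the parent of index i is (i-1)/2):

#include <cstddef>
#include <utility>
#include <vector>

// Insert into an array-backed min-heap: append at the end (the implicit
// insertion point), then "bubble up" until the heap property holds.
void heapInsert(std::vector<int>& heap, int value) {
    heap.push_back(value);
    std::size_t i = heap.size() - 1;
    while (i > 0) {
        std::size_t parent = (i - 1) / 2;
        if (heap[parent] <= heap[i]) break;   // parent is smaller: done
        std::swap(heap[i], heap[parent]);     // bubble up one level
        i = parent;
    }
}
// Deleting the root is symmetric: overwrite heap[0] with heap.back(),
// pop_back(), and "bubble down" from the root.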

For a binary heap stored in a tree structure, this information is not as obvious, but because it's a complete binary tree, it can be calculated. For example, in a complete binary tree with 4 elements, the point of insertion will always be the right child of the left child of the root node. The node to use for deletion will always be the left child of the left child of the root node. And for any given arbitrary tree size, the tree will always have a specific shape with well-defined insertion and deletion points. Because the tree is a "complete binary tree" with a specific structure for any given size, it is entirely possible to calculate the location of insertion/deletion in O(1) time.

However, the catch is that even when you know where it is structurally, you have no idea where the node will be in memory. So, you have to traverse the tree to get to the given node, which is an O(log n) process, making all inserts and deletions a minimum of O(log n) and breaking the usually desired O(1) behavior. Any search ("depth-first", or some other) will be at least O(log n) as well because of the traversal issue noted, and usually O(n) because of the random nature of the semi-sorted heap.

The trick is to be able to both calculate and reference those insertion/deletion points in constant time, either by augmenting the data structure ("threading" the tree, as mentioned in the Wikipedia article) or by using additional pointers.

The implementation that seems easiest to me to understand, with low memory and coding overhead, is to just use a normal simple binary tree structure (using a pRoot and Node defined as [data, pParent, pLeftChild, pRightChild]) and add two additional pointers (pInsert and pLastNode). pInsert and pLastNode will both be updated during the insertion and deletion subroutines to keep them current when the data within the structure changes. This implementation gives O(1) access to both the insertion point and the last node of the structure, and should allow preservation of overall O(1) behavior in both insertion and deletion. The cost of the implementation is two extra pointers and some minor extra code in the insertion/deletion subroutines (i.e., minimal).

EDIT: added pseudocode for an O(1) insert()

Here is pseudocode for an insert subroutine which is O(1) on average:

define Node = [T data, *pParent, *pLeft, *pRight]

void insert(T data)
{
    do_insertion( data );  // do insertion, update count of data items in tree

    // assume: pInsert points to the node of the tree where the insertion just took place
    // (i.e., either shuffle only data during the insertion or keep pInsert updated during the bubble process)

    int N = this->CountOfDataItems + 1;  // 1-based index of the *next* slot; CountOfDataItems is always > 0 (and pRoot != null) after an insertion

    p = new Node( <null>, null, null, null );  // new empty node for the next insertion

    // update pInsert (three cases to handle)
    if ( (N & (N - 1)) == 0 )
    {
        // #1 - N is an exact power of two: O(log2(N))
        // the tree is currently a full complete binary tree ("perfect"),
        // so the next slot starts a new level: traverse from pRoot down
        // through each pLeft until an empty pLeft is found
        pInsert = pRoot;
        while (pInsert->pLeft != null) { pInsert = pInsert->pLeft; }  // log2(N) iterations
        p->pParent = pInsert;
        pInsert->pLeft = p;
    }
    else if ( isOdd(N) )
    {
        // #2 - N is odd: O(1)
        // the next slot is the right sibling of the node just filled
        p->pParent = pInsert->pParent;
        pInsert->pParent->pRight = p;
    }
    else
    {
        // #3 - N is even (and NOT a power of 2): O(1)
        // the next slot is the left child of the sibling of pInsert's parent
        // (note: this reaches that sibling via the grandparent, which only
        // works when pInsert's parent is itself a left child)
        p->pParent = pInsert->pParent->pParent->pRight;
        pInsert->pParent->pParent->pRight->pLeft = p;
    }
    pInsert = p;

    // update pLastNode
    // ... [similar process]
}

So, insert(T) is O(1) on average: exactly O(1) in all cases except when the tree must grow by one level, which costs O(log N) and happens only when the count reaches a power of two (i.e., log N times over the course of N insertions, assuming no deletions). The addition of another pointer (pLeftmostLeaf) could make insert() O(1) for all cases and avoids the possible pathological case of alternating insertion and deletion in a full complete binary tree. (Adding pLeftmostLeaf is left as an exercise [it's fairly easy].)

Roy
You understand the problem, but the O(1) argument you provide is incorrect. For example, how do you update pInsertionNode after an insertion?
@stefan.ciobaca: I agree that @Roy has not addressed that issue. However, it's easy with the addition of more pointers, say a couple in each node that "thread" across the levels of the tree. (In terms of the array, each array element points one forward and one back.)
A. Rex
Actually, given a known tree size for a complete binary tree, you can calculate where in the tree the next insert/delete nodes will be and move to them in O(1) *average* time [upper bound O(log n)] (using basically the same analysis as that determining the O(1) average insert+bubble time for heaps).
Roy
You do have to have a pointer to parent in the node structure in order to accomplish the O(1) traversal, which I inadvertently left out of the Node definition. I've now edited the post to include the revised definition. And I'll try to post some pseudocode for the O(1) traversal later today.
Roy
@roy: I think your solution doesn't produce O(1) time: Imagine you're trying to insert immediately after the rightmost child of the left subtree of a tree with height >2: you're going to have to do _two_ traversals: one all the way up to the root, and another all the way back down to the leftmost child of the right sub-tree. You're never going to know exactly how many "pParent->pParent"s you'll need, I think. Which I think means the time has got to be avg(1, 2*log h)?
quodlibetor
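
For reference, the up-then-down walk quodlibetor describes can be written directly; it is O(log n) worst case but needs no node count. A hypothetical sketch, reusing Roy's Node layout with a pParent pointer:

struct Node {
    int data;
    Node* pParent;
    Node* pLeft;
    Node* pRight;
};

// From the current last node, find the parent under which the next
// node must be attached (to its pLeft if empty, otherwise its pRight).
Node* nextInsertionParent(Node* last) {
    Node* cur = last;
    // 1. Climb while cur is a right child.
    while (cur->pParent != nullptr && cur == cur->pParent->pRight)
        cur = cur->pParent;
    if (cur->pParent != nullptr) {
        if (cur->pParent->pRight == nullptr)
            return cur->pParent;          // next slot is cur's right sibling
        cur = cur->pParent->pRight;       // 2. cross to cur's sibling subtree
    }
    // 3. Descend leftmost (from the root itself if the tree was perfect,
    //    in which case the new node starts a new level).
    while (cur->pLeft != nullptr)
        cur = cur->pLeft;
    return cur;
}
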
A: 

I recently registered an OpenID account and am not able to edit my initial post or comment on answers. That's why I am responding via this answer. Sorry for this.


quoting Mitch Wheat:

@Yse: is your question "How do I find the last element of a binary heap"?

Yes, it is. Or to be more precise, my question is: "How do I find the last element of a non-array-based binary heap?".

quoting Suppressingfire:

Is there some context in which you're asking this question? (i.e., is there some concrete problem you're trying to solve?)

As stated above, I would like to know a good way to "find the last element of a non-array-based binary heap" which is necessary for insertion and deletion of nodes.

quoting Roy:

It seems most understandable to me to just use a normal binary tree structure (using a pRoot and Node defined as [data, pLeftChild, pRightChild]) and add two additional pointers (pInsertionNode and pLastNode). pInsertionNode and pLastNode will both be updated during the insertion and deletion subroutines to keep them current when the data within the structure changes. This gives O(1) access to both insertion point and last node of the structure.

Yes, this should work. If I am not mistaken, it could be a little bit tricky to find the insertion node and the last node when their locations move to another subtree due to a deletion/insertion. But I'll give this a try.

quoting Zach Scrivena:

How about performing a depth-first search...

Yes, this would be a good approach. I'll try that out, too.

Still I am wondering if there is a way to "calculate" the locations of the last node and the insertion point. The height of a binary heap with N nodes can be calculated by taking the base-2 logarithm of the smallest power of two that is larger than N, and subtracting one. Perhaps it is possible to calculate the number of nodes on the deepest level, too. Then it might be possible to determine how the heap has to be traversed to reach the insertion point or the node for deletion.

Sorry if my post wasn't clear enough about the "calculation". I've modified it to be more explanatory.
Roy
@Yse: I've updated my answer to reply to your question.
Zach Scrivena
@yse: send an email to one of the moderators, they might be able to reassociate your accounts.
Roy