ansaurus

Question

Answer 1

+2 A:

I have understood your problem as:

Each bucket has an internal order and buckets themselves have an order, so all the elements have some ordering and you need the ith element in that ordering.

To solve that:

What you can do is maintain a 'cumulative value' tree where the leaf nodes (x1, x2, ..., xn) are the bucket sizes. The value of a node is the sum of values of its immediate children. Keeping n a power of 2 will make it simple (you can always pad it with zero size buckets in the end) and the tree will be a complete tree.

Corresponding to each bucket you will maintain a pointer to the corresponding leaf node.

Eg, say the bucket sizes are 2,1,4,8.

The tree will look like

If you want the total count, read the value of the root node.

If you want to modify some xk (i.e. change correspond bucket size), you can walk up the tree following parent pointers, updating the values.

For instance if you add 4 items to the second bucket it will be (the nodes marked with * are the ones that changed)

     19*
    /   \
   7*    12
  / \   / \
 2  5*  4  8

If you want to find the ith element, you walk down the above tree, effectively doing the binary search. You already have a left child and right child count. If i > left child node value of current node, you subtract the left child node value and recurse in the right tree. If i <= left child node value, you go left and recurse again.

Say you wanted to find the 9th element in the above tree:

Since left child of root is 7 < 9. You subtract 7 from 9 (to get 2) and go right.

Since 2 < 4 (the left child of 12), you go left.

You are at the leaf node corresponding to the third bucket. You now need to pick the second element in that bucket.

If you have to add a new bucket, you double the size of your tree (if needed) by adding a new root, making the existing tree the left child and add a new tree with all zero buckets except the one you added (which we be the leftmost leaf of the new tree). This will be amortized O(1) time for adding a new value to the tree. Caveat is you can only add a bucket at the end, and not anywhere in the middle.

Getting the total count is O(1). Updating single bucket/lookup of item are O(logn).

Adding new bucket is amortized O(1).

Space usage is O(n).

Instead of a binary tree, you can probably do the same with a B-Tree.

Moron 2010-06-25 17:46:02

I'm glad you look interested in the problem :) However if I were to only add buckets at the end or the begin the problem would be somewhat easier. Instead I want to be able to insert buckets right in the middle. Your structure is still interesting in that regard, it looks much like a standard B-Tree. For example I could replace the leaf 4 by a sub-tree `7 (4 3)`, but it will unbalance the tree. You gave me food for thoughts :)

Matthieu M. 2010-06-26 12:29:15

@Matthieu: It is an interesting problem :-) So did I understand it right? Also, the insertion, is it next to the bucket you modify or could it be arbitrary?

Moron 2010-06-26 12:54:44

You understood it right. The insertion of a bucket comes in fact from the insertion of items at a given position. If you fill up the bucket at this position and the next bucket does not have enough space, you need to insert one or several buckets between them. Likewise it would be great if we could remove buckets when they are empty.

Matthieu M. 2010-06-26 15:46:40

I've followed your suggestion and read about B-Trees. They were originally written for systems with few memory and structures that were kept on disk. So my problem really is a simple transposition here, since I have a small cache and a slow main memory :) Now all I need is to think about the implementation and how to fit them into cache-lines! Thanks.

Matthieu M. 2010-06-27 14:32:18

Answer 2

A:

I still hope for answers, however here is what I could come up so far, following @Moron suggestion.

Apparently my little Fenwick Tree idea cannot be easily adapted. It's easy to append new buckets at the end of the fenwick tree, but not in it the middle, so it's kind of a lost cause.

We're left with 2 data structures: Binary Indexed Trees (ironically the very name Fenwick used to describe his structure) and Ranked Skip List.

Typically, this does not separate the data from the index, however we can get this behavior by:

Use indirection: the element held by the node is a pointer to a bucket, not the bucket itself
Use pool allocation so that the index elements, even though allocated independently from one another, are still close in memory which shall helps the cache

I tend to prefer Skip Lists to Binary Trees because they are self-organizing, so I'm spared the trouble of constantly re-balancing my tree.

These structures would allow to get to the ith element in O(log N), I don't know if it's possible to get faster asymptotic performance.

Another interesting implementation detail is I have a pointer to this element, but others might have been inserted/removed, how do I know the rank of my element now?

It's possible if the bucket points back to the node that owns it. But this means that either the node should not move or it should update the bucket's pointer when moved around.

Matthieu M. 2010-06-27 13:04:54

Sorry, could not help more. For a dynamic array of pointers, where you can insert/delete by position, you can probably use balanced trees with order statistics (basically maintaining count of descendants). Look at my answer here: http://stackoverflow.com/questions/3071497/list-or-container-o1-ish-insertion-deletion-performance-with-array-semantics/3071566#3071566

Moron 2010-06-27 15:01:10

ansaurus

tags:

views:

answers:

Indexing count of buckets

related questions