So, here is my little problem.

Let's say I have a list of buckets a0 ... an which respectively contain L <= c0 ... cn < H items. I can choose the L and H limits, and could even update them dynamically, though I don't think it would help much.

The order of the buckets matters: I can't go and swap them around.

Now, I'd like to index these buckets so that:

  • I know the total count of items
  • I can look-up the ith element
  • I can add/remove items from any bucket and update the index efficiently

Seems easy, right? Seeing these criteria I immediately thought of a Fenwick Tree; that's exactly what they are meant for.
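To make the three criteria concrete, here is a minimal sketch of the Fenwick Tree idea in Python (the class and method names are mine, chosen for illustration): point updates, total count, and a "find the ith item" walk that descends the implicit tree by powers of two.

```python
class FenwickTree:
    """Fenwick (binary indexed) tree over bucket counts.

    Supports: total count, adding/removing items from a bucket,
    and locating the ith item -- each in O(log n)."""

    def __init__(self, n):
        self.n = n
        self.tree = [0] * (n + 1)   # 1-based internal array

    def add(self, bucket, delta):
        """Add delta items to the given bucket (0-based)."""
        i = bucket + 1
        while i <= self.n:
            self.tree[i] += delta
            i += i & (-i)           # jump to the next responsible node

    def prefix_sum(self, i):
        """Total items in buckets 0 .. i-1."""
        s = 0
        while i > 0:
            s += self.tree[i]
            i -= i & (-i)
        return s

    def total(self):
        return self.prefix_sum(self.n)

    def find(self, k):
        """Locate the kth item (1-based).

        Returns (bucket, offset) where offset is the 1-based position
        of the item inside that bucket. Works by binary lifting: try
        the largest power-of-two jump that stays below k."""
        pos = 0
        bitmask = 1 << self.n.bit_length()
        while bitmask:
            nxt = pos + bitmask
            if nxt <= self.n and self.tree[nxt] < k:
                k -= self.tree[nxt]
                pos = nxt
            bitmask >>= 1
        return pos, k   # pos buckets lie entirely before the kth item
```

For buckets of sizes 2, 1, 4, 8, `find(4)` returns `(2, 1)`: the 4th item is the first item of the third bucket. What this sketch does *not* do is insert or delete a bucket in the middle, which is exactly the sticking point below.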

However, when you think about the use cases, a few extra requirements creep in:

  • if a bucket count drops below L, the bucket must disappear (don't worry about the items yet)
  • if a bucket count reaches H, then a new bucket must be created because this one is full

I haven't figured out how to edit a Fenwick Tree efficiently: remove / add a node without rebuilding the whole tree...

Of course we could set L = 0, so that removing buckets would become unnecessary; however, adding items cannot really be avoided.

So here is the question:

Do you know either a better structure for this index, or how to update a Fenwick Tree?

The primary concern is efficiency, and because I do plan to implement it, cache/memory considerations are worth worrying about.

Background:

I am trying to come up with a structure somewhat similar to B-Trees and Ranked Skip Lists, but with a localized index. The problem with those two structures is that the index is kept along with the data, which is inefficient in terms of cache (i.e. you need to fetch multiple pages from memory). Database implementations suggest that keeping the index isolated from the actual data is more cache-friendly, and thus more efficient.

+2  A: 

I have understood your problem as:

Each bucket has an internal order and buckets themselves have an order, so all the elements have some ordering and you need the ith element in that ordering.

To solve that:

What you can do is maintain a 'cumulative value' tree where the leaf nodes (x1, x2, ..., xn) are the bucket sizes. The value of a node is the sum of the values of its immediate children. Keeping n a power of 2 makes it simple (you can always pad with zero-size buckets at the end) and the tree will be a complete tree.

Corresponding to each bucket you will maintain a pointer to the corresponding leaf node.

Eg, say the bucket sizes are 2,1,4,8.

The tree will look like

     15
    /  \
   3    12
  / \  / \
 2  1  4  8

If you want the total count, read the value of the root node.

If you want to modify some xk (i.e. change the corresponding bucket size), you walk up the tree following parent pointers, updating the values.

For instance, if you add 4 items to the second bucket the tree becomes (the nodes marked with * are the ones that changed):

     19*
    /   \
   7*    12
  / \   / \
 2  5*  4  8

If you want to find the ith element, you walk down the tree, effectively doing a binary search. At each node you know the left child and right child values. If i is greater than the left child's value, you subtract the left child's value from i and recurse into the right subtree; if i is less than or equal to the left child's value, you recurse into the left subtree.

Say you wanted to find the 9th element in the above tree:

Since the left child of the root is 7, and 7 < 9, you subtract 7 from 9 (to get 2) and go right.

Since 2 < 4 (the left child of 12), you go left.

You are at the leaf node corresponding to the third bucket. You now need to pick the second element in that bucket.

If you have to add a new bucket, you double the size of your tree (if needed) by adding a new root: the existing tree becomes the left child, and the right child is a new subtree of all zero-size buckets except the one you added (which will be the leftmost leaf of that new subtree). This is amortized O(1) time per added bucket. The caveat is that you can only add a bucket at the end, not anywhere in the middle.

Getting the total count is O(1). Updating a single bucket and looking up an item are O(log n).

Adding new bucket is amortized O(1).

Space usage is O(n).
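The whole scheme above can be sketched compactly by storing the complete tree in a flat array, heap-style (node i has children 2i and 2i+1), which is also the cache-friendly layout the question asks about. This is only an illustrative sketch; the class and method names are mine, and the append uses a full rebuild on each doubling, which is one simple way to realize the amortized O(1) bound mentioned above.

```python
class CountTree:
    """Complete binary tree in an array: leaves hold bucket sizes,
    each internal node holds the sum of its two children."""

    def __init__(self, sizes):
        n = 1
        while n < len(sizes):
            n *= 2                      # pad leaf count to a power of two
        self.n = n
        self.count = len(sizes)         # buckets actually in use
        self.tree = [0] * (2 * n)       # node i has children 2i and 2i+1
        self.tree[n:n + len(sizes)] = sizes
        for i in range(n - 1, 0, -1):   # build sums bottom-up
            self.tree[i] = self.tree[2 * i] + self.tree[2 * i + 1]

    def total(self):
        return self.tree[1]             # the root holds the total count

    def update(self, bucket, delta):
        """Change a bucket size; walk up via parent links (i // 2)."""
        i = self.n + bucket
        while i >= 1:
            self.tree[i] += delta
            i //= 2

    def find(self, k):
        """Locate the kth item (1-based): walk down, binary-searching
        on the child sums. Returns (bucket, offset-within-bucket)."""
        i = 1
        while i < self.n:
            left = self.tree[2 * i]
            if k > left:
                k -= left               # skip the whole left subtree
                i = 2 * i + 1
            else:
                i = 2 * i
        return i - self.n, k

    def append(self, size):
        """Add a bucket at the end. Rebuilding only when the tree is
        full (i.e. on each doubling) keeps this amortized O(1)."""
        if self.count == self.n:
            self.__init__(self.tree[self.n:self.n + self.count] + [size])
        else:
            self.count += 1
            self.update(self.count - 1, size)
```

Running the worked example: with sizes 2, 1, 4, 8, `total()` is 15; after `update(1, 4)` it is 19, and `find(9)` returns `(2, 2)`, the second element of the third bucket, exactly as in the walk-through above.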

Instead of a binary tree, you can probably do the same with a B-Tree.

Moron
I'm glad you look interested in the problem :) However, if I only had to add buckets at the end or the beginning, the problem would be somewhat easier. Instead I want to be able to insert buckets right in the middle. Your structure is still interesting in that regard; it looks much like a standard B-Tree. For example I could replace the leaf 4 by a subtree `7 (4 3)`, but that would unbalance the tree. You gave me food for thought :)
Matthieu M.
@Matthieu: It is an interesting problem :-) So did I understand it right? Also, the insertion, is it next to the bucket you modify or could it be arbitrary?
Moron
You understood it right. The insertion of a bucket comes in fact from the insertion of items at a given position. If you fill up the bucket at this position and the next bucket does not have enough space, you need to insert one or several buckets between them. Likewise it would be great if we could remove buckets when they are empty.
Matthieu M.
I've followed your suggestion and read about B-Trees. They were originally designed for systems with little memory and structures kept on disk. So my problem really is a simple transposition, since I have a small cache and a slow main memory :) Now all I need is to think about the implementation and how to fit the nodes into cache lines! Thanks.
Matthieu M.
A: 

I still hope for answers, however here is what I could come up so far, following @Moron suggestion.

Apparently my little Fenwick Tree idea cannot be easily adapted: it's easy to append new buckets at the end of a Fenwick Tree, but not to insert them in the middle, so it's kind of a lost cause.

We're left with two data structures: Binary Indexed Trees (ironically, the very name Fenwick used to describe his structure) and Ranked Skip Lists.

Typically, these do not separate the data from the index; however, we can get this behavior by:

  1. Use indirection: the element held by the node is a pointer to a bucket, not the bucket itself
  2. Use pool allocation so that the index elements, even though allocated independently from one another, are still close in memory, which should help the cache

I tend to prefer Skip Lists to Binary Trees because they are self-organizing, so I'm spared the trouble of constantly re-balancing my tree.

These structures would let me get to the ith element in O(log N); I don't know whether faster asymptotic performance is possible.
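For the skip-list route, the key trick is to store, on each forward link, its "width" (how many elements the link skips over): rank lookup then becomes a top-down walk subtracting widths. Here is a minimal sketch of such a ranked (indexable) skip list, with positional insert; the names and the fixed level cap are my own choices, and it indexes elements directly rather than buckets, purely to keep the illustration small.

```python
import random

class _Node:
    __slots__ = ("value", "forward", "width")
    def __init__(self, value, level):
        self.value = value
        self.forward = [None] * level   # next node at each level
        self.width = [1] * level        # elements skipped by that link

class RankedSkipList:
    """Skip list addressed by position: get(i) and insert(i, v)
    in expected O(log n), thanks to per-link widths."""

    MAX_LEVEL = 16

    def __init__(self):
        self.size = 0
        self.head = _Node(None, self.MAX_LEVEL)

    def _random_level(self):
        lvl = 1
        while lvl < self.MAX_LEVEL and random.random() < 0.5:
            lvl += 1
        return lvl

    def get(self, i):
        """Return the element at position i (0-based)."""
        assert 0 <= i < self.size
        node, remaining = self.head, i + 1   # distance left to travel
        for level in reversed(range(self.MAX_LEVEL)):
            while node.forward[level] is not None and node.width[level] <= remaining:
                remaining -= node.width[level]
                node = node.forward[level]
        return node.value

    def insert(self, i, value):
        """Insert value so that it ends up at position i (0-based)."""
        assert 0 <= i <= self.size
        chain = [None] * self.MAX_LEVEL        # last node visited per level
        steps_at_level = [0] * self.MAX_LEVEL  # distance walked per level
        node, total = self.head, 0
        for level in reversed(range(self.MAX_LEVEL)):
            while node.forward[level] is not None and total + node.width[level] <= i:
                total += node.width[level]
                steps_at_level[level] += node.width[level]
                node = node.forward[level]
            chain[level] = node
        lvl = self._random_level()
        new = _Node(value, lvl)
        steps = 0                              # distance chain[level] -> new
        for level in range(lvl):
            prev = chain[level]
            new.forward[level] = prev.forward[level]
            prev.forward[level] = new
            new.width[level] = prev.width[level] - steps
            prev.width[level] = steps + 1
            steps += steps_at_level[level]
        for level in range(lvl, self.MAX_LEVEL):
            chain[level].width[level] += 1     # links flying over the insert
        self.size += 1
```

To index buckets instead, each node would hold a pointer into the bucket pool (point 1 above), and the widths would be bucket item counts rather than 1.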

Another interesting implementation detail: I may have a pointer to an element, but if others have been inserted or removed since, how do I know the rank of my element now?

It's possible if the bucket points back to the node that owns it. But this means that either the node must never move, or it must update the bucket's back-pointer whenever it is moved around.

Matthieu M.
Sorry, could not help more. For a dynamic array of pointers, where you can insert/delete by position, you can probably use balanced trees with order statistics (basically maintaining count of descendants). Look at my answer here: http://stackoverflow.com/questions/3071497/list-or-container-o1-ish-insertion-deletion-performance-with-array-semantics/3071566#3071566
Moron