So, suppose you have a collection of items. Each item has an identifier which can be represented using a bitfield. As a simple example, suppose your collection is: 0110, 0111, 1001, 1011, 1110, 1111

So, you then want to implement a function, Remove(bool bitval, int position). For example, a call to Remove(0, 2) would remove all items where index 2 (i.e. the 3rd bit, counting from the left) was 0. In this case, that would be 1001 only. Remove(1, 1) would remove 1110, 1111, 0111, and 0110. It is trivial to come up with an O(n) collection where this is possible (just use a linked list), with n being the number of items in the collection. In general the number of items to be removed is going to be O(n) (assuming a given bit has a ≥ c% chance of being 1 and a ≥ c% chance of being 0, where c is some constant > 0), so "better" algorithms which are somehow O(l), with l being the number of items being removed, are unexciting.

Is it possible to define a data structure where the average (or better yet, worst case) removal time is better than O(n)? A binary tree can do pretty well (just remove all left/right branches at height m, where m is the index being tested), but I'm wondering if there is any way to do better (and quite honestly, I'm not sure how to remove all left or right branches at a particular height in an efficient manner). Alternatively, is there a proof that doing better is not possible?

Edit: I'm not sure exactly what I'm expecting in terms of efficiency (sorry Arno), but a basic explanation of its possible application is thus: Suppose we are working with a binary decision tree. Such a tree could be used for a game tree or a puzzle solver or whatever. Further suppose the tree is small enough that we can fit all of the leaf nodes into memory. Each such node is basically just a bitfield listing all of the decisions. Now, if we want to prune arbitrary decisions from this tree, one method would be to just jump to the height where a particular decision is made and prune the left or right side of every node (left meaning one decision, right meaning the other). Normally in a decision tree you only want to prune one subtree at a time (since the parent of that subtree is different from the parents of other subtrees, and thus a decision which should be pruned in one subtree should not be pruned from others), but in some types of situations this may not be the case. Further, you normally only want to prune everything below a particular node, but in this case you'll be leaving some stuff below the node while also pruning below other nodes in the tree.

Anyhow, this is somewhat of a question based on curiosity; I'm not sure it's practical to use any results, but I am interested in what people have to say.

Edit: Thinking about it further, I think the tree method is actually O(n / log n), assuming the tree is reasonably dense. Proof: Suppose you have a binary tree with n items. Its height is log(n). Removing half the bottom row requires n/2 removals; removing half the row above requires n/4; and so on. The sum of operations over all rows is n-1, so the average number of removals per call is (n-1)/log(n).
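The sum in that argument is easy to check numerically; this sketch (illustrative Python, assuming n is a power of two) just evaluates n/2 + n/4 + ... + 1:

```python
# Verify: in a full binary tree with n leaves, removing half of each
# row costs n/2 + n/4 + ... + 1 = n - 1 operations in total,
# i.e. (n - 1)/log(n) per possible removal height.
n = 1024
height = n.bit_length() - 1                      # log2(n) = 10
total = sum(n >> (k + 1) for k in range(height)) # n/2 + n/4 + ... + 1
print(total, total / height)                     # n - 1, and (n - 1)/log(n)
```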

+1  A: 

Provided the length of your bitfields is limited, the following may work:

  • First, represent the bitfields that are in the set as an array of booleans, so in your case (4 bit bitfields), new bool[16];
  • Transform this array of booleans into a bitfield itself, so a 16-bit bitfield in this case, where each bit represents whether the bitfield corresponding to its index is included

Then operations become:

  • Remove(0, 0) = and with bitmask 1010101010101010
  • Remove(1, 0) = and with bitmask 0101010101010101
  • Remove(0, 2) = and with bitmask 1111000011110000

Note that more complicated 'add/remove' operations could then also be added as O(1) bit-logic.

The only down-side is that extra work is needed to interpret the resulting 16-bit bitfield back into a set of values, but with lookup arrays that might not turn out too bad either.
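A sketch of this encoding in Python (illustrative only; note it numbers positions from the left to match the question's Remove(0, 2) example, so the computed masks won't literally equal the LSB-first mask strings above):

```python
K = 4                 # length of the original bitfields
UNIVERSE = 2 ** K     # 16 possible values, one bit each in the big bitfield

def remove_mask(bitval, position):
    """Mask over all 2**K values keeping only those whose bit at
    `position` (counted from the left, as in the question) != bitval."""
    keep = 0
    for v in range(UNIVERSE):
        if (v >> (K - 1 - position)) & 1 != bitval:
            keep |= 1 << v
    return keep

def decode(set_bits):
    """Expand the big bitfield back into its member values."""
    return [v for v in range(UNIVERSE) if (set_bits >> v) & 1]

# The question's collection: 0110, 0111, 1001, 1011, 1110, 1111
s = 0
for v in (0b0110, 0b0111, 0b1001, 0b1011, 0b1110, 0b1111):
    s |= 1 << v

s &= remove_mask(0, 2)   # Remove(0, 2): a single AND drops 1001
```

The masks can be precomputed once per (bitval, position) pair, so each Remove really is one AND over the 2^K-bit word.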

Addendum:
Additional down-sides:

  • Once the size of an integer is exceeded, every added bit in the original bit-fields will double the storage space. However, this is not much worse than a typical scenario using another collection, where you have to store on average half the possible bitmask values (provided the typical scenario doesn't store far fewer remaining values).

  • Once the size of an integer is exceeded, every added bit also doubles the number of 'and' operations needed to implement the logic.

So basically, I'd say if your original bitfields are not much larger than a byte, you are likely better off with this encoding; beyond that, you're probably better off with the original strategy.

Further addendum:
If you only ever execute Remove operations, which over time thins out the set state-space further and further, you may be able to stretch this approach a bit further (no pun intended) by making a more clever abstraction that somehow only keeps track of the int values that are non-zero. Detecting zero values may not be as expensive as it sounds either if the JIT knows what it's doing, because a CPU 'and' operation typically sets the 'zero' flag if the result is zero.

As with all performance optimizations, this one'd need some measurement to determine if it is worthwhile.

jerryjvl
Also note that you can extend this mechanism by using multiple ints if you need more than 32/64 combinations. There may be a nice class abstraction to be built on top of this.
jerryjvl
But the length of your bitmask is of the same order as the number of items in the collection. So, the 'and' operation is still O(N).
Igor Krivokon
No it's not... the CPU can perform an integer AND in a single clock (in most/all modern processors, and still a constant number of clocks on older processors).
jerryjvl
Hmmm. This is an interesting solution, and I can imagine it being much faster than using a linked-list type solution as I suggest in my question. It's definitely cheating, but there's nothing wrong with that :)
Brian
@jerryjvl/Igor: A CPU can perform an integer AND in a single clock on a specific number of bits (that number varies based on chip, but I imagine a 32 bit chip can do it with 32 bits). Hence my remark that it is cheating.
Brian
@Brian: The nature of your question and its conditions really only leaves 'cheating' as an option, since essentially your problem as defined is linear in either the stored number of values or the number of possible values... in which case what you already have is as-good-as-it-gets.
jerryjvl
@jerryjvl: but in the common case the size of the collection is greater than 32 (or 64). If N is so small, the question about complexity is misleading.
Igor Krivokon
One possible improvement to this is to just maintain a list of removed elements (using a pair of bitfields). When iterating over the list, one can just skip items based on bitwise operations. This works basically the same as yours, but makes removal O(1) in exchange for slowing down lookup and iteration by a constant factor compared to your algorithm. This method has the advantage of needing less space, assuming your set is not filled.
Brian
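The pair-of-bitfields scheme Brian describes might look like this (an illustrative Python sketch, numbering positions from the left as in the question):

```python
K = 4
removed_if_0 = 0   # positions where Remove(0, p) was called, as a K-bit mask
removed_if_1 = 0   # positions where Remove(1, p) was called

def remove(bitval, position):
    """O(1): just record the rule; nothing is scanned."""
    global removed_if_0, removed_if_1
    bit = 1 << (K - 1 - position)      # position counted from the left
    if bitval == 0:
        removed_if_0 |= bit
    else:
        removed_if_1 |= bit

def alive(item):
    """Skip test used during iteration: an item survives only if it has
    a 1 wherever 0s were removed and a 0 wherever 1s were removed."""
    return (~item & removed_if_0) == 0 and (item & removed_if_1) == 0

items = [0b0110, 0b0111, 0b1001, 0b1011, 0b1110, 0b1111]
remove(0, 2)                                 # constant-time removal
survivors = [x for x in items if alive(x)]   # 1001 is skipped
```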
Ah, I just made a further addition that'd probably improve my solution as well... I think the first step regardless is to put all this behind a class abstraction so that it is easy to measure the trade-offs between different implementations.
jerryjvl
In response to comments about cheating, I analyzed the tree method I proposed further. It's actually O(n / log n), which is sublinear in the average case. Yeah, it's still not really that much better, and it adds enough overhead that these performance benefits are overshadowed until n is large enough that there are bigger problems.
Brian
A: 

If each decision bit and position are listed as objects, {bit value, k-th position}, you end up with an array of length 2*k. If you then represent each item as a linked list (of length k) whose node values are pointers to the {bit, position} objects, you can "invalidate" a bunch of items at once by simply deleting a {bit, position} object. This would require you, upon searching the list of items, to look for "complete" items (which makes search REALLY slow).

So something like: [{0,0}, {1,0}, {0,1}, {1, 1}, {0,2}, {1, 2}, {0,3}, {1,3}]

and linked from "0100", represented as: {0->3->4->6}

You wouldn't know which items were invalid until you tried to find them (so it doesn't really limit your search space, which is what you're after).

Oh well, I tried.

Jeff Meatball Yang
I'm not sure what you're trying to suggest. Just keeping a pair of bitfields representing which items were removed will accomplish fast removal, with no knowledge of what has been removed until you try to look it up.
Brian
A: 

Sure, it is possible (even if this is "cheating"). Just keep a stack of Remove objects:

struct Remove {
    bool set;
    int index;
};

The remove function just pushes an object on the stack. Voilà, O(1).

If you wanted to get fancy, your stack couldn't exceed (number of bits) entries without containing duplicate or impossible scenarios.

The rest of the collection has to apply the logic whenever things are withdrawn or iterated over.

Two ways to do insert into the collection:

  1. Apply the Remove rules upon insert, to clear out the stack, making insert O(n). Gotta pay somewhere.
  2. Each bitfield has to store its index in the remove stack, to know which rules apply to it. Then the stack size limit above wouldn't matter.
Todd Gardner
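A sketch of that lazy-removal stack in Python (illustrative; positions count from the left, as in the question):

```python
K = 4
remove_stack = []   # each (bitval, position) tuple plays the role of a Remove object

def remove(bitval, position):
    remove_stack.append((bitval, position))   # O(1) push

def matches(item, bitval, position):
    return (item >> (K - 1 - position)) & 1 == bitval

def iterate(items):
    """The collection pays for removals here, during iteration:
    O(n * stack size), or O(n) for a constant bitfield size."""
    for x in items:
        if not any(matches(x, b, p) for b, p in remove_stack):
            yield x

items = [0b0110, 0b0111, 0b1001, 0b1011, 0b1110, 0b1111]
remove(0, 2)
remaining = list(iterate(items))   # everything except 1001
```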
I've already suggested this, though in my version you just keep two bitfields and test against objects during iteration. That has the advantage of making it cost less when you're forced to actually pay the piper...unless the bitfield is long and there aren't a lot of removals. In truth, I suspect a real implementation probably would use that kind of solution.
Brian
That would lock you into a linear insert, while the above could do a O(1) insert, O(1) remove, and a O(n) iterate over (assuming constant bitfield size)
Todd Gardner
@Todd: After responding with why it didn't lock you into linear insert, I realized that I was relying on the rather odd assumption that no inserts will ever occur after a remove operation...which is kind of a stupid assumption, though in a practical application it may be true.
Brian
A: 

If you use an array to store your binary tree, you can quickly index any element (the children of the node at index n are at indices (n+1)*2 and (n+1)*2 - 1). All the nodes at a given level are stored sequentially: the first node at level x is at index 2^x - 1, and there are 2^x elements at that level.
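In Python, those index formulas (0-based array layout) come out as:

```python
def children(i):
    # children of the node stored at index i
    return (i + 1) * 2 - 1, (i + 1) * 2     # i.e. 2i + 1 and 2i + 2

def level_indices(x):
    # the 2**x nodes at depth x occupy a contiguous block of the array
    first = 2 ** x - 1
    return first, first + 2 ** x - 1        # inclusive index range

# e.g. the root's children sit at indices 1 and 2,
# and depth 2 occupies indices 3..6
```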

Unfortunately, I don't think this really gets you much of anywhere from a complexity standpoint. Removing all the left nodes at a level is O(n/2) worst case, which is of course O(n). Of course the actual work depends on which bit you are checking, so the average may be somewhat better. This also requires O(2^k) memory, k being the bitfield length, which is much worse than the linked list and not practical at all.

I think what this problem is really asking is for a way to efficiently partition a set of sets into two sets. Using a bitset to describe the set gives you a fast check for membership, but doesn't seem to lend itself to making the problem any easier.

Dolphin