tags:

views:

198

answers:

6

I have two arrays, say A and B with |A|=8 and |B|=4. I want to calculate the set difference A-B. How do I proceed? Please note that there are no repeated elements in either of the sets.

Edit: Thank you so much everybody for a myriad of elegant solutions. Since I am in prototyping stage of my project, for now I implemented the simplest solution told by Brian and Owen. But I do appreciate the clever use of data structures as suggested here by the rest of you, even Though I am not a computer scientist but an engineer and never studied data structures as a course. Looks like it's about time I should really start reading CLRS which I have been procrastinating for quite a while :) Thanks again!

+3  A: 

Iterate over each element of A, if each of those elements are not in B, then add them to a new set C.

Brian R. Bondy
and how do I implement "if each of those elements are not in B"? This is exactly the point that I couldn't seem to get!
Aamir
@Aamir - you can either iterate over `B` if the set is unordered (giving you an `O(n*m)` runtime) or you can do a binary search on `B` if the set is ordered (giving you an `O(n log m)` runtime)
R Samuel Klatchko
This will do it (sorry about the formatting): int foundInB = 0; for (int j=0; j<blen; ++j) { if (a[i] == b[j]) { foundInB = 1; break; } } Here, you're checking whether the i'th index of a is in b.
Owen S.
Best if you can load B into a hash table so the "is it in B" test can be done in O(1).
Keith Randall
+2  A: 

It depends on how you want to represent your sets, but if they are just packed bits then you can use bitwise operators, e.g. D = A & ~B; would give you the set difference A-B if the sets fit into an integer type. For larger sets you might use arrays of integer types and iterate, e.g.

for (i = 0; i < N; ++i)
{
    D[i] = A[i] & ~B[i];
}
Paul R
+2  A: 

sort arrays A and B
result will be in C
let a - the first elem of A
let b - the first elem of B
then:
1) while a < b: insert a into C and a = next elem of A
2) while a > b: b = next elem of B
3) if a = b: a = next elem of A and b = next elem of B
4) if b goes to end: insert rest of A into C and stop
5) if a goes to end: stop

oraz
A: 

For larger sets I'd suggest sorting the numbers and iterating through them by emulating the code at http://www.cplusplus.com/reference/algorithm/set_difference/ which would be O(N*logN), but since the set sizes are so small, the solution given by Brian seems fine even though it's theoretically slower at O(N^2).

tsiki
The set difference you linked should be O(n), not O(n log n) - so long as the copy operation doesn't just do a bunch on inserts into a new tree. A well-written subrange copy for a binary tree is O(n).
Steve314
Ah I forgot to specify that I meant O(NlogN) assuming quicksort is used in the sorting phase).
tsiki
+2  A: 

The following assumes the sets are stored as a sorted container (as std::set does).

There's a common algorithm for merging two ordered lists to produce a third. The idea is that when you look at the heads of the two lists, you can determine which is the lower, extract that, and add it to the tail of the output, then repeat.

There are variants which detect the case where the two heads are equal, and treat this specially. Set intersections and unions are examples of this.

With a set asymmetric difference, the key point is that for A-B, when you extract the head of B, you discard it. When you extract the head of A, you add it to the input unless the head of B is equal, in which case you extract that too and discard both.

Although this approach is designed for sequential-access data structures (and tape storage etc), it's sometimes very useful to do the same thing for a random-access data structure so long as it's reasonably efficient to access it sequentially anyway. And you don't necessarily have to extract things for real - you can do copying and step instead.

The key point is that you step through the inputs sequentially, always looking at the lowest remaining value next, so that (if the inputs have no duplicates) you will the matched items. You therefore always know whether your next lowest value to handle is an item from A with no match in B, and item in B with no match in A, or an item that's equal in both A and B.

More generally, the algorithm for the set difference depends on the representation of the set. For example, if the set is represented as a bit-vector, the above would be overcomplex and slow - you'd just loop through the vectors doing bitwise operations. If the set is represented as a hashtable (as in the tr1 unordered_set) the above is wrong as it requires ordered inputs.

If you have your own binary tree code that you're using for the sets, one good option is to convert both trees into linked lists, work on the lists, then convert the resulting list to a perfectly balanced tree. The linked-list set-difference is very simple, and the two conversions are re-usable for other similar operations.

EDIT

On the complexity - using these ordered merge-like algorithms is O(n) provided you can do the in-order traversals in O(n). Converting to a list and back is also O(n) as each of the three steps is O(n) - tree-to-list, set-difference and list-to-tree.

Tree-to-list basically does a depth-first traversal, deconstructing the tree as it goes. Theres a trick for making this iterative, storing the "stack" in part-handled nodes - changing a left-child pointer into a parent-pointer just before you step to the left child. This is a good idea if the tree may be large and unbalanced.

Converting a list to a tree basically involves a depth-first traversal of an imaginary tree (based on the size, known from the start) building it for real as you go. If a tree has 5 nodes, for instance, you can say that the root will be node 3. You recurse to build a two-node left subtree, then grab the next item from the list for that root, then recurse to build a two-node right subtree.

The list-to-tree conversion shouldn't need to be implemented iteratively - recursive is fine as the result is always perfectly balanced. If you can't handle the log n recursion depth, you almost certainly can't handle the full tree anyway.

Steve314
Some relevant pseudo- or C-code example would top this off.
MizardX
A: 

Implement a set object in C. You can do it using a hash table for the underlying storage. This is obviously a non trivial exercise, but a few Open Source solutions exist. Then you simply need to add all the elements of A and then iterate over B and remove any that are elements of your set.

The key point is to use the right data structure for the job.

JeremyP