views:

206

answers:

7

I have two Sets. Set b is the subset of Set a. they're both very huge Sets. I want to subtract b from a , what's the best practice to do this common operation ? I've written to many codes like this , and I don't think it's efficient. what's your idea ?

pseudo code : (this is not Java API) .

for(int i = 0 ; i < a.size(); i++) {
          for (int j=0 ; j < b.size() ;j++) {
              // do comparison , if found equals ,remove from a
              break;
          }
 }

And I want to find an algorithm , not only applies to Sets, also works for Array.

EDIT: The Set here is not JAVA API , it's a data structure . so I don't care if Java API has a removeAll() method , I want find a common solution for this problem , I've encounter a lot of problems like this when I use Javascript and Actionscript .

+5  A: 

I don't think you will get it much faster, but your code will look simpler and will not become slower by a.removeAll(b);. removeAll() is part of the Java-API.

For efficiency analysis: Your given code example is O(n^2), that scales not very good, but also isn't the horriblest thing on earth (exponential complexity is the thing you don't want). As long as you don't know the internal organisation of the data in the Collection, you will not get a better performance. removeAll() is implemented by the class itself and knows about the internal organisation. So if the data is organized in a Hash, you may get better results, if the data is organized in an unsorted array, the complexity will be the same. A Set have to efficently lookup if a new item is already in the set, so I suspect some sort of Hash as internal representation, especially if the implementation is called HashSet. :-)

EDIT: The OP changed it's question to mention it is not only for Java. removeAll() is a Java-API, so this (or something similar) may not be available in other languages. As said before, if the collections are unsorted arrays with no other restrictions the two for-loops are already the fastest solution. But if the data is organized different you have faster options. If the two collections are sorted data (in my example comes the smallest element first), you can do the following (reducing the complexity to O(n)):

int bIndex = 0;
for(int i = 0 ; i < a.size(); i++) {
          while (a[i] < b[bIndex]) {bIndex++;}
          if (a[i] == b[bIndex]) {markForRemoval(a[i]);} // I mark this only for removal, as the actual removal would make your index incorrect
}

If the data is organized as a hash in both collections you also need only one for-loop, accessing directly the element in b. Other possible organizations of data are possible.

Mnementh
A: 

You have seen the removeAll method in the Set interface?

Also check out this stack overflow question.

extraneon
A: 

I believe you'll find java.util.HashSet.removeAll(Collection toRemove) to perform well. On the other hand, if you don't have sets but sorted collections, you might be able to do a lot better.

Tomislav Nakic-Alfirevic
Indeed the performance should be better with a hashtable, BST or other collection type optimized for random access.
Bart van Heukelom
+1  A: 

In the end, there's not much of a choice other than to one by one compare elements and remove the ones which are in both.

To do it another way, you'd have to do something fancy like give all set members a unique value index, and construct a huge array of booleans representing each set, and then you could do bit operations to subtract B from A. I have no idea if that would be faster, given the overhead of creating unique value indices and manipulating the very large bitmasks.

I know you don't care about a Java solution, but since other people have recommended removeAll(), I'd like to point out that it's still doing essentially the same thing under the covers. Check the source for HashSet.

CPerkins
But I see none of fast sorting algorithms iterate collections like this , only bubble sort, it's not fast enough and someone says it should be deprecated.
ZZcat
Correct, mostly removeAll() should do the same thing. But it is simpler and easier to read in code, and some removeAll-implementation could use better organisation of internal data, especially in a Set. A Set should use some sort of fast random-access, to quick decide if an element is already present. The most simply method is to sort the entries, and even this would reduce the complexity of the operation to O(n) (only one iteration through both collections is needed).
Mnementh
@Mnementh: Is is possible to reduce the complexities of two int[] arrays comparison to O(n) ?
ZZcat
@Tony: If the elements in the arrays are sorted, you can iterate through both in one loop.
Mnementh
@Tony and @Mnementh - I wasn't proposing a sorted array, but a pair of very large bitmasks using arrays - which would make them more positional than sorted, if you follow me. But then, yes, you iterate through both in one loop, doing bit operations on each chunk.
CPerkins
@CPerkins: It would be very glad to see your implementations of int[] comparison using bitmask . please :)
ZZcat
@CPerkins: I guess the method you're going to use is like this:http://www.ugrad.cs.ubc.ca/~cs490/sec202/notes/intro/bitmask.pdf
ZZcat
@Tony - yes, something like the ubc intro lecture, but that's a true bitmask... which are limited to just a few bytes worth of bits, which limits your set sizes. You said they were very large, so 32 or 64 positions just won't cut it.
CPerkins
+1  A: 
Pointy
+1  A: 

well, the correct idea was already pointed out: the set should be implemented using a hash. hashes ideally have O(1) access cost, so you can get O(min(m,n)) cost for the overall operation assuming you can determine which set is bigger (like maintaining a counter during insert/remove operations).

in actionscript 3, you'd use a Dictionary. just use elements as keys and values.

removing looks like this:

for each (var key:* in set2) {//a simple for-in loop will also do the trick, since keys and values are equal, but for-each-in loops perform faster
    delete set1[key];
}

in JavaScript, you will need to give the entries ids when inserting, so you can use those ids as keys in a map. simply map ids to original values.

removing looks like this:

for (var key in set2) {
    delete set1[key];
}

greetz

back2dos

back2dos
+1  A: 

Given that b is a subset of a I'm not sure why your pseudo-code has 2 loops. Mine would simply be:

foreach b in B
    remove b from A

In practice how the run time of this compares with the run time of yours depends on, among other things, how you have implemented the set as a data structure.

High Performance Mark
very inspirational.
ZZcat