I have two sets A and B.

A
--
1
2
6

B
--
1
2
3
4

When I compare set A with B, I need to get the value 6 as output, and the values 3 and 4 as output when set B is compared against A.

What would be the best algorithm to do this? I have written one, but it has quadratic complexity. It iterates over one set and, inside that loop, iterates over the second set to check whether the value exists. This seems inefficient.

Context

I have a set of values in the database which I am showing in the UI. Users can add or remove items in the list and press the "Save changes" button, which persists all the changes to the database. So I need to insert the newly added items into the database and delete the removed ones.

I pass in the first set, which contains both the newly added items and the items that already existed. I load another set containing all the items from the database. If I compare set A (the new list) with set B (the database list) and take the items that exist in A but not in B, I get all the newly added items. Set B is then compared against set A, and the items that exist in B but not in A are the deleted ones. I am open to suggestions for a better algorithm.

Any help would be great.

A: 

In Python

>>> A = {1, 2, 6}
>>> B = {1, 2, 3, 4}
>>> A - B
{6}
>>> B - A
{3, 4}

Assuming you don't have a builtin set type
Pseudocode (valid Python, with a dict standing in for the hash table):

# This computes the items of B that are not in A
a = dict.fromkeys(A)   # Hopefully you at least have some sort of hash type
result = []            # empty list
for item in B:
    if item not in a:  # average O(1) lookup in the hash table
        result.append(item)
gnibbler
Thanks. I am trying to do this in C#. I couldn't find a predefined algorithm that does set difference.
Appu
Here is a discussion of set alternatives for C#: http://stackoverflow.com/questions/183685/c-set-collection
gnibbler
If your C# is new enough you can use HashSet http://msdn.microsoft.com/en-us/library/bb299875.aspx
gnibbler
+1  A: 

If both sets are sorted, one can start at the beginning of both and walk through them together, comparing the current elements: equal elements are in both sets, and otherwise the smaller element is missing from the other set and only that side advances. This works in linear time.

For unsorted sets, first sorting them in O(n*log(n)) time and then comparing them in linear time gives a total time complexity of O(n*log(n)). Depending on the details of your application, it might also be possible to keep the sets sorted all the time, which makes it easy to compare them when needed.
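
A minimal Python sketch of the sorted walk described above, assuming both inputs are sorted lists without duplicates. The function name sorted_difference is made up for illustration and is not from any library.

```python
def sorted_difference(a, b):
    """Return (only_in_a, only_in_b) for two sorted, duplicate-free lists."""
    only_in_a, only_in_b = [], []
    i = j = 0
    while i < len(a) and j < len(b):
        if a[i] == b[j]:
            # Present in both sets: skip it on both sides.
            i += 1
            j += 1
        elif a[i] < b[j]:
            # a[i] is smaller than everything left in b, so b lacks it.
            only_in_a.append(a[i])
            i += 1
        else:
            # b[j] is smaller than everything left in a, so a lacks it.
            only_in_b.append(b[j])
            j += 1
    # Whatever remains on either side has no counterpart in the other.
    only_in_a.extend(a[i:])
    only_in_b.extend(b[j:])
    return only_in_a, only_in_b


A = sorted([6, 1, 2])
B = sorted([4, 3, 2, 1])
print(sorted_difference(A, B))  # ([6], [3, 4])
```

Each element is visited once, so the walk itself is O(n + n'); the sorted() calls are only needed when the inputs are not already kept in order.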

sth
A: 

Here is an answer from Microsoft. It looks O(n^2) to me, though.

using System;
using System.Collections.Generic;
using System.Linq;

class CompareLists
{
    static void Main()
    {
        // Create the IEnumerable data sources.
        string[] names1 = System.IO.File.ReadAllLines(@"../../../names1.txt");
        string[] names2 = System.IO.File.ReadAllLines(@"../../../names2.txt");

        // Create the query. Note that method syntax must be used here.
        IEnumerable<string> differenceQuery =
          names1.Except(names2);

        // Execute the query.
        Console.WriteLine("The following lines are in names1.txt but not names2.txt");
        foreach (string s in differenceQuery)
            Console.WriteLine(s);

        // Keep the console window open in debug mode.
        Console.WriteLine("Press any key to exit");
        Console.ReadKey();
    }
}
/* Output:
     The following lines are in names1.txt but not names2.txt
    Potra, Cristina
    Noriega, Fabricio
    Aw, Kam Foo
    Toyoshima, Tim
    Guy, Wey Yuan
    Garcia, Debra
     */
gnibbler
A: 

You could put both sets into balanced binary trees. Looking up one element in a tree of n elements is O(log n), so looking up all n' elements of one set in the other is O(n' log n), or just O(n log n).

If both sets are stored as sorted arrays, you can iterate through both arrays in lockstep in O(n + n'), or O(n), time to identify the elements missing from either set.
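
As a rough illustration of the O(n' log n) lookup idea, here is a Python sketch that swaps the balanced tree for a sorted list searched with the standard bisect module (Python has no builtin balanced tree, so this is a stand-in with the same O(log n) lookup cost). The helper names contains and difference_by_search are hypothetical.

```python
from bisect import bisect_left


def contains(sorted_items, value):
    """Binary search: O(log n) membership test on a sorted list."""
    i = bisect_left(sorted_items, value)
    return i < len(sorted_items) and sorted_items[i] == value


def difference_by_search(a, b):
    """Items of a that are not in b; b is sorted once, then searched per item."""
    sorted_b = sorted(b)
    return [x for x in a if not contains(sorted_b, x)]


print(difference_by_search([1, 2, 6], [1, 2, 3, 4]))  # [6]
print(difference_by_search([1, 2, 3, 4], [1, 2, 6]))  # [3, 4]
```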

Alex Reynolds
A: 

If you have access to a hash-set implementation (I believe Java, C#, and Python all have them), you can just construct two sets, A and B, and take the set difference. If set difference isn't defined, you can iterate over the elements of A and check whether B contains each one. A hash set is implemented with a hash table, so it can be constructed in expected linear time and membership can be tested in expected constant time. That means the total time will be linear in the sum of the set sizes.
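
Applied to the save scenario in the question, the hash-set approach might look like the following Python sketch. The function and variable names (plan_changes, ui_items, db_items) are invented for illustration, and the actual database calls are left out.

```python
def plan_changes(ui_items, db_items):
    """Return (to_insert, to_delete) given the UI list and the database list."""
    ui = set(ui_items)   # building each hash set is expected O(n)
    db = set(db_items)
    to_insert = ui - db  # in the UI but not yet in the database
    to_delete = db - ui  # in the database but removed in the UI
    return to_insert, to_delete


to_insert, to_delete = plan_changes([1, 2, 6], [1, 2, 3, 4])
print(to_insert)  # {6}
print(to_delete)  # {3, 4}
```

Both differences come out of a single pass over each set, so the whole comparison stays linear in the combined size of the two lists.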

PeterAllenWebb