I have 2 arrays (A and B) that contain similar data with some differences. I would like to return an array of objects that are only in A and another array of objects that are only in B. So far I have been thinking:

  1. Brute force with some optimizations (this is trivial)
  2. Sort the arrays and use a binary search.

What are my other options? This is not a homework assignment, but I think it sounds like one, so I tagged it that way. Any languages/solutions are fair game.

+6  A: 

You could sort both arrays, then do a linear scan through both at the same time. That would be O(n log n) for the sorts and O(n) for the scan / building of the new arrays.
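For concreteness, here is a minimal sketch of that idea in Python, assuming the elements are mutually comparable (the function name is just for illustration):

def sorted_diff(a, b):
    # Sort copies of both arrays, then walk them in lockstep.
    a, b = sorted(a), sorted(b)
    only_a, only_b = [], []
    i = j = 0
    while i < len(a) and j < len(b):
        if a[i] == b[j]:
            i += 1
            j += 1
        elif a[i] < b[j]:
            only_a.append(a[i])   # a[i] cannot appear later in b
            i += 1
        else:
            only_b.append(b[j])   # b[j] cannot appear later in a
            j += 1
    # Whatever remains in either array has no counterpart in the other.
    only_a.extend(a[i:])
    only_b.extend(b[j:])
    return only_a, only_b

The sort dominates at O(n log n); the walk itself touches each element only once.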

Brian R. Bondy
A: 

Try using Sets. They usually have a difference() method (or something like it) that returns exactly what you want. Simple as that. Since the question is language-agnostic, creating the sets and turning the difference back into an array are shown here as generic helper methods.

Set A = createSetA();
Set B = createSetB();

Array onlyAElements = transformToArray(A.difference(B));
Array onlyBElements = transformToArray(B.difference(A));

Alternatively, you could sort both arrays and get both difference arrays at the same time. Something like:

int aIndex = 0;
int bIndex = 0;

// A and B are int[] arrays, both already sorted in ascending order.
// (Needs java.util.List and java.util.ArrayList.)
List<Integer> aOnly = new ArrayList<>();
List<Integer> bOnly = new ArrayList<>();

while (aIndex < A.length && bIndex < B.length)
{
    if (A[aIndex] == B[bIndex])
    {
        aIndex++;
        bIndex++;
    }
    else if (A[aIndex] < B[bIndex])
    {
        aOnly.add(A[aIndex]);   // A[aIndex] cannot appear later in B
        aIndex++;
    }
    else
    {
        bOnly.add(B[bIndex]);   // B[bIndex] cannot appear later in A
        bIndex++;
    }
}

// Whatever is left in either array has no counterpart in the other.
while (aIndex < A.length) { aOnly.add(A[aIndex++]); }
while (bIndex < B.length) { bOnly.add(B[bIndex++]); }

The two small loops at the end drain whatever is left once one of the arrays is exhausted, so the indices never go out of bounds. The code is just to convey the main idea.

Samuel Carrijo
Just what I was going to say. Python's sets module, for example, gives you difference() or simply the "-" operator. http://docs.python.org/library/sets.html
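For example, with Python's built-in set type (a rough sketch; it assumes the elements are hashable and quietly drops duplicates):

a = [3, 1, 4, 1, 5]
b = [1, 5, 9]

only_in_a = list(set(a) - set(b))   # difference: in a but not in b
only_in_b = list(set(b) - set(a))   # in b but not in a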
MatrixFrog
I think he's looking for the algorithm, which the set abstraction hides. There are plenty of easy one-liners for this (think LINQ), but they don't really teach us anything, and we have no idea how efficient they are without reading the documentation.
JoshJordan
The two algorithms for sets that I know of are hash sets and tree sets. Google or SO search for those terms.
MatrixFrog
Agree about sets, but the question did ask for an algorithm, not a data structure. Perhaps you could expand your answer to show the algorithm behind the sets.
Brian R. Bondy
Well, the difference() method does all the work for you, usually efficiently. But I added some code anyway.
Samuel Carrijo
A: 

I don't have an implementation or algorithm beyond what's already been said, but I thought I would leave this C#/LINQ solution for anyone who finds this question and wants to do this:

    // Requires: using System.Linq;
    var a = new int[] { 1, 2, 3, 6, 7, 8, 9, 10 };
    var b = new int[] { 1, 2, 3, 4, 5, 6, 7 };

    int[] addedToA = a.Except(b).ToArray();      // elements only in a
    int[] missingFromA = b.Except(a).ToArray();  // elements only in b

    foreach (var i in addedToA)
    {
        Console.Write("{0} ", i);
    }
    Console.WriteLine();
    foreach (var i in missingFromA)
    {
        Console.Write("{0} ", i);
    }

This prints:

8 9 10
4 5
Robert Cartaino
+1  A: 

A lot of this is going to depend on what type of data you have. You mention sorting, so I take it the elements are comparable. With sets of size m and n, sorting will take O(m lg m + n lg n), and that will dominate. (Asymptotically, it won't matter whether you then do binary searches or walk both lists; walking both lists is O(m + n).) Of course, if your data admits a better sort, such as integers with radix sort, you should be able to get the whole thing down to O(m + n).

Using sets (as others are suggesting) implicitly suggests using hashing, which will definitely make your problem easier. Hash all the elements of A (O(m)) and store them in a hash set in memory, then hash each element of B (O(n)) and check whether it collides with something already in the set. This becomes a matter for optimization: you have to evaluate a classic speed-memory trade-off. The larger your hash set, the quicker the collision checks will be. This will run in O(m + n).
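A rough sketch of that approach in Python, assuming the elements are hashable (the function name is made up for illustration):

def difference(b, a):
    # Elements of b that never appear in a, using a hash set built from a.
    seen = set(a)                           # O(m) insertions
    return [x for x in b if x not in seen]  # O(n) expected-constant lookups

Running it in both directions gives the two arrays the question asks for.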

It's worth noting that you can prove any algorithm that does what you ask must take at least Ω(m + n) time, since every input element has to be looked at.

David Berger
@David: when comparing the sorting and hash-table approaches you also need to consider 1) the cost of computing the hash function versus the cost of a comparison (optimized for the not-equals case), and 2) whether the hash function gives a good spread.
Stephen C
@Stephen absolutely! I didn't want to get into those considerations because they tend to require assumptions about the inputs which we don't have.
David Berger
+2  A: 

I'd stuff array A's elements into a hash table, then iterate over array B, doing lookups in that table to efficiently determine which elements of B are also in A (and, by exclusion, which are only in B). Then do the same in the other direction: put B's elements into a hash table and iterate over array A. That would be O(N) throughout.
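A short sketch of that two-way version in Python, again assuming hashable elements (the names are illustrative):

def diff_both_ways(a, b):
    # One hash table per array: O(N) to build both, O(N) to scan both.
    in_a, in_b = set(a), set(b)
    only_a = [x for x in a if x not in in_b]   # lookups against B's table
    only_b = [x for x in b if x not in in_a]   # lookups against A's table
    return only_a, only_b

Unlike converting both arrays straight to sets, this keeps duplicates and preserves the original order of the surviving elements.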

Jeremy Friesner
Hash tables tend to give faster algorithms, but ones that usually use more memory. Good answer, by the way.
Samuel Carrijo