views: 170
answers: 6

Hi, I've got two lists A and B, where B = A + C - D. All elements are unique; there are no duplicates. How do I get the lists of:
(1) the new items added, C
(2) the old items removed, D

C and D aren't more than 10,000 elements or so.

Edit

Crap, sorry guys, I forgot an important detail: these are both text files, not in-memory elements.
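
For what it's worth, loading them into arrays is simple enough; a minimal sketch, assuming one item per line (the filenames here are just placeholders):

<?php

// Hypothetical filenames; assumes one item per line in each file.
$listA = file('listA.txt', FILE_IGNORE_NEW_LINES | FILE_SKIP_EMPTY_LINES);
$listB = file('listB.txt', FILE_IGNORE_NEW_LINES | FILE_SKIP_EMPTY_LINES);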

A: 

function diffLists($listA, $listB) {
  $resultAdded = array();
  $resultRemoved = array();
  // Anything in B that is not in A was added.
  foreach ($listB as $item) {
    if (!in_array($item, $listA)) {
      $resultAdded[] = $item;
    }
  }
  // Anything in A that is not in B was removed.
  foreach ($listA as $item) {
    if (!in_array($item, $listB)) {
      $resultRemoved[] = $item;
    }
  }
  return array($resultAdded, $resultRemoved);
}



$myListA = array('item1','item2','item3');
$myListB = array('item1','item3','item4');
print_r(diffLists($myListA,$myListB));

This returns an array with two elements: the first is the list of items that were ADDED in list B, and the second is the list of items that were REMOVED (present in list A but not in list B).
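
For the sample inputs above, the print_r() output should look roughly like this (item4 was added, item2 was removed):

Array
(
    [0] => Array
        (
            [0] => item4
        )

    [1] => Array
        (
            [0] => item2
        )

)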

Mailslut
I'm not sure how this would be better than just using `array_diff()`. You are going to end up making about 600,000 calls to `in_array()` here.
Tom Haigh
@Tom - my knee-jerk reaction was to agree with you, as each of the foreach blocks in this solution duplicates array_diff. But I did some benchmarking with 10k and 20k element arrays and, rather shockingly, the foreach loop runs considerably faster than array_diff for sizable inputs. It is still orders of magnitude slower than the sort-and-search solution I've posted, though.
Kevin
Ah, array_diff is currently bugged in terms of performance: http://bugs.php.net/bug.php?id=47643
Kevin
A: 

You might want to try the Levenshtein algorithm if you want to do this more efficiently:

http://en.wikipedia.org/wiki/Levenshtein_distance

ZZ Coder
A: 

Searching for every value of A in B (and vice-versa) has O(n^2) complexity.

For large amounts of data, you're probably better off sorting each of the lists (O(n log n)), then making a single pass through the sorted lists to compute the added/removed elements. (This is relatively easy to do, since you know there are no duplicates.)
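
A minimal sketch of that single-pass merge, assuming both lists are already sorted and duplicate-free (index pointers are used so the inputs are not modified):

<?php

// $a and $b are sorted, duplicate-free arrays of comparable values.
function sortedDiff(array $a, array $b) {
    $added = array();   // in B but not in A
    $removed = array(); // in A but not in B
    $i = 0;
    $j = 0;
    $lenA = count($a);
    $lenB = count($b);
    while ($i < $lenA && $j < $lenB) {
        if ($a[$i] == $b[$j]) {
            // in both lists, advance both pointers
            $i++;
            $j++;
        } elseif ($a[$i] < $b[$j]) {
            $removed[] = $a[$i++];
        } else {
            $added[] = $b[$j++];
        }
    }
    // Whatever remains in A was removed; whatever remains in B was added.
    while ($i < $lenA) {
        $removed[] = $a[$i++];
    }
    while ($j < $lenB) {
        $added[] = $b[$j++];
    }
    return array($added, $removed);
}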

David Gelhar
+4  A: 

I think the size of the arrays is irrelevant unless you really need to focus on how performant this operation is going to be, i.e., you are going for a specific number of executions per unit of time.

If you just need to get it done, it seems pretty trivial to me using array_diff():

$a = array( 1, 2, 3, 4 );
$b = array( 1, 3, 5, 7 ); // 2 and 4 removed, 5 and 7 added

$c = array_diff( $b, $a ); // [5, 7]
$d = array_diff( $a, $b ); // [2, 4]
Peter Bailey
+1 for using the built-in function.
Lucas Jones
I think you dismiss the relevance of performance too quickly with this one. If the resulting arrays are ~10k rows each, as the OP says, the input arrays would each be at least that big, if not bigger. Try creating 10k element input arrays with $a = range(1,10000) and $b = range(2,20000,2) and running your method; for me it takes ~14s (forever for a web request). Increase that to 20k element input arrays with $a = range(1,20000) and $b = range(2,40000,2) and it takes 45s. The solution I've posted for this question takes 0.02s for the first case and 0.04s for the second.
Kevin
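
(For reference, a rough harness for reproducing that comparison; exact timings will of course vary with PHP version and hardware:)

<?php

// Illustrative timing of the array_diff approach on the inputs described above.
$a = range(1, 10000);
$b = range(2, 20000, 2);

$start = microtime(true);
$added   = array_diff($b, $a);
$removed = array_diff($a, $b);
printf("array_diff took %.2f seconds\n", microtime(true) - $start);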
@Kevin - I understand *all* of that - even before I posted. But who said this was for a web request? Your solution is fantastic - that doesn't mean you can't go for cheap brute force for a one-off.
Peter Bailey
Not to mention that I pretty well qualified where I was coming from in terms of performance. I think it's pretty inaccurate to say "I dismissed it."
Peter Bailey
@Peter - Fair enough, but the title of the question says 300k row lists. I got bored waiting for array_diff to finish on inputs that size (it's still going, 11+ minutes in, and my CPU is pegged), whereas the solution I posted still took < 2s. I am arguing that the array_diff solution dismisses the relevance of the size of the arrays too quickly. Even if this is not a web request, who wants to test code that takes this long to run for each test case?
Kevin
It appears that the speed problem of array_diff is a bug introduced after 5.2.4 (early 2009): http://bugs.php.net/bug.php?id=47643, so my comments will probably be completely irrelevant if/when that bug is ever resolved. The linked ticket does have a workaround that would work fine for this case and would be faster than the solution I provided.
Kevin
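
(One hash-based approach in that spirit - not necessarily the ticket's exact workaround - is to flip the arrays and diff on keys, which turns the lookups into hash lookups:)

<?php

// Works because all elements are unique, so array_flip loses nothing.
$added   = array_keys(array_diff_key(array_flip($b), array_flip($a)));
$removed = array_keys(array_diff_key(array_flip($a), array_flip($b)));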
@Kevin good find, and good to know!
Peter Bailey
+1 cold beer to you Peter!
alex
+3  A: 

The most efficient way to do this is to sort your lists first and access the elements of your arrays as few times as possible. An example:

<?php

sort($a, SORT_NUMERIC);
sort($b, SORT_NUMERIC);
$c = array(); // added (in B only)
$d = array(); // removed (in A only)
while (($currA = array_pop($a)) !== null) {
    while (($currB = array_pop($b)) !== null) {
        if ($currB == $currA) {
            // exists in both, skip value
            continue 2;
        }
        if ($currA > $currB) {
            // exists in A only, add to D, push B back on to the stack
            $d[] = $currA;
            $b[] = $currB;
            continue 2;
        }
        // exists in B only, add to C
        $c[] = $currB;
    }
    // exists in A only, for values of A < all of B
    $d[] = $currA;
}
// anything still left in B exists in B only, add to C
while (($currB = array_pop($b)) !== null) {
    $c[] = $currB;
}

This is going to perform orders of magnitude faster than 2 calls to array_diff even for lists that are only a few hundred elements long.

Kevin
+1 for speed. I'd only add that this is destructive to the original arrays - so if that matters to the OP he should make copies of them first.
Peter Bailey
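
(A note on that: PHP copies arrays on plain assignment, so preserving the originals is just a matter of something like this before the loops:)

<?php

// Arrays are assigned by value in PHP, so these are independent copies.
$aCopy = $a;
$bCopy = $b;
// ...then run the sort/array_pop loops on $aCopy and $bCopy instead.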
+1  A: 

You said you already have two files A and B.

Here's the easiest, fastest solution assuming you're running on a Unix system.

system("comm -13 A B > C");
system("comm -23 A B > D");

//read C and D in PHP
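
A minimal sketch of that last step, assuming one item per line in the generated files:

<?php

// Read the generated files back in, one element per line.
$c = file('C', FILE_IGNORE_NEW_LINES | FILE_SKIP_EMPTY_LINES);
$d = file('D', FILE_IGNORE_NEW_LINES | FILE_SKIP_EMPTY_LINES);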
Will Hartung
Stunningly beautiful! Works beautifully for me!! Thanks, Will!
Dave