views: 170
answers: 6

Hi, I've got two lists A and B, where B = A + C - D. All elements are unique; there are no duplicates. How do I get the lists of:
(1) the new items added, C
(2) the old items removed, D

C and D aren't more than 10,000 elements or so.

Edit

Crap, sorry guys, I forgot an important detail: these are both text files, not in-memory elements.
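
For what it's worth, loading them into arrays is simple enough; a minimal sketch, assuming one item per line (the filenames here are just placeholders):

<?php

// Hypothetical filenames; assumes one item per line in each file.
$listA = file('listA.txt', FILE_IGNORE_NEW_LINES | FILE_SKIP_EMPTY_LINES);
$listB = file('listB.txt', FILE_IGNORE_NEW_LINES | FILE_SKIP_EMPTY_LINES);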

A: 

function diffLists($listA, $listB) {
  $resultAdded = array();
  $resultRemoved = array();
  // Anything in B that is not in A was added.
  foreach ($listB as $item) {
    if (!in_array($item, $listA)) {
      $resultAdded[] = $item;
    }
  }
  // Anything in A that is not in B was removed.
  foreach ($listA as $item) {
    if (!in_array($item, $listB)) {
      $resultRemoved[] = $item;
    }
  }
  return array($resultAdded, $resultRemoved);
}



$myListA = array('item1','item2','item3');
$myListB = array('item1','item3','item4');
print_r(diffLists($myListA,$myListB));

This returns an array with two elements: the first is the list of items that were ADDED in list B, and the second is the list of items that were REMOVED (present in list A but not in list B).
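
For the sample inputs above, the print_r() output should look roughly like this (item4 was added, item2 was removed):

Array
(
    [0] => Array
        (
            [0] => item4
        )

    [1] => Array
        (
            [0] => item2
        )

)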

Mailslut
I'm not sure how this would be better than just using `array_diff()`. You are going to end up making about 600,000 calls to `in_array()` here.
Tom Haigh
@Tom - my knee-jerk reaction was to agree with you, as each of the foreach blocks in this solution duplicates array_diff. But I did some benchmarking with 10k and 20k element arrays and, rather shockingly, the foreach loop runs considerably faster than array_diff for sizable inputs. It is still orders of magnitude slower than the sort-and-search solution I've posted, though.
Kevin
Ah, array_diff is currently bugged in terms of performance: http://bugs.php.net/bug.php?id=47643
Kevin
A: 

You might want to try the Levenshtein algorithm if you want to do this more efficiently:

http://en.wikipedia.org/wiki/Levenshtein_distance

ZZ Coder
A: 

Searching for every value of A in B (and vice-versa) has O(n^2) complexity.

For large amounts of data, you're probably better off sorting each of the lists (O(n log n)), then making a single pass through the sorted lists to compute the added/removed elements. (This is relatively easy to do, since you know there are no duplicates.)
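
A minimal sketch of that single-pass merge, assuming both lists are already sorted and duplicate-free (index pointers are used so the inputs are not modified):

<?php

// $a and $b are sorted, duplicate-free arrays of comparable values.
function sortedDiff(array $a, array $b) {
    $added = array();   // in B but not in A
    $removed = array(); // in A but not in B
    $i = 0;
    $j = 0;
    $lenA = count($a);
    $lenB = count($b);
    while ($i < $lenA && $j < $lenB) {
        if ($a[$i] == $b[$j]) {
            // in both lists, advance both pointers
            $i++;
            $j++;
        } elseif ($a[$i] < $b[$j]) {
            $removed[] = $a[$i++];
        } else {
            $added[] = $b[$j++];
        }
    }
    // Whatever remains in A was removed; whatever remains in B was added.
    while ($i < $lenA) {
        $removed[] = $a[$i++];
    }
    while ($j < $lenB) {
        $added[] = $b[$j++];
    }
    return array($added, $removed);
}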

David Gelhar
+4  A: 

I think the size of the arrays is irrelevant unless you really need to focus on how performant this operation is going to be, i.e., you are going for a specific number of executions per unit of time.

If you just need to get it done, it seems pretty trivial to me using array_diff():

$a = array( 1, 2, 3, 4 );
$b = array( 1, 3, 5, 7 ); // 2 and 4 removed, 5 and 7 added

$c = array_diff( $b, $a ); // [5, 7]
$d = array_diff( $a, $b ); // [2, 4]
Peter Bailey
+1 for using the built-in function.
Lucas Jones
I think you dismiss the relevance of performance too quickly with this one. If the resulting arrays are ~10k rows each, as the OP says, the input arrays would each be at least that big, if not bigger. Try creating 10k element input arrays with $a = range(1,10000) and $b = range(2,20000,2) and running your method; for me it takes ~14s (forever for a web request). Increase that to 20k element input arrays with $a = range(1,20000) and $b = range(2,40000,2) and it takes 45s. The solution I've posted for this question takes 0.02s for the first case and 0.04s for the second.
Kevin
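
(For reference, a rough harness for reproducing that comparison; exact timings will of course vary with PHP version and hardware:)

<?php

// Illustrative timing of the array_diff approach on the inputs described above.
$a = range(1, 10000);
$b = range(2, 20000, 2);

$start = microtime(true);
$added   = array_diff($b, $a);
$removed = array_diff($a, $b);
printf("array_diff took %.2f seconds\n", microtime(true) - $start);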
@Kevin - I understand *all* of that - even before I posted. But who said this was for a web request? Your solution is fantastic - that doesn't mean you can't go for cheap brute force for a one-off.
Peter Bailey
Not to mention that I pretty well qualified where I was coming from in terms of performance. I think it's pretty inaccurate to say "I dismissed it."
Peter Bailey
@Peter - Fair enough, but the title of the question says 300k row lists. I got bored waiting for array_diff to finish on inputs that size (it's still going, 11+ minutes in, and my CPU is pegged), whereas the solution I posted still took < 2s. I am arguing that the array_diff solution dismisses the relevance of the size of the arrays too quickly. Even if this is not a web request, who wants to test code that takes this long to run for each test case?
Kevin
It appears that the speed problem of array_diff is a bug introduced after 5.2.4 (early 2009): http://bugs.php.net/bug.php?id=47643, so my comments will probably be completely irrelevant if/when that bug is ever resolved. The linked ticket does have a workaround that would work fine for this case and would be faster than the solution I provided.
Kevin
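
(One hash-based approach in that spirit - not necessarily the ticket's exact workaround - is to flip the arrays and diff on keys, which turns the lookups into hash lookups:)

<?php

// Works because all elements are unique, so array_flip loses nothing.
$added   = array_keys(array_diff_key(array_flip($b), array_flip($a)));
$removed = array_keys(array_diff_key(array_flip($a), array_flip($b)));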
@Kevin good find, and good to know!
Peter Bailey
+1 cold beer to you Peter!
alex
+3  A: 

The most efficient way to do this is to sort your lists first and access the elements of your arrays as few times as possible. An example:

<?php

sort($a, SORT_NUMERIC);
sort($b, SORT_NUMERIC);
$c = array(); // added (in B only)
$d = array(); // removed (in A only)
while (($currA = array_pop($a)) !== null) {
    while (($currB = array_pop($b)) !== null) {
        if ($currB == $currA) {
            // exists in both, skip value
            continue 2;
        }
        if ($currA > $currB) {
            // exists in A only, add to D, push B back on to the stack
            $d[] = $currA;
            $b[] = $currB;
            continue 2;
        }
        // exists in B only, add to C
        $c[] = $currB;
    }
    // exists in A only, for values of A < all of B
    $d[] = $currA;
}
// anything still left in B exists in B only, add to C
while (($currB = array_pop($b)) !== null) {
    $c[] = $currB;
}

This is going to perform orders of magnitude faster than 2 calls to array_diff even for lists that are only a few hundred elements long.

Kevin
+1 for speed. I'd only add that this is destructive to the original arrays - so if that matters to the OP he should make copies of them first.
Peter Bailey
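
(A note on that: PHP copies arrays on plain assignment, so preserving the originals is just a matter of something like this before the loops:)

<?php

// Arrays are assigned by value in PHP, so these are independent copies.
$aCopy = $a;
$bCopy = $b;
// ...then run the sort/array_pop loops on $aCopy and $bCopy instead.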
+1  A: 

You said you already have two files A and B.

Here's the easiest, fastest solution assuming you're running on a Unix system.

system("comm -13 A B > C");
system("comm -23 A B > D");

//read C and D in PHP
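
A minimal sketch of that last step, assuming one item per line in the generated files:

<?php

// Read the generated files back in, one element per line.
$c = file('C', FILE_IGNORE_NEW_LINES | FILE_SKIP_EMPTY_LINES);
$d = file('D', FILE_IGNORE_NEW_LINES | FILE_SKIP_EMPTY_LINES);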
Will Hartung
Stunningly beautiful! Works beautifully for me!! Thanks, Will!
Dave