views:

285

answers:

5

I'm working in a LAMP environment, so PHP is the language; alternatively, I can use Python.

As the title says, I have two unsorted integer arrays.

$array_A = array(13, 4, 59, 38, 9, 69, 72, 93, 1, 3, 5);

$array_B = array(29, 72, 21, 3, 6);

I want to know how many integers these arrays have in common; in the example, as you can see, the result is 2. I'm not interested in which integers are in common, such as (72, 3).

I need a faster method than taking every element of array B and checking whether it's in array A (O(n×m)).

The arrays can be sorted with asort or with SQL ordering (they come from an SQL result).

An idea that came to me is to create a 'vector' for each array, where each integer is a position that gets the value 1, and integers not present get 0.

So, for array A (starting at pos 1)

(1, 0, 1, 1, 1, 0, 0, 0, 1, 0, ...)

Same for array B

(0, 0, 1, 0, 0, 1, ...)

And then compare these two vectors in one loop. The problem is that this way the vector length is about 400K.
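The presence-vector idea above can be sketched in PHP like this (a minimal sketch; it assumes all values are positive integers, and the variable names are illustrative). Note that one vector plus a single pass over the other array is enough; there is no need to build and compare two vectors:

```php
<?php
// Sketch of the proposed "presence vector" approach.
// Assumes all values are positive integers.
$array_A = array(13, 4, 59, 38, 9, 69, 72, 93, 1, 3, 5);
$array_B = array(29, 72, 21, 3, 6);

// The vector must cover the full value range of both arrays.
$max = max(max($array_A), max($array_B));

// Build a presence vector for A: position = integer, value = 1 if present.
$vector_A = array_fill(1, $max, 0);
foreach ($array_A as $v) {
    $vector_A[$v] = 1;
}

// Walk B once and count hits against the vector.
$count = 0;
foreach ($array_B as $v) {
    if ($vector_A[$v]) {
        $count++;
    }
}
// $count is 2 here (72 and 3 appear in both arrays)
```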

A: 

If both arrays came from SQL, could you not write an SQL query with an inner join on the 2 sets of data to get your result?
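For reference, such a query might look like the following (the table and column names are hypothetical, since the actual schema isn't shown):

```sql
-- Hypothetical schema: each array lives in its own table or result set.
SELECT COUNT(*)
FROM table_a a
INNER JOIN table_b b ON a.value = b.value;
```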

Patrick McDonald
I already have this in SQL and it's too slow :| so I want to check whether there is a way to speed it up. The fact is that SQL scans a 2.2 million record table, and this slows things down a lot.
avastreg
That sounds like a failure to use indexes correctly. MySQL ought to be able to do this very quickly if there are good indexes and they are used properly.
acrosman
Yes, it sounds like that, but this is a table with two integer fields. The primary key is the combination of the two, and there is a B-tree index on each of the fields; I don't think there are other possibilities. I don't want to go into detail, but this is an operation that I have to do many, many times; it's a batch job anyway.
avastreg
For what it's worth I suggest you try re-asking the problem as an SQL issue (describing what you're doing at that level), and see if there is a better way to handle it at that level.
acrosman
I think I'll do that.
avastreg
+2  A: 

The simplest way would be:

count(array_intersect($array_A, $array_B));

if I understand what you're after. Should be fast.
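Applied to the arrays from the question, that one-liner looks like this:

```php
<?php
$array_A = array(13, 4, 59, 38, 9, 69, 72, 93, 1, 3, 5);
$array_B = array(29, 72, 21, 3, 6);

// array_intersect returns the values present in both arrays;
// count gives how many there are.
$common = count(array_intersect($array_A, $array_B));
// $common is 2 (the shared values are 72 and 3)
```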

Greg
should be fast, but ...isn't ;-)
VolkerK
This is the answer; anyway, for my problem it remains slower than SQL. I wanted to make sure that a better method didn't exist.
avastreg
Are you saying that this solution is slower than your SQL solution?
Patrick McDonald
A: 

You want the array_intersect() function. From there you can count the result. Don't worry about speed until you know you have a problem. The built-in functions execute much faster than anything you'll be able to write in PHP.

acrosman
+2  A: 

I don't know a great deal about PHP so you may get a more specific answer from others, but I'd like to present a more language-agnostic approach.

By checking every element in A against every element in B, it is indeed O(n²) [I'll assume the arrays are of identical length here to simplify the equations, but the same reasoning holds for arrays of differing lengths].

If you were to sort the data in both arrays, you could reduce the time complexity to O(n log n) or similar, depending on the algorithm chosen.

But you need to keep in mind that the complexity only really becomes important for larger data sets. If those two arrays you gave were typical of the size, I would say don't sort it, just use the "compare everything with everything" method - sorting won't give you enough of an advantage over that. Arrays of 50 elements would still only give you 2,500 iterations (whether that's acceptable to PHP, I don't know, it would certainly be water off a duck's back for C and other compiled languages).

And before anyone jumps in and states that you should plan for larger data sets just in case, that's YAGNI, as unnecessary as premature optimization. You may never need it in which case you've wasted time that would have been better spent elsewhere. The time to implement that would be when it became a problem (that's my opinion of course, others may disagree).

If the data sets really are large enough to make the O(n²) approach unworkable, I think sorting and then walking through the arrays in parallel is probably your best bet.
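That sort-then-walk approach can be sketched in PHP as follows (a sketch, not the poster's code; it assumes the values within each array are distinct, as in the question's example):

```php
<?php
// Count common elements by sorting both arrays and walking them in parallel.
// O(n log n + m log m) for the sorts, O(n + m) for the walk.
function countCommonSorted(array $a, array $b) {
    sort($a);
    sort($b);
    $i = 0;
    $j = 0;
    $count = 0;
    while ($i < count($a) && $j < count($b)) {
        if ($a[$i] < $b[$j]) {
            $i++;                 // smallest value is only in A: advance A
        } elseif ($a[$i] > $b[$j]) {
            $j++;                 // smallest value is only in B: advance B
        } else {
            $count++;             // equal: one common value found
            $i++;
            $j++;
        }
    }
    return $count;
}

$n = countCommonSorted(array(13, 4, 59, 38, 9, 69, 72, 93, 1, 3, 5),
                       array(29, 72, 21, 3, 6));
// $n is 2
```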

One other possibility: if the range of numbers is not too big, then your proposed solution of a vector of booleans is quite workable, since that would be O(n): walking both arrays to populate the vector, followed by comparisons of fixed locations within the two vectors. But I'm assuming your range is too large, or you wouldn't have already mentioned the 400K requirement. Again, the size of the data sets will dictate whether or not that's worth doing.

paxdiablo
My range is too large; anyway, good analysis :)
avastreg
+3  A: 

Depending on your data (size), you might want to use array_intersect_key() instead of array_intersect(). Apparently the implementation of array_intersect (testing PHP 5.3) does not use any optimization or caching whatsoever, but loops through the array and compares the values one by one for each element in array A. A hashtable lookup is incredibly faster than that.

<?php
function timefn($fn) {
    static $timer = array();
    if (is_null($fn)) {
        return $timer;
    }
    $x = range(1, 120000);
    $y = range(2, 100000);
    foreach ($y as $k => $v) { if (0 === $k % 3) unset($y[$k]); }

    $s = microtime(true);
    $fn($x, $y);
    $e = microtime(true);

    @$timer[$fn] += $e - $s;
}

function fnIntersect($x, $y) {
    $z = count(array_intersect($x,$y));
}

function fnFlip($x, $y) {
    $x = array_flip($x);
    $y = array_flip($y);
    $z = count(array_intersect_key($x, $y));
}


for ($i=0; $i<3; $i++) {
    timefn( 'fnIntersect' );
    timefn( 'fnFlip' );
}

print_r(timefn(null));

prints

Array
(
    [fnIntersect] => 11.271192073822
    [fnFlip] => 0.54442691802979
)
which means the array_flip/intersect_key method is ~20 times faster on my notebook. (as usual: this is an ad hoc test. If you spot an error, tell me ...I'm expecting that ;-) )

VolkerK
Interesting, but one of the two arrays varies from 70k down to 30k elements, so I think the array_flip is the bottleneck; I tested this following your suggestion, but it came out a little slower :(
avastreg
Strange, because my test uses ~100k elements for both arrays. It has a higher memory peak, so this _might_ be an issue if your server's memory/heap is under stress. I'll revise the test...
VolkerK
I mean that it is slower compared to the SQL query, not compared to plain array_intersect. Sorry for omitting that information.
avastreg
Ah, I see ;-) Phew... and I spent ~30 min trying to come up with a test case that "ticks off" the array_flip method. Yes, I would be surprised if any method that involves first transferring _all_ records from the DB server to the PHP process's memory and then comparing them could be faster than doing it "within" the database.
VolkerK