views:

285

answers:

5

I'm working in a LAMP environment, so PHP is the language; alternatively, I can use Python.

As the title says, I have two unsorted integer arrays.

$array_A = array(13, 4, 59, 38, 9, 69, 72, 93, 1, 3, 5);

$array_B = array(29, 72, 21, 3, 6);

I want to know how many integers these arrays have in common; in the example, as you can see, the result is 2. I'm not interested in which integers are in common, such as (72, 3).

I need a faster method than taking every element of array B and checking whether it's in array A (O(n×m)).

The arrays can be sorted with asort or with SQL ordering (they come from an SQL result).

An idea that came to me is to create a 'vector' for each array, where each integer is a position that gets the value 1, and integers not present get 0.

So, for array A (starting at pos 1)

(1, 0, 1, 1, 1, 0, 0, 0, 1, 0, ...)

Same for array B

(0, 0, 1, 0, 0, 1, ...)

And then compare these two vectors in one loop. The problem is that this way the vector length is about 400K.
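The presence-vector idea above can be sketched in PHP like this (a minimal sketch; it assumes all values are positive integers, and the variable names are illustrative). Note that one vector plus a single pass over the other array is enough; there is no need to build and compare two vectors:

```php
<?php
// Sketch of the proposed "presence vector" approach.
// Assumes all values are positive integers.
$array_A = array(13, 4, 59, 38, 9, 69, 72, 93, 1, 3, 5);
$array_B = array(29, 72, 21, 3, 6);

// The vector must cover the full value range of both arrays.
$max = max(max($array_A), max($array_B));

// Build a presence vector for A: position = integer, value = 1 if present.
$vector_A = array_fill(1, $max, 0);
foreach ($array_A as $v) {
    $vector_A[$v] = 1;
}

// Walk B once and count hits against the vector.
$count = 0;
foreach ($array_B as $v) {
    if ($vector_A[$v]) {
        $count++;
    }
}
// $count is 2 here (72 and 3 appear in both arrays)
```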

A: 

If both arrays came from SQL, could you not write an SQL query with an inner join on the 2 sets of data to get your result?
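For reference, such a query might look like the following (the table and column names are hypothetical, since the actual schema isn't shown):

```sql
-- Hypothetical schema: each array lives in its own table or result set.
SELECT COUNT(*)
FROM table_a a
INNER JOIN table_b b ON a.value = b.value;
```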

Patrick McDonald
I already have this in SQL and it's too slow :| so I want to check whether there is a way to speed it up. The fact is that SQL scans a 2.2 million record table, and this slows things down a lot.
avastreg
That sounds like a failure to use indexes correctly. MySQL ought to be able to do this very quickly if there are good indexes and they are used properly.
acrosman
Yes, it sounds like that, but this is a table with two integer fields. The primary key is the combination of the two, and there is a B-tree index on each of the fields; I don't think there are other possibilities. I don't want to go into detail, but this is an operation that I have to do many, many times; it's a batch job anyway.
avastreg
For what it's worth I suggest you try re-asking the problem as an SQL issue (describing what you're doing at that level), and see if there is a better way to handle it at that level.
acrosman
I think I'll do that.
avastreg
+2  A: 

The simplest way would be:

count(array_intersect($array_A, $array_B));

if I understand what you're after. Should be fast.
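Applied to the arrays from the question, that one-liner looks like this:

```php
<?php
$array_A = array(13, 4, 59, 38, 9, 69, 72, 93, 1, 3, 5);
$array_B = array(29, 72, 21, 3, 6);

// array_intersect returns the values present in both arrays;
// count gives how many there are.
$common = count(array_intersect($array_A, $array_B));
// $common is 2 (the shared values are 72 and 3)
```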

Greg
should be fast, but ...isn't ;-)
VolkerK
This is the answer; anyway, for my problem it remains slower than SQL. I wanted to make sure that a better method didn't exist.
avastreg
Are you saying that this solution is slower than your SQL solution?
Patrick McDonald
A: 

You want the array_intersect() function. From there you can count the result. Don't worry about speed until you know you have a problem. The built-in functions execute much faster than anything you'll be able to write in PHP.

acrosman
+2  A: 

I don't know a great deal about PHP so you may get a more specific answer from others, but I'd like to present a more language-agnostic approach.

By checking every element in A against every element in B, it is indeed O(n²) [I'll assume the arrays are of identical length here to simplify the equations, but the same reasoning holds for arrays of differing lengths].

If you were to sort the data in both arrays, you could reduce the time complexity to O(n log n) or similar, depending on the algorithm chosen.

But you need to keep in mind that the complexity only really becomes important for larger data sets. If those two arrays you gave were typical of the size, I would say don't sort it, just use the "compare everything with everything" method - sorting won't give you enough of an advantage over that. Arrays of 50 elements would still only give you 2,500 iterations (whether that's acceptable to PHP, I don't know, it would certainly be water off a duck's back for C and other compiled languages).

And before anyone jumps in and states that you should plan for larger data sets just in case, that's YAGNI, as unnecessary as premature optimization. You may never need it in which case you've wasted time that would have been better spent elsewhere. The time to implement that would be when it became a problem (that's my opinion of course, others may disagree).

If the data sets really are large enough to make the O(n²) approach unworkable, I think sorting and then walking through the arrays in parallel is probably your best bet.
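That sort-then-walk approach can be sketched in PHP as follows (a sketch, not the poster's code; it assumes the values within each array are distinct, as in the question's example):

```php
<?php
// Count common elements by sorting both arrays and walking them in parallel.
// O(n log n + m log m) for the sorts, O(n + m) for the walk.
function countCommonSorted(array $a, array $b) {
    sort($a);
    sort($b);
    $i = 0;
    $j = 0;
    $count = 0;
    while ($i < count($a) && $j < count($b)) {
        if ($a[$i] < $b[$j]) {
            $i++;                 // smallest value is only in A: advance A
        } elseif ($a[$i] > $b[$j]) {
            $j++;                 // smallest value is only in B: advance B
        } else {
            $count++;             // equal: one common value found
            $i++;
            $j++;
        }
    }
    return $count;
}

$n = countCommonSorted(array(13, 4, 59, 38, 9, 69, 72, 93, 1, 3, 5),
                       array(29, 72, 21, 3, 6));
// $n is 2
```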

One other possibility: if the range of numbers is not too big, then your proposed solution of a vector of booleans is quite workable, since that would be O(n): walking both arrays to populate the vector, followed by comparisons of fixed locations within the two vectors. But I'm assuming your range is too large, or you wouldn't have already mentioned the 400K requirement. Again, the size of the data sets will dictate whether or not that's worth doing.

paxdiablo
My range is too large; anyway, good analysis :)
avastreg
+3  A: 

Depending on your data (size), you might want to use array_intersect_key() instead of array_intersect(). Apparently the implementation of array_intersect (testing PHP 5.3) does not use any optimization or caching whatsoever, but loops through the array and compares the values one by one for each element in array A. A hashtable lookup is incredibly faster than that.

<?php
function timefn($fn) {
    static $timer = array();
    if (is_null($fn)) {
        return $timer;
    }
    $x = range(1, 120000);
    $y = range(2, 100000);
    foreach ($y as $k => $v) { if (0 === $k % 3) unset($y[$k]); }

    $s = microtime(true);
    $fn($x, $y);
    $e = microtime(true);

    @$timer[$fn] += $e - $s;
}

function fnIntersect($x, $y) {
    $z = count(array_intersect($x,$y));
}

function fnFlip($x, $y) {
    $x = array_flip($x);
    $y = array_flip($y);
    $z = count(array_intersect_key($x, $y));
}


for ($i=0; $i<3; $i++) {
    timefn( 'fnIntersect' );
    timefn( 'fnFlip' );
}

print_r(timefn(null));

prints

Array
(
    [fnIntersect] => 11.271192073822
    [fnFlip] => 0.54442691802979
)
which means the array_flip/intersect_key method is ~20 times faster on my notebook. (as usual: this is an ad hoc test. If you spot an error, tell me ...I'm expecting that ;-) )

VolkerK
Interesting, but one of the two arrays varies from 70k down to 30k elements, so I think the array_flip is the bottleneck; I tested this following your suggestion, but it came out a little slower :(
avastreg
Strange, because my test uses ~100k elements for both arrays. It has a higher memory peak, so this _might_ be an issue if your server's memory/heap is under stress. I'll revise the test...
VolkerK
I mean that it is slower compared to the SQL query, not compared to plain array_intersect. Sorry for omitting that information.
avastreg
Ah, I see ;-) Phew... and I spent ~30 min trying to come up with a test case that "ticks off" the array_flip method. Yes, I would be surprised if any method that involves first transferring _all_ records from the DB server to the PHP process's memory and then comparing them could be faster than doing it "within" the database.
VolkerK