views:

67953

answers:

26

I must check approximately 1000 numbers against 1000 other numbers.

I loaded both and compared them server-side:

foreach( $numbers1 as $n1 ) {
  foreach( $numbers2 as $n2 ) {
    if( $n1 == $n2 ) {
      doBla();
    }
  }
}

This took a long time, so I tried doing the same comparison client-side: I loaded both lists into two hidden div elements and compared them using JavaScript. The page still takes 45 seconds to load that way.

I do not need to load the numbers that are not the same.

Is there a faster algorithm? I am thinking of comparing them database-side and loading just the error numbers, then doing an Ajax call for the remaining non-error numbers. But is a MySQL database fast enough for that?

Thank you; any help would be very much appreciated.

+6  A: 

Sort them first.
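For example, a rough sketch of the idea in PHP - sort both arrays, then walk them together (assuming doBla() takes no arguments, as in the question):

sort($numbers1);
sort($numbers2);

$i = 0;
$j = 0;
while ($i < count($numbers1) && $j < count($numbers2)) {
    if ($numbers1[$i] == $numbers2[$j]) {
        doBla();   // equal values: a match
        $i++; $j++;
    } elseif ($numbers1[$i] < $numbers2[$j]) {
        $i++;      // advance the side holding the smaller value
    } else {
        $j++;
    }
}

Two O(n log n) sorts plus one linear walk, instead of a million comparisons.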

JRL
+24  A: 

In database terms, this is a join of 1000 rows to another 1000 rows. Any modern DB can handle this:

select x from table1
inner join table2
on table1.x = table2.y

where table1 and table2 are the tables concerned; they could even be the same table.

Preet Sangha
+1, as Preet says, be careful that it is a modern DB, say... post 1974 :P
Unreason
I'd love to know how the CODASYL DBs would have handled this.
Preet Sangha
CODASYL and IMS would have a harder time chewing on SQL, indeed. 1974 was a reference to the year SQL first appeared.
Unreason
I guessed that was what 1974 was for... I was imagining Dr. Codd writing an alpha SQL dialect on an IMS system..... damn, it's late.
Preet Sangha
I'm not sure why there's a presumption that the elements of these arrays are coming from a database. Is it because he's demoed an example in PHP and JavaScript? The source can be anything, really.
Srdjan Pejic
Yes, and the solution can be anything, really, as it was tagged SQL as well.
haylem
This will *entirely* mitigate the page loading issue, as the database is certainly going to do this much faster than the JavaScript interpreter, even if it were doing it inefficiently.
Stefan Kendall
+113  A: 
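Sort both arrays first, then walk through them together: compare the current elements, advance whichever side holds the smaller value, and call doBla() when they are equal. The two sorts cost O(n log n) and the combined walk is linear, which is far less work than the million comparisons the nested loops perform.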
Pointy
Sorting itself will take a good amount of time. From the question it is NOT clear to me whether this is a one-time activity.
RIPUNJAY TRIPATHI
If it is NOT, then: if A is of size n and B of size m, it will take n*lg(n) + m*lg(m) + min(m, n) time, while the OP's approach is just m*n...
RIPUNJAY TRIPATHI
If *m* and *n* are both large - as stipulated in the question - then the sorts are absolutely, definitely going to be faster. Do the math! Sorting a 1000-element array is about 3000 operations, so that's 3000+3000+1000. But 1000 * 1000 is *100 times as much work*.
Pointy
Yes, but for two arrays of size 1000, that is 1000*lg(1000) + 1000*lg(1000) + 1000, which is ~ 2000*10 + 1000 ~ 21000. The OP's method, at m*n time, is 1000^2, which is 1000000. 21000 << 1000000. This is why asymptotic evaluation matters: n*lg(n) = o(n^2) and n*lg(n) = O(m*n), implying that Pointy's method is quite a bit faster, as lg(n) << n.
nearlymonolith
Right - oh and I used base-10 log in my example, so that's a little off, but not by much.
Pointy
This code doesn't reproduce what OP's code does. If `x` appeared `x1` times and `x2` times in list 1 and 2, respectively, `doBla()` should be run `x1 * x2` times, which isn't the case here.
Kache4
@Kache4 that would be a fairly simple change - I'll leave it as an exercise for the reader :-)
Pointy
Assuming .sort() is quicksort, if the data happens to fall into the worst case then the sorting may be very slow. That's not to say this isn't the best answer, but it's worth checking the docs for the .sort method. Also, no one has said the OP should run the various options through a profiler on some realistic datasets, but it is worth doing.
snim2
@snim2 As long as the quicksort is written intelligently (say, with an insertion or other sort taking over for small values of n), even the worst case shouldn't be all that terrible. I would hope any language library quicksort would be written in such a way, but no harm in checking it out.
Nate Bundy
I like this solution the best, actually. It seems to make the most sense.
Srdjan Pejic
Why do you need to sort? That's O(n log n). Even if you have no built-in set operations, populating a hash with $numbers1 and then looping over $numbers2 with an `if ($n1hash[$n2]) doBla();` would be O(n).
Martin DeMello
Hash tables are not magic. They're not *really* constant time.
Pointy
@Martin DeMello has the idea. You get an algorithmically better solution if you just use some sort of set operation with hashes. As soon as you hit a collision, you've found a match. O(2n) = O(n) vs O(2n log n + n) = O(n log n)
Stefan Kendall
Regardless, a server-side solution is probably better.
Stefan Kendall
@Stefan hashing is not constant-time. You have to account for hash collisions, which are not free.
Pointy
@Pointy: I don't think *you* understand hashing. Given sufficient initial memory capacity (10000/50000 would probably do okay, depending on the hashing algorithm), collisions are unlikely. The more memory you use, the less likely it is you'll hit a collision. And the hashing step is pretty much constant time, as it would likely just use some basic math. Hell, if you knew the range, you could create a boolean array of size [1..N], where N is the maximum possible value. Now you have a better O(n) algorithm in time at the expense of memory.
Stefan Kendall
The hashing solution, then, is the compromise of memory to speed of the [1..N] boolean array approach. Perhaps you should read up on hashing algorithms and collision frequencies before making claims unbacked by experimental or theoretical proof.
Stefan Kendall
@Stefan The point Pointy is making (pun obviously intended) is that you're comparing asymptotic worst-case evaluation with asymptotic expected-case evaluation. The two are simply disparate - I agree that with a universal hash function, and especially such relatively small arrays containing numbers from an implicitly bounded range, you would get faster results with the hashing method. However, the worst-case time will be equivalent (the case of all collisions), unless your fallback is some self-balancing tree that guarantees log n lookup, or unless you're implying that we use perfect hashing...
nearlymonolith
@Stefan oh I know how hashing works. What about two lists of 50,000 numbers each? 100,000? What if they're 64-bit floating-point values? For small lists (and 1000 really isn't very big), it won't matter as long as something better than polynomial is used. For very large lists, the memory cost of hashing becomes a problem. There are good, old methods for external sorting.
Pointy
None of which are likely to be used by IE's JavaScript interpreter, which makes my argument for server-side handling stronger. Regardless, 100,000 * 64 bits (8 bytes, unlikely in the JavaScript interpreter, mind you) = 800,000 bytes =~ 800KB. Absolutely nothing.
Stefan Kendall
Furthermore, if *any* conditions about the data set are known, you can choose a hashing algorithm that performs well on average for the possible inputs of the data set. In the other extreme case, where the data is perfectly random between 0 and max_int, or whatever, you can just mod the number and ignore the hash-step entirely, as the input will be better than a hash output of that number.
Stefan Kendall
@Stefan yes I agree - with numeric quantities the hash function itself is (or can be) cheap. And all things being equal, I'd do it on the server too, unless it "naturally" belonged in the client. I don't know php however :-)
Pointy
Why do not-very-smart questions get so many upvotes?
Armen Tsirunyan
It's probably a statement of programmers at large having a poor understanding of algorithms. And not knowing a language is no reason to switch your solution or promote a bad algorithm. An algorithm is independent of language.
Stefan Kendall
+76  A: 
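You could use PHP's built-in array functions rather than looping yourself, along these lines:

$same = array_intersect($numbers1, $numbers2); // values present in both arrays
$diff = array_diff($numbers1, $numbers2);      // values in $numbers1 missing from $numbers2

The returned arrays hold the same/diff results, which you can then feed to doBla().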
Phill Pafford
However, both would be O(n log n) at best.
dhruvbird
besides, they don't do what the original code does
dhruvbird
@dhruvbird the question asked 'I must check approximately 1000 numbers against 1000 other numbers.' The PHP functions above do as asked. The returned output should be the desired result, which the user can easily manipulate to take advantage of their doBla().
Phill Pafford
"however, both would be O(n.log n) at best" theoretically true, but calling a hashtable lookup O(log n) doesn't really do it justice. In practice it behaves more like O(1)
CodeInChaos
@dhruvbird, O(n log n) is the best you can do for this problem.
luqui
@phill: But the OP's code can potentially print n^2 numbers, which implies that the complexity can NOT be less than O(n^2).
dhruvbird
@codeInChaos: I agree with you. If using a hash table for both these operations, the complexity becomes O(n).
dhruvbird
@dhruvbird I agree with you on the OP's code, but the question might not reflect the code; hence the question is asked. Looking at just the question and ignoring the code - "How can I compare two sets of 1000 numbers against each other?" - I see no problem in using PHP's built-in functions to compare two arrays and return either same/diff results, at which point the user's nested foreach loops become obsolete.
Phill Pafford
A: 

Would it be possible to put these numbers into two database tables, and then do an INNER JOIN? This will be very efficient and provide only the numbers which are contained in both tables. This is a perfect task for a database.
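If the numbers are not already in tables, a rough sketch of one way to get them there (this assumes a PDO connection in $pdo; the temporary-table and column names are made up):

<?php
// Sketch only: push both lists into temporary tables, then let MySQL do the join.
$pdo->exec("CREATE TEMPORARY TABLE n1 (x INT, INDEX(x))");
$pdo->exec("CREATE TEMPORARY TABLE n2 (x INT, INDEX(x))");

$ins1 = $pdo->prepare("INSERT INTO n1 (x) VALUES (?)");
foreach ($numbers1 as $n) $ins1->execute(array($n));

$ins2 = $pdo->prepare("INSERT INTO n2 (x) VALUES (?)");
foreach ($numbers2 as $n) $ins2->execute(array($n));

// Only the numbers present in both tables come back.
$both = $pdo->query("SELECT n1.x FROM n1 INNER JOIN n2 ON n1.x = n2.x")
            ->fetchAll(PDO::FETCH_COLUMN);
?>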

m.edmondson
Not sure why this was downvoted.
rlb.usa
If these were in RAM then why go through all the trouble to write out to disk, etc.?
Jared Updike
+22  A: 

Maybe just intersect the array values to find numbers existing in both arrays?

$result = array_intersect($numbers1, $numbers2);
foreach ($result as $val)
  doBla();
sod
+2  A: 

Your code is simply more complicated than it needs to be.

Assuming what you're looking for is that the numbers in each position match (and not just that the array contains the same numbers), you can flatten your loop to a single for.

<?php
// Fill two arrays with random numbers as proof.
$first_array = array();
$second_array = array();
for ($i = 0; $i < 1000; $i++) $first_array[$i] = rand(0, 1000);
for ($i = 0; $i < 1000; $i++) $second_array[$i] = rand(0, 1000);

// The loop you care about: one pass, comparing position by position.
for ($i = 0; $i < 1000; $i++) {
    if ($first_array[$i] != $second_array[$i]) {
        echo "Error at $i: first_array was {$first_array[$i]}, second was {$second_array[$i]}<br>";
    }
}
?>

Using the code above, you will only loop 1000 times, as opposed to looping 1000000 times.

Now, if you need to just check that a number appears or does not appear in the arrays, use array_diff and array_intersect as follows:

<?php
// Fill two arrays with random numbers as proof.
$first_array = array();
$second_array = array();
for ($i = 0; $i < 1000; $i++) $first_array[$i] = rand(0, 1000);
for ($i = 0; $i < 1000; $i++) $second_array[$i] = rand(0, 1000);

// Values present in both arrays, and values only in the first.
$matches = array_intersect($first_array, $second_array);
$differences = array_diff($first_array, $second_array);
?>
Jack Shedd
+9  A: 

If you sort list2 first and then do a binary search for each number in list1 you'll see a huge speed increase.

I'm not a PHP guy, but this should give you the idea:

sort($numbers2);

foreach ($numbers1 as $n1) {
    if (binarySearch($numbers2, $n1) >= 0) {
        doBla();
    }
}

Obviously not being a PHP guy I don't know the library, but I'm sure sorting and binary searching should be easy enough to find.

Note: In case you're not familiar with a binary search; you're sorting list2 because binary searches need to operate on sorted lists.
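For reference, PHP has no built-in binary search, so a minimal sketch of one could be:

// Returns the index of $needle in the sorted array $haystack, or -1 if absent.
function binarySearch(array $haystack, $needle) {
    $lo = 0;
    $hi = count($haystack) - 1;
    while ($lo <= $hi) {
        $mid = intdiv($lo + $hi, 2);
        if ($haystack[$mid] == $needle) return $mid;
        if ($haystack[$mid] < $needle) {
            $lo = $mid + 1;
        } else {
            $hi = $mid - 1;
        }
    }
    return -1;
}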

Giovanni Galbo
If the desired behavior is exactly as implemented in the original question, in the case where $numbers1 has three 1's and $numbers2 has two 1's, doBla() would be run 6 times. In this implementation it would be run only 3 times.
RedDeckWins
Unfortunately, there's an error on line 1. PHP's sort() will change $numbers2 in situ, and return either true or false. http://es.php.net/sort
Adam
@Adam Thanks, fixed it. If anyone that knows PHP wants to fix my example go ahead, it's open source ;)
Giovanni Galbo
+4  A: 
$same_numbers = array_intersect($numbers1, $numbers2);

foreach($same_numbers as $n)
{
  doBla();
}
Cesar
+2  A: 

Maybe I'm not seeing something here, but this looks like a classic case of set intersection. Here are a few lines of Perl that'll do it:

foreach $e (@a, @b) { $union{$e}++ && $isect{$e}++ }

@union = keys %union; @isect = keys %isect;

At the end of these lines of code, @isect will contain all numbers that are in both @a and @b. I'm sure this is translatable to PHP more or less directly. FWIW, this is my favorite piece of code from the Perl Cookbook.

Shahbaz
+1  A: 

I think it would be much easier to use the built-in array_intersect function. Using your example, you could do:

$results = array_intersect($numbers1, $numbers2);
foreach($results as $rk => $rv) {
    doSomething($rv);
}
Dave
+3  A: 

Sort both lists, then walk both lists at the same time using the old-master/new-master sequential update pattern. As long as you can sort the data, this is the fastest way, since you're really only walking each list once, up to the length of the larger list.

skamradt
+1 for pointing to a COBOL reference. :-)
ebneter
@ebneter I find it much more readable than the RPG II reference. :)
skamradt
+1  A: 

A better way would be to do something like this:

// 1. Create a hash map from one of the lists.
var hm = { };
for (var i in list1) {
  if (!hm[list1[i]]) {
    hm[list1[i]] = 1;
  } else { hm[list1[i]] += 1; }
}

// 2. Lookup each element in the other list.
for (var i in list2) {
  if (hm[list2[i]] >= 1) {
    for (var j = 0; j < hm[list2[i]]; ++j) {
      doBla();
    }
  }
}

This is guaranteed O(n) [assuming insertion and lookup in a hash map are O(1) amortized].

Update: The worst case of this algorithm is O(n^2), and there is no way to reduce that without changing the semantics of the program: in the worst case, where all the numbers in both lists are the same, it must call doBla() n^2 times. However, if the numbers within each list are generally unique, the runtime tends towards O(n).

dhruvbird
If worst case runtime of your algorithm is the same as runtime of the original one, you're losing out on readability. There's not enough detail in the OP to figure out what is actually needed and what is the composition of the arrays, anyway.
Srdjan Pejic
@srdjan: I think it is pretty apparent that the OP wants to call doBla() whenever a number from array 1 matches a number in array 2. His solution is Theta(n^2), whereas this one is O(n^2) and Omega(n).
dhruvbird
+16  A: 

What you have shouldn't take that long - what does doBla() do? I suspect that is what's taking the time. Comparing two sets of 1,000,000 numbers with the same algorithm takes no time at all.

This is hilarious - look at the number of optimisation techniques offered as answers. The problem is not your algorithm; it is whatever doBla() does that is taking the time, by a factor many times greater than any optimisation would help :)
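A quick way to check (a rough sketch): time the comparison with doBla() stubbed out.

$start = microtime(true);
foreach ($numbers1 as $n1) {
    foreach ($numbers2 as $n2) {
        if ($n1 == $n2) {
            // doBla(); // stubbed out for the timing test
        }
    }
}
printf("Comparison alone took %.3f seconds\n", microtime(true) - $start);

If that finishes in well under a second, the time is going into doBla().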

Mrk Mnl
I wonder why you got down-voted? You're right, of course - even the brute-force comparison he's got ought to be reasonably fast. He must be calling doBla a lot of times in the typical case, or it takes a long time to execute every time...
Mark Bessey
Thanks, let's see if they find out...
Mrk Mnl
+6  A: 

I'm not a PHP expert, so this may need some debugging, but you can do this easily in O(n) time:

// Load one array into a hashtable, keyed by the number: O(n).
$keys1 = [];
foreach($numbers1 as $n1) $keys1[$n1] = true;

// Find the intersections with the other array:
foreach($numbers2 as $n2) { // O(n)
  if (isset($keys1[$n2])) { // O(1)
     doBla();
  }
}

Regardless, the intersection isn't where your time is going. Even a bad O(n^2) implementation like you have now should be able to go through 1000 numbers in a second.

munificent
A: 

Can you use a centralized file between server and client?

retore
+3  A: 

Stop - why are you doing this?

If the numbers are already in a SQL database, then do a join and let the DB figure out the most efficient route.

If they aren't in a database, then I'm betting you've gone off track somewhere and really ought to reconsider how you got here.

dethSwatch
All we know is he has 2000 numbers; there are endless possibilities as to where and how he got them that don't involve a SQL database.
nos
He did mention doing the work in the database, and then wondered if MySQL was sturdy enough to handle the load ;-)
RBerteig
+2  A: 

You can do it in O(n) time if you use bucket sort, assuming you know the maximum value the numbers can take (although there are ways around that).

http://en.wikipedia.org/wiki/Bucket_sort
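In the same spirit, if the values are non-negative integers below some known bound (the $maxValue here is an assumption), a simple flag array finds the matches without even sorting - a sketch:

$maxValue = 100000; // assumed known upper bound on the values
$seen = array_fill(0, $maxValue + 1, false);
foreach ($numbers1 as $n) $seen[$n] = true;
foreach ($numbers2 as $n) {
    if ($seen[$n]) doBla();
}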

ashanan
A: 
  1. Create two duplicate collections, preferably ones with fast lookup times, like HashSet or perhaps TreeSet. Avoid lists, as they have very poor lookup times.

  2. As you find elements, remove them from both sets. This can reduce lookup times by leaving fewer elements to sift through in later searches (a rough sketch follows below).
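In PHP, for instance, using array keys as the set (a sketch; doBla() as in the question):

$set = array_flip($numbers2); // array keys give near-constant-time lookups
foreach ($numbers1 as $n1) {
    if (isset($set[$n1])) {
        doBla();
        unset($set[$n1]); // shrink the set as matches are found
    }
}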

Edwin Buck
A: 

If you're trying to get a list of numbers without any duplicates, you can use a hash:

$unique = array();
foreach ($list1 as $num) {
  $unique[$num] = $num;
}
foreach ($list2 as $num) {
  $unique[$num] = $num;
}
$unique = array_keys($unique);

It's going to be slightly (very slightly) slower than the array walk method, but it's cleaner in my opinion.

brianloveswords
+1  A: 

I'll create a GUI interface in Visual Basic, see if I can track the numbers

edahs
Desperate times require desperate measures.
slomojo
A: 

Mergesort both lists, start at the beginning of both, and then search through each list for similar numbers at the same time.

In PHP-flavoured code, it would go something like this...

sort($A); // PHP's built-in sort rather than an explicit mergesort
sort($B);

$Apos = 0;
$Bpos = 0;

// While you have not reached the end of either list...
while ($Apos != count($A) && $Bpos != count($B)) {
    if ($A[$Apos] == $B[$Bpos]) {
        // Found a match; advance both sides so the loop makes progress.
        doSomething();
        $Apos++;
        $Bpos++;
    } elseif ($A[$Apos] > $B[$Bpos]) {
        // B is lower than A, so have B try to catch up to A.
        $Bpos++;
    } else {
        // A is lower than B, so have A try to catch up to B.
        $Apos++;
    }
}

If I'm right, the speed of this algorithm is O(n log n).

waffles
A: 

I'm not sure why Mrk Mnl was downvoted, but the function call is the overhead here.

Push the matched numbers into another array and call doBla() on them after the comparisons. As a test, comment out doBla() and see if you are still experiencing the same performance issue.
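A minimal sketch of that separation:

$matches = array();
foreach ($numbers1 as $n1) {
    foreach ($numbers2 as $n2) {
        if ($n1 == $n2) $matches[] = $n1; // collect now...
    }
}
foreach ($matches as $m) {
    doBla(); // ...pay the function-call overhead separately
}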

Martin Blank
A: 

Merge, sort, then count:

<?php
$first = array('1001', '1002', '1003', '1004', '1005');
$second = array('1002', '1003', '1004', '1005', '1006');
// $first is merged in twice on purpose: values in both lists end up with a
// count of 3, values only in $first with 2, values only in $second with 1.
$merged = array_merge($first, $first, $second);
sort($merged);
print_r(array_count_values($merged));
?>

Output - the values with a count of three are the ones you want:

Array
(
    [1001] => 2
    [1002] => 3
    [1003] => 3
    [1004] => 3
    [1005] => 3
    [1006] => 1
)
jaymz
A: 

This code will call doBla() once for each time a value in $numbers1 is found in $numbers2:

// get [val => occurrences, ...] for $numbers2
$counts = array_count_values($numbers2);
foreach ($numbers1 as $n1) {
    // if $n1 occurs in $numbers2...
    if (isset($counts[$n1])) {
        // call doBla() once for each occurrence
        for ($i=0; $i < $counts[$n1]; $i++) {
            doBla();
        }
    }
}

If you only need to call doBla() once if a match is found:

foreach ($numbers1 as $n1) {
    if (in_array($n1, $numbers2))
        doBla();
}

If $numbers1 and $numbers2 will only contain unique values, or if the number of times any specific value occurs in both arrays is not important, array_intersect() will do the job:

$dups = array_intersect($numbers1, $numbers2);
foreach ($dups as $n)
    doBla();

I agree with several earlier posts that the calls to doBla() are probably taking more time than iterating over the arrays.

gregjor
A: 

Use bucket sort for best results.

Elad Karako