views:

67953

answers:

26

I must check approximately 1000 numbers against 1000 other numbers.

I loaded both and compared them server-side:

foreach( $numbers1 as $n1 ) {
  foreach( $numbers2 as $n2 ) {
    if( $n1 == $n2 ) {
      doBla();
    }
  }
}

This took a long time, so I tried doing the same comparison client-side: I loaded both lists into two hidden div elements and compared them using JavaScript. The page still takes 45 seconds to load that way.

I do not need to load the numbers that are not the same.

Is there a faster algorithm? I am thinking of comparing them database-side and loading just the error numbers, then doing an Ajax call for the remaining non-error numbers. But is a MySQL database fast enough for that?

Thank you; any help would be very much appreciated.

+6  A: 

Sort them first.
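For example, a rough sketch of the idea in PHP - sort both arrays, then walk them together (assuming doBla() takes no arguments, as in the question):

sort($numbers1);
sort($numbers2);

$i = 0;
$j = 0;
while ($i < count($numbers1) && $j < count($numbers2)) {
    if ($numbers1[$i] == $numbers2[$j]) {
        doBla();   // equal values: a match
        $i++; $j++;
    } elseif ($numbers1[$i] < $numbers2[$j]) {
        $i++;      // advance the side holding the smaller value
    } else {
        $j++;
    }
}

Two O(n log n) sorts plus one linear walk, instead of a million comparisons.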

JRL
+24  A: 

In database terms, this is a join of 1000 rows to another 1000 rows. Any modern DB can handle this:

select x from table1
inner join table2
on table1.x = table2.y

where table1 and table2 are the tables concerned; they could even be the same table.

Preet Sangha
+1, as Preet says, be careful that it is a modern DB, say... post 1974 :P
Unreason
I'd love to know how the CODASYL DBs would have handled this.
Preet Sangha
CODASYL and IMS would have a harder time chewing on SQL, indeed. 1974 was a reference to the year SQL first appeared.
Unreason
I guessed that was what 1974 was for... I was imagining Dr. Codd writing an alpha SQL dialect on an IMS system..... damn, it's late.
Preet Sangha
I'm not sure why there's a presumption that the elements of these arrays are coming from a database. Is it because he's demoed an example in PHP and JavaScript? The source can be anything, really.
Srdjan Pejic
Yes, and the solution can be anything, really, as it was tagged SQL as well.
haylem
This will *entirely* mitigate the page loading issue, as the database is certainly going to do this much faster than the JavaScript interpreter, even if it were doing it inefficiently.
Stefan Kendall
+113  A: 
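Sort both arrays first, then walk through them together: compare the current elements, advance whichever side holds the smaller value, and call doBla() when they are equal. The two sorts cost O(n log n) and the combined walk is linear, which is far less work than the million comparisons the nested loops perform.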
Pointy
Sorting itself will take a good amount of time. From the question it is NOT clear to me whether this is a one-time activity.
RIPUNJAY TRIPATHI
If it is NOT, then: if A is of size n and B of size m, it will take n*lg(n) + m*lg(m) + min(m, n) time, while the OP's approach is just m*n...
RIPUNJAY TRIPATHI
If *m* and *n* are both large - as stipulated in the question - then the sorts are absolutely, definitely going to be faster. Do the math! Sorting a 1000-element array is about 3000 operations, so that's 3000+3000+1000. But 1000 * 1000 is *100 times as much work*.
Pointy
Yes, but for two arrays of size 1000, that is 1000*lg(1000) + 1000*lg(1000) + 1000, which is ~ 2000*10 + 1000 ~ 21000. The OP's method, at m*n time, is 1000^2, which is 1000000. 21000 << 1000000. This is why asymptotic evaluation matters: n*lg(n) = o(n^2) and n*lg(n) = O(m*n), implying that Pointy's method is quite a bit faster, as lg(n) << n.
nearlymonolith
Right - oh and I used base-10 log in my example, so that's a little off, but not by much.
Pointy
This code doesn't reproduce what OP's code does. If `x` appeared `x1` times and `x2` times in list 1 and 2, respectively, `doBla()` should be run `x1 * x2` times, which isn't the case here.
Kache4
@Kache4 that would be a fairly simple change - I'll leave it as an exercise for the reader :-)
Pointy
Assuming .sort() is quicksort, if the data happens to fall into the worst case then the sorting may be very slow. That's not to say this isn't the best answer, but it's worth checking the docs for the .sort method. Also, no one has said the OP should run the various options through a profiler on some realistic datasets, but it is worth doing.
snim2
@snim2 As long as the quicksort is written intelligently (say, with an insertion or other sort taking over for small values of n), even the worst case shouldn't be all that terrible. I would hope any language library quicksort would be written in such a way, but no harm in checking it out.
Nate Bundy
I like this solution the best, actually. It seems to make the most sense.
Srdjan Pejic
Why do you need to sort? That's O(n log n). Even if you have no built-in set operations, populating a hash with $numbers1 and then looping over $numbers2 with an `if ($n1hash[$n2]) doBla();` would be O(n).
Martin DeMello
Hash tables are not magic. They're not *really* constant time.
Pointy
@Martin DeMello has the idea. You get an algorithmically better solution if you just use some sort of set operation with hashes. As soon as you hit a collision, you've found a match. O(2n) = O(n) vs O(2n log n + n) = O(n log n)
Stefan Kendall
Regardless, a server-side solution is probably better.
Stefan Kendall
@Stefan hashing is not constant-time. You have to account for hash collisions, which are not free.
Pointy
@Pointy: I don't think *you* understand hashing. Given sufficient initial memory capacity (10000/50000 would probably do okay, depending on the hashing algorithm), collisions are unlikely. The more memory you use, the less likely it is you'll hit a collision. And the hashing step is pretty much constant time, as it would likely just use some basic math. Hell, if you knew the range, you could create a boolean array of size [1..N], where N is the maximum possible value. Now you have a better O(n) algorithm in time at the expense of memory.
Stefan Kendall
The hashing solution, then, is the compromise of memory to speed of the [1..N] boolean array approach. Perhaps you should read up on hashing algorithms and collision frequencies before making claims unbacked by experimental or theoretical proof.
Stefan Kendall
@Stefan The point Pointy is making (pun obviously intended) is that you're comparing asymptotic worst-case evaluation with asymptotic expected-case evaluation. The two are simply disparate - I agree that with a universal hash function, and especially such relatively small arrays containing numbers from an implicitly bounded range, you would get faster results with the hashing method. However, the worst-case time will be equivalent (the case of all collisions), unless your fallback is some self-balancing tree that guarantees log n lookup, or unless you're implying that we use perfect hashing...
nearlymonolith
@Stefan oh I know how hashing works. What about two lists of 50,000 numbers each? 100,000? What if they're 64-bit floating-point values? For small lists (and 1000 really isn't very big), it won't matter as long as something better than polynomial is used. For very large lists, the memory cost of hashing becomes a problem. There are good, old methods for external sorting.
Pointy
None of which are likely to be used by IE's JavaScript interpreter, which makes my argument for server-side handling stronger. Regardless, 100,000 * 64 bits (8 bytes, unlikely in the JavaScript interpreter, mind you) = 800,000 bytes =~ 800KB. Absolutely nothing.
Stefan Kendall
Furthermore, if *any* conditions about the data set are known, you can choose a hashing algorithm that performs well on average for the possible inputs of the data set. In the other extreme case, where the data is perfectly random between 0 and max_int, or whatever, you can just mod the number and ignore the hash-step entirely, as the input will be better than a hash output of that number.
Stefan Kendall
@Stefan yes I agree - with numeric quantities the hash function itself is (or can be) cheap. And all things being equal, I'd do it on the server too, unless it "naturally" belonged in the client. I don't know php however :-)
Pointy
Why do not-very-smart questions get so many upvotes?
Armen Tsirunyan
It's probably a statement of programmers at large having a poor understanding of algorithms. And not knowing a language is no reason to switch your solution or promote a bad algorithm. An algorithm is independent of language.
Stefan Kendall
+76  A: 
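You could use PHP's built-in array functions rather than looping yourself, along these lines:

$same = array_intersect($numbers1, $numbers2); // values present in both arrays
$diff = array_diff($numbers1, $numbers2);      // values in $numbers1 missing from $numbers2

The returned arrays hold the same/diff results, which you can then feed to doBla().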
Phill Pafford
However, both would be O(n log n) at best.
dhruvbird
besides, they don't do what the original code does
dhruvbird
@dhruvbird the question asked 'I must check approximately 1000 numbers against 1000 other numbers.' The PHP functions above do as asked. The returned output should be the desired result, which the user can easily manipulate to take advantage of their doBla().
Phill Pafford
"however, both would be O(n.log n) at best" theoretically true, but calling a hashtable lookup O(log n) doesn't really do it justice. In practice it behaves more like O(1)
CodeInChaos
@dhruvbird, O(n log n) is the best you can do for this problem.
luqui
@phill: But the OP's code can potentially print n^2 numbers, which implies that the complexity can NOT be less than O(n^2).
dhruvbird
@codeInChaos: I agree with you. If using a hash table for both these operations, the complexity becomes O(n).
dhruvbird
@dhruvbird I agree with you on the OP's code, but the question might not reflect the code; hence the question is asked. Looking at just the question and ignoring the code - "How can I compare two sets of 1000 numbers against each other?" - I see no problem in using PHP's built-in functions to compare two arrays and return either same/diff results, at which point the user's nested foreach loops become obsolete.
Phill Pafford
A: 

Would it be possible to put these numbers into two database tables, and then do an INNER JOIN? This will be very efficient and provide only the numbers which are contained in both tables. This is a perfect task for a database.
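If the numbers are not already in tables, a rough sketch of one way to get them there (this assumes a PDO connection in $pdo; the temporary-table and column names are made up):

<?php
// Sketch only: push both lists into temporary tables, then let MySQL do the join.
$pdo->exec("CREATE TEMPORARY TABLE n1 (x INT, INDEX(x))");
$pdo->exec("CREATE TEMPORARY TABLE n2 (x INT, INDEX(x))");

$ins1 = $pdo->prepare("INSERT INTO n1 (x) VALUES (?)");
foreach ($numbers1 as $n) $ins1->execute(array($n));

$ins2 = $pdo->prepare("INSERT INTO n2 (x) VALUES (?)");
foreach ($numbers2 as $n) $ins2->execute(array($n));

// Only the numbers present in both tables come back.
$both = $pdo->query("SELECT n1.x FROM n1 INNER JOIN n2 ON n1.x = n2.x")
            ->fetchAll(PDO::FETCH_COLUMN);
?>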

m.edmondson
Not sure why this was downvoted.
rlb.usa
If these were in RAM then why go through all the trouble to write out to disk, etc.?
Jared Updike
+22  A: 

Maybe just intersect the array values to find numbers existing in both arrays?

$result = array_intersect($numbers1, $numbers2);
foreach ($result as $val)
  doBla();
sod
+2  A: 

Your code is simply more complicated than it needs to be.

Assuming what you're looking for is that the numbers in each position match (and not just that the array contains the same numbers), you can flatten your loop to a single for.

<?php
// Fill two arrays with random numbers as proof.
$first_array = array();
$second_array = array();
for ($i = 0; $i < 1000; $i++) $first_array[$i] = rand(0, 1000);
for ($i = 0; $i < 1000; $i++) $second_array[$i] = rand(0, 1000);

// The loop you care about: one pass, comparing position by position.
for ($i = 0; $i < 1000; $i++) {
    if ($first_array[$i] != $second_array[$i]) {
        echo "Error at $i: first_array was {$first_array[$i]}, second was {$second_array[$i]}<br>";
    }
}
?>

Using the code above, you will only loop 1000 times, as opposed to looping 1000000 times.

Now, if you need to just check that a number appears or does not appear in the arrays, use array_diff and array_intersect as follows:

<?php
// Fill two arrays with random numbers as proof.
$first_array = array();
$second_array = array();
for ($i = 0; $i < 1000; $i++) $first_array[$i] = rand(0, 1000);
for ($i = 0; $i < 1000; $i++) $second_array[$i] = rand(0, 1000);

// Values present in both arrays, and values only in the first.
$matches = array_intersect($first_array, $second_array);
$differences = array_diff($first_array, $second_array);
?>
Jack Shedd
+9  A: 

If you sort list2 first and then do a binary search for each number in list1 you'll see a huge speed increase.

I'm not a PHP guy, but this should give you the idea:

sort($numbers2);

foreach ($numbers1 as $n1) {
    if (binarySearch($numbers2, $n1) >= 0) {
        doBla();
    }
}

Obviously not being a PHP guy I don't know the library, but I'm sure sorting and binary searching should be easy enough to find.

Note: In case you're not familiar with a binary search; you're sorting list2 because binary searches need to operate on sorted lists.
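For reference, PHP has no built-in binary search, so a minimal sketch of one could be:

// Returns the index of $needle in the sorted array $haystack, or -1 if absent.
function binarySearch(array $haystack, $needle) {
    $lo = 0;
    $hi = count($haystack) - 1;
    while ($lo <= $hi) {
        $mid = intdiv($lo + $hi, 2);
        if ($haystack[$mid] == $needle) return $mid;
        if ($haystack[$mid] < $needle) {
            $lo = $mid + 1;
        } else {
            $hi = $mid - 1;
        }
    }
    return -1;
}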

Giovanni Galbo
If the desired behavior is exactly as implemented in the original question, in the case where $numbers1 has three 1's and $numbers2 has two 1's, doBla() would be run 6 times. In this implementation it would be run only 3 times.
RedDeckWins
Unfortunately, there's an error on line 1. PHP's sort() will change $numbers2 in situ, and return either true or false. http://es.php.net/sort
Adam
@Adam Thanks, fixed it. If anyone that knows PHP wants to fix my example go ahead, it's open source ;)
Giovanni Galbo
+4  A: 
$same_numbers = array_intersect($numbers1, $numbers2);

foreach($same_numbers as $n)
{
  doBla();
}
Cesar
+2  A: 

Maybe I'm not seeing something here, but this looks like a classic case of set intersection. Here are a few lines of Perl that'll do it:

foreach $e (@a, @b) { $union{$e}++ && $isect{$e}++ }

@union = keys %union; @isect = keys %isect;

At the end of these lines of code, @isect will contain all numbers that are in both @a and @b. I'm sure this is translatable to PHP more or less directly. FWIW, this is my favorite piece of code from the Perl Cookbook.

Shahbaz
+1  A: 

I think it would be much easier to use the built-in array_intersect function. Using your example, you could do:

$results = array_intersect($numbers1, $numbers2);
foreach($results as $rk => $rv) {
    doSomething($rv);
}
Dave
+3  A: 

Sort both lists, then walk both lists at the same time using the old-master/new-master sequential update pattern. As long as you can sort the data, this is the fastest way, since you're really only walking each list once, up to the length of the larger list.

skamradt
+1 for pointing to a COBOL reference. :-)
ebneter
@ebneter I find it much more readable than the RPG II reference. :)
skamradt
+1  A: 

A better way would be to do something like this:

// 1. Create a hash map from one of the lists.
var hm = { };
for (var i in list1) {
  if (!hm[list1[i]]) {
    hm[list1[i]] = 1;
  } else { hm[list1[i]] += 1; }
}

// 2. Lookup each element in the other list.
for (var i in list2) {
  if (hm[list2[i]] >= 1) {
    for (var j = 0; j < hm[list2[i]]; ++j) {
      doBla();
    }
  }
}

This is guaranteed O(n) [assuming insertion and lookup in a hash map are O(1) amortized].

Update: The worst case of this algorithm is O(n^2), and there is no way to reduce that without changing the semantics of the program: in the worst case, where all the numbers in both lists are the same, it must call doBla() n^2 times. However, if the numbers within each list are generally unique, the runtime tends towards O(n).

dhruvbird
If worst case runtime of your algorithm is the same as runtime of the original one, you're losing out on readability. There's not enough detail in the OP to figure out what is actually needed and what is the composition of the arrays, anyway.
Srdjan Pejic
@srdjan: I think it is pretty apparent that the OP wants to call doBla() whenever a number from array 1 matches a number in array 2. His solution is Theta(n^2), whereas this one is O(n^2) and Omega(n).
dhruvbird
+16  A: 

What you have shouldn't take that long - what does doBla() do? I suspect that is what's taking the time. Comparing two sets of 1,000,000 numbers with the same algorithm takes no time at all.

This is hilarious - look at the number of optimisation techniques offered as answers. The problem is not your algorithm; it is whatever doBla() does that is taking the time, by a factor many times greater than any optimisation would help :)
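A quick way to check (a rough sketch): time the comparison with doBla() stubbed out.

$start = microtime(true);
foreach ($numbers1 as $n1) {
    foreach ($numbers2 as $n2) {
        if ($n1 == $n2) {
            // doBla(); // stubbed out for the timing test
        }
    }
}
printf("Comparison alone took %.3f seconds\n", microtime(true) - $start);

If that finishes in well under a second, the time is going into doBla().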

Mrk Mnl
I wonder why you got down-voted? You're right, of course - even the brute-force comparison he's got ought to be reasonably fast. He must be calling doBla a lot of times in the typical case, or it takes a long time to execute every time...
Mark Bessey
Thanks, let's see if they find out...
Mrk Mnl
+6  A: 

I'm not a PHP expert, so this may need some debugging, but you can do this easily in O(n) time:

// Load one array into a hashtable, keyed by the number: O(n).
$keys1 = [];
foreach($numbers1 as $n1) $keys1[$n1] = true;

// Find the intersections with the other array:
foreach($numbers2 as $n2) { // O(n)
  if (isset($keys1[$n2])) { // O(1)
     doBla();
  }
}

Regardless, the intersection isn't where your time is going. Even a bad O(n^2) implementation like you have now should be able to go through 1000 numbers in a second.

munificent
A: 

Can you use a centralized file between server and client?

retore
+3  A: 

Stop - why are you doing this?

If the numbers are already in a SQL database, then do a join and let the DB figure out the most efficient route.

If they aren't in a database, then I'm betting you've gone off track somewhere and really ought to reconsider how you got here.

dethSwatch
All we know is he has 2000 numbers; there are endless possibilities as to where and how he got them that don't involve a SQL database.
nos
He did mention doing the work in the database, and then wondered if MySQL was sturdy enough to handle the load ;-)
RBerteig
+2  A: 

You can do it in O(n) time if you use bucket sort, assuming you know the maximum value the numbers can take (although there are ways around that).

http://en.wikipedia.org/wiki/Bucket_sort
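In the same spirit, if the values are non-negative integers below some known bound (the $maxValue here is an assumption), a simple flag array finds the matches without even sorting - a sketch:

$maxValue = 100000; // assumed known upper bound on the values
$seen = array_fill(0, $maxValue + 1, false);
foreach ($numbers1 as $n) $seen[$n] = true;
foreach ($numbers2 as $n) {
    if ($seen[$n]) doBla();
}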

ashanan
A: 
  1. Create two duplicate collections, preferably ones with fast lookup times, like HashSet or perhaps TreeSet. Avoid lists, as they have very poor lookup times.

  2. As you find elements, remove them from both sets. This can reduce lookup times by leaving fewer elements to sift through in later searches (a rough sketch follows below).
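In PHP, for instance, using array keys as the set (a sketch; doBla() as in the question):

$set = array_flip($numbers2); // array keys give near-constant-time lookups
foreach ($numbers1 as $n1) {
    if (isset($set[$n1])) {
        doBla();
        unset($set[$n1]); // shrink the set as matches are found
    }
}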

Edwin Buck
A: 

If you're trying to get a list of numbers without any duplicates, you can use a hash:

$unique = array();
foreach ($list1 as $num) {
  $unique[$num] = $num;
}
foreach ($list2 as $num) {
  $unique[$num] = $num;
}
$unique = array_keys($unique);

It's going to be slightly (very slightly) slower than the array walk method, but it's cleaner in my opinion.

brianloveswords
+1  A: 

I'll create a GUI interface in Visual Basic, see if I can track the numbers

edahs
Desperate times require desperate measures.
slomojo
A: 

Mergesort both lists, start at the beginning of both, and then search through each list for similar numbers at the same time.

In PHP-flavoured code, it would go something like this...

sort($A); // PHP's built-in sort rather than an explicit mergesort
sort($B);

$Apos = 0;
$Bpos = 0;

// While you have not reached the end of either list...
while ($Apos != count($A) && $Bpos != count($B)) {
    if ($A[$Apos] == $B[$Bpos]) {
        // Found a match; advance both sides so the loop makes progress.
        doSomething();
        $Apos++;
        $Bpos++;
    } elseif ($A[$Apos] > $B[$Bpos]) {
        // B is lower than A, so have B try to catch up to A.
        $Bpos++;
    } else {
        // A is lower than B, so have A try to catch up to B.
        $Apos++;
    }
}

If I'm right, the speed of this algorithm is O(n log n).

waffles
A: 

I'm not sure why Mrk Mnl was downvoted, but the function call is the overhead here.

Push the matched numbers into another array and call doBla() on them after the comparisons. As a test, comment out doBla() and see if you are still experiencing the same performance issue.
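A minimal sketch of that separation:

$matches = array();
foreach ($numbers1 as $n1) {
    foreach ($numbers2 as $n2) {
        if ($n1 == $n2) $matches[] = $n1; // collect now...
    }
}
foreach ($matches as $m) {
    doBla(); // ...pay the function-call overhead separately
}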

Martin Blank
A: 

Merge, sort, then count:

<?php
$first = array('1001', '1002', '1003', '1004', '1005');
$second = array('1002', '1003', '1004', '1005', '1006');
// $first is merged in twice on purpose: values in both lists end up with a
// count of 3, values only in $first with 2, values only in $second with 1.
$merged = array_merge($first, $first, $second);
sort($merged);
print_r(array_count_values($merged));
?>

Output - the values with a count of three are the ones you want:

Array
(
    [1001] => 2
    [1002] => 3
    [1003] => 3
    [1004] => 3
    [1005] => 3
    [1006] => 1
)
jaymz
A: 

This code will call doBla() once for each time a value in $numbers1 is found in $numbers2:

// get [val => occurrences, ...] for $numbers2
$counts = array_count_values($numbers2);
foreach ($numbers1 as $n1) {
    // if $n1 occurs in $numbers2...
    if (isset($counts[$n1])) {
        // call doBla() once for each occurrence
        for ($i=0; $i < $counts[$n1]; $i++) {
            doBla();
        }
    }
}

If you only need to call doBla() once if a match is found:

foreach ($numbers1 as $n1) {
    if (in_array($n1, $numbers2))
        doBla();
}

If $numbers1 and $numbers2 will only contain unique values, or if the number of times any specific value occurs in both arrays is not important, array_intersect() will do the job:

$dups = array_intersect($numbers1, $numbers2);
foreach ($dups as $n)
    doBla();

I agree with several earlier posts that the calls to doBla() are probably taking more time than iterating over the arrays.

gregjor
A: 

Use bucket sort for best results.

Elad Karako