tags:

views:

116

answers:

7

I have two lists and I need to determine if they contain the same values without sorting (ie. order of values is irrelevant). I know sorting the would work, but this is part of a performance critical section.

Item values fall within the range [-2, 63] and we're always comparing equal size lists, but the list sizes range from [1, 8].

Example lists:

A = (0,   0, 4, 23, 10)
B = (23, 10, 0,  4,  0)
C = (0,   0, 4, 27, 10)

A == B is true
A == C is false

I think a possible solution would be to compare the product of the two lists (multiply all values together), but there are problems with this solution. What to do with zero and negative numbers. A workaround would be adding 4 to every value before multiplying. Here's the code I have so far.

bool equal(int A[], int B[], int size)
{
    int sumA = 1;
    int sumB = 1;

    for (int i = 0; i < size; i++) {
        sumA *= A[i] + 4;
        sumB *= B[i] + 4;
    }
    return (sumA == sumB)
}

But, would this always work no matter what the order/contents of the list were? In other words is the following mathematically true? So what I'm really asking is the following (unless there's another way to solve the problem):

Given 2 equal sized lists. If the products (multiplying all values together) of the lists are equal then the lists contain the same values, so long as the values are integers greater than 0.

A: 

Make a copy of the first list. Then loop through the second and as you do remove each item from the copy. If you get all the way through the second list and found all elements in the copy, then the lists have the same elements. This is a lot of looping, but with only max 8 elements in the list, you won't get a performance gain by using a different type of collection.

If you had a lot more items, then has a Dictionary/Hashtable for the copy. Keep a unique key of values with a count of how many times they were found in the first list. This will give you a performance boost on larger lists.

Sam
A: 

Given 2 equal sized lists. If the products (multiplying all values together) of the lists are equal then the lists contain the same values, so long as the values are integers greater than 0.

No. Consider the following lists

(9, 9)
(3, 27)

They are the same size and the product of the elements are the same.

Greg
If the products are different, then the lists are different.
EvilTeach
@EvilTeach - True but if the products are the same, you don't know if the lists are the same.
Greg
A: 

How fast do you need to process 8 integers? Sorting 8 things in any modern processor is going to take almost no time.

The easy thing is to just use an array of size 66 where index 0 represents value -2. Then you just increment counts across both arrays, and then you just iterate across them afterwards.

Anon
It's not the single case that I'm worried about. Its because what I need it for will be processing 100k+ lists.
Justin
+1  A: 

Since you only have 66 possible numbers, you can create a bit vector (3 32-bit words or 2 64-bit words) and compare those. You can do it all with just shifts and adds. Since there are no comparisons required until the end (to find out if they are equal), it can run fast because there won't be many branches.

Gabe
+6  A: 

Assuming you know the range ahead of time, you can use a variation on counting sort. Just scan through each array and keep track of how many times each integer occurs.

Procedure Compare-Lists(A, B, min, max)
  domain := max - min
  Count := new int[domain]
  for i in A:
    Count[i - min] += 1
  for i in B:
    Count[i - min] -= 1
    if Count[i - min] < 0:
      // Something was in B but not A
      return "Different"
  for i in A:
    if Count[i - min] > 0:
      // Something was in A but not B
      return "Different"
  return "Same"

This is linear in O(len(A) + len(B))

jleedev
Note that you can stop early in the `B` part if `Count` gets negative.
schnaader
@schnaader Good call; I’ve added that.
jleedev
Couldn't we change `for k in Count{...}` to `for i in A{...},for i in B{...}` for large `domain` values and small arrays like the OP's example? This way, we'd skip untouched entries in `Count`, this should give `O((len(A)+len(B))*2)`.
schnaader
Scan through each array twice? That sounds quite sane. In fact, you can just loop through `A`, `B`, and then `A` again.
jleedev
The runtime is O(2*len(A)+len(B)) right? And memory would be O(n). At that point I'd think that just sorting would be faster?
Justin
@Justin Big-O notation is somewhat irrelevant when you know that there are 8 elements of 66 possible values. But yeah, sorting could be faster.
jleedev
@Justin: That was my thought, too (I commented on the question.) But as @jleedev pointed out, with so few items, big O becomes less relevant.
JoshD
A: 

If your list has only 8 items then sorting is hardly a performance hit. If you want to do this without sorting you can do so using a hashmap.

  1. Iterate over the first array and for each value N in the array Hash(N) = 1.
  2. Iterate over the second array and for each value M, Hash(M) = Hash(M) + 1.
  3. Iterate over the hash and find all keys K for which Hash(K) = 2.
Sagar V
+2  A: 

You could do this with primes. Keep a prime table for the first 66 primes and use the elements of your arrays (offset by +2) to index into the prime table.

The identity of an array is then just the product of the primes represented by the elements in the array.

Unfortunately, the product must be represented with at least 67 bits:

  • The 66th prime is 317, and 3178 = 101,970,394,089,246,452,641
  • log2(101,970,394,089,246,452,641) = 66.47 (rounded up) is 67 bits

Example pseudocode for doing this (assuming the existence of an int128 data type):

int primes[] = 
{
      2,   3,   5,   7,  11,  13,  17,  19,  23,  29,
     31,  37,  41,  43,  47,  53,  59,  61,  67,  71,
     73,  79,  83,  89,  97, 101, 103, 107, 109, 113,
    127, 131, 137, 139, 149, 151, 157, 163, 167, 173,
    179, 181, 191, 193, 197, 199, 211, 223, 227, 229,
    233, 239, 241, 251, 257, 263, 269, 271, 277, 281,
    283, 293, 307, 311, 313, 317
};

// Assumes:
// Each xs[i] is [-2, 63]
// length is [1, 8]
int128 identity(int xs[], int length)
{
    int128 product = 1;

    for (int i = 0; i < length; ++i)
    {
        product *= primes[xs[i] + 2];
    }

    return product;
}

bool equal(int a[], int b[], int size)
{
    return identity(a, size) == identity(b, size);
}

You might be able to use a long double on GCC to store the product since it is defined as an 80-bit data type, but I'm not sure if the floating-point multiplication error would cause collisions between lists. I haven't verified this.


My previous solution below does not work, see the comments below.

For each list:

  • Compute the sum of all elements
  • Compute the product of all elements
  • Store the length of the list (in your case, since the length is guaranteed to be the same for two lists, you can ignore it entirely)

As you compute the sum and product, each element needs to be adjusted by +3, so your range is now [1, 66].

The (sum, product, length) tuple is the identity for your list. Any lists with the same identity are equal.

You can fit this (sum, product, length) tuple into a single 64-bit number:

  • For the product: 668 = 360,040,606,269,696, log2(360,040,606,269,696) = 48.36 (rounded up) is 49 bits
  • For the sum: 66 * 8 = 528, log2(528) = 9.04 (rounded up) is 10 bits
  • Length is in the range [1, 8], log2(8) = 3 bits
  • 49 + 10 + 3 = 62 bits for representing the identity

Then, you can do direct 64-bit comparisons to determine equality.

Running-time is linear in the size of the arrays with a single pass over each. Memory usage is O(1).

Example code:

#include <cstdint>
#include <stdlib.h>

// Assumes:
// Each xs[i] is [-2, 63]
// length is [1, 8]
uint64_t identity(int xs[], int length)
{
    uint64_t product = 1;
    uint64_t sum = 0;

    for (int i = 0; i < length; ++i)
    {
        int element = xs[i] + 3;
        product *= element;
        sum += element;
    }

    return (uint64_t)length << 59 | (sum << 49) | product;
}

bool equal(int a[], int b[], int size)
{
    return identity(a, size) == identity(b, size);
}

void main()
{
    int a[] = { 23, 0, -2,  6,  3, 23, -1 };
    int b[] = { 0, -1,  6, 23, 23, -2,  3 };

    printf("%d\n", equal(a, b, _countof(a)));
}
Chris Schmich
This won't work. You can see for yourself that 1*6*6=2*2*9 and 1+6+6=2+2+9. Therefore, your algorithm will declare lists {-2,3,3,-2,-2,-2,-2,-2} and {-1,-1,6,-2,-2,-2,-2,-2} equal to each other.
@user434507: you're absolutely correct, I'll post an alternate solution shortly.
Chris Schmich