ansaurus

Question

SOLVED: Which binary operator or function results in the least collisions?

Answer 1

A:

As long as you limit the set of possible output values to the SAME set of values as the operands you are using as operands, then you cannot do any better than addition. Addition, actually, is probably the best possible choice, because it is simplest. (See analysis below)

There are 2^64 possible longs, so there are 2 ^ 127 possible unordered pairs of longs, and only 2^64 possible longs for an answer, so the best possible distribution ratio is to have 2^ 63 different pairs that give the same answer, which addition, (with rollover) in fact, will do

EDIT: based on comments below.

however many (say it's N bits) a long is, there are 2^N different longs, so there are 2^N x 2^N ordered pairs of longs, but for the purposes of this analysis, using two longs x, and y is exactly the same as using y and x (the binary op is assumed to be communitative), so there are 2^ (2N-1) unordered pairs of longs.

so using unordered pairs (half as many) there are 2^N x (2^N-1) or 2^ (2N-1) pairs of longs without duplicates. (If N = 64, that's 2^127) So the maximum "distribution" of assignment of answers (from the smaller set of 2^64 longs) to the unordered pairs of operands (the larger set of 2^ 127 pairs) is if they are equally distributed. Thats what addition would do, because for each of possible long in the set of all longs, the sum of it with every other long (with rollover) will be a set of ... every long.

the only thing using ordered pairs does is allow you to use a non-communtative operand as well, but then you have to deal with all the cases where the answer is not in the set you are using for operands (like 5/4 )but even if you just assume rounding, the only impact on the analysis is that by using ordered pairs you get 2^2N different pairs of operands, instead of 2^(2N-1).

What you might do is limit the set of integers to be used as the operands to less than the square root of the number of possible longs (so if you are using 64 bit longs, limit your input values to 32 bit longs) Then, if you want absolutely no overlap or duplication (No case where A op B = same value as any other C op D), you can use the multiplication operator but for each value X in the smaller set of potential operands, choose the Xth prime number as the multiplicative operand. that way, no matter what two values A and B you randomly pick (from 1 to Max) the operantion would be multiplication of two different primes. This means the set of possible operands has to be smaller than the set of prime numbers equal to or less than the maximum possible value you are using for operand (if it's 64 bit unsigned longs, that 2^64)

2nd EDIT: based on yr specific problem, the set of possible operands is limited to the dimensions of a computer screen, considerably less than the number of longs, (no matter what platform you are on) So a very easy and obvious way to guarantee that every pair of possible screen cooridinates will generate a distinct and different seed key is to simply left shift the value of one coordinate sufficiently to guarantee no bitwise overlap with the other coordinate, and then bitwise Or the result with the other coordinate.

So if your screen is say, 3000x3000, then long lngVal = (x<<12 | y) would do it with minimal computational overhead as well.

Charles Bretana 2009-09-11 23:59:42

A long is 32 bits, in standard C at least.

Artelius 2009-09-12 00:02:14

A long is *not* 32 bits in standard C, unless you're using some new definition of the word "standard" of which I was previously unaware.

paxdiablo 2009-09-12 00:14:07

If I had time, I'd check the standard. I happen to be in a hurry. If you're going to flame me, you could at least explain what a long is actually defined as rather than just saying I'm wrong.

Artelius 2009-09-12 00:26:00

@Atelius: From 3.9.1 Fundamental Types: " There are five standard signed integer types : “signed char”, “short int”, “int”, “long int”, and “long long int”. In this list, each type provides at least as much storage as those preceding it in the list." The standard doesn't specify their actual sizes, just the relationships. There *are* specific types with set sizes (char32_t et al) but long is not one of them.

paxdiablo 2009-09-12 01:03:13

And it wasn't meant to be a flame. I just don't like to see people suffering under misapprehensions. I, for one, pronounced paradigm "pa ra dij m" for quite a while before someone set me straight that it was "pa ra dime" - I wonder how many people during that time wondered WTF I was talking about :-)

paxdiablo 2009-09-12 01:05:50

The "previously unaware" comment is a tribute to Douglas Adams (of HHGTTG fame) by the way.

paxdiablo 2009-09-12 01:09:02

Hah! for the first 5 years that I knew the word (from building acft models), I thought fuselage was few-SELL-a-gee" !

Charles Bretana 2009-09-12 03:11:44

Just to provide some concrete examples, a long is *not* 32 bits in OS X or most (all?) linuxes when running on x86_64. In fact, windows is the only OS I know of on which a long is 32 bits for 64-bit code.

Stephen Canon 2009-09-12 04:33:15

Can you guys explain to me why "the best possible distribution ratio is to have 2^ 63 different pairs that give the same answer"?

Penchant 2009-09-12 05:01:04

Thanks Pax, I stand corrected. Nonetheless, on a 32-bit machine longs are usually 32 bits.

Artelius 2009-09-12 09:48:59

@Artelius Unless you are programming in Java, where long is 64 bits. Anyway, for the purpose this question it doesn't matter.

starblue 2009-09-12 10:48:58

-1 for getting the probabilities wrong (hint: use ordered pairs).

starblue 2009-09-12 10:49:55

Right, Java! Today is not my day.

Artelius 2009-09-12 11:38:30

@Penchant, see additional analysis in my answer.

Charles Bretana 2009-09-12 15:41:29

@starblue, why use ordered pairs ? for any communitative operation, duplicate pairs give (x,y and y,x) will give the same answer.

Charles Bretana 2009-09-12 15:42:49

I never said that each solar system took only one pixel, so your assumption that they will not be much higher is wrong. My intention is that there be hundreds of millions of them. Sorry, if my problem is unclear.

Penchant 2009-09-13 02:26:51

@Penchant, no matter how many 'pixels' each solar system takes, the key to the answer to yr question lies in the ratio between that number and the number of different longs that exist. If the 'coordinates' of your solar system will be nbetween 1 and 2^32 (and that's a very large number, then you can use the approach I mentioned (generating a 64 bit long whose low 32 bits are from x-ccord, and high 32 bits are from the y-coord) and still not ever have a duplicate.

Charles Bretana 2009-09-13 04:56:07

Yes, you are right.

Penchant 2009-09-13 05:22:26

@Charles Bretana Yes, but to get the correct probability you have to count the unordered pairs.

starblue 2009-09-13 19:06:32

@starblue, re: probabilities. To be honest, I'm still not getting yr point. What do probailities have to do with this? The only prob involved is prob of encountering a specific set of coordinates, (x,y) - which a) is not specified, so it can be asssumed to be the same as for any other set, and b) has nothing to do with what the outcome of calculating the result of x op y. And, as I said, if the chosen op is commutative, then x op y = y op x, so for the purposes of this analysis, in that case, unordered pairs is a more appropriate choice.

Charles Bretana 2009-09-14 14:03:26

Answer 2

+7 A:

If (x1, y1) and (x2, y2) are two random pairs of inputs, then let f1 = f(x1,y1) and f2 = f(x2,y2).

What you want to do is minimise

P( f(x1,y1) = f(x2,y2) )
 = P(f1 = f2)
 = sum for i in [LONG_MIN ... LONG_MAX]
        of P(f1 = i) * P(f2 = i)
 = sum for i in [LONG_MIN ... LONG_MAX]
        of P(f1 = i)^2

So you want to minimise the sum of the squares of the probabilities of each of your function's outputs. Since the sum of the probabilities must be 1, we know:

sum for i in [LONG_MIN ... LONG_MAX]
     of P(f1 = i)
  = 1

And we also know that for all i, P(f1 = i) is between 0 and 1 (inclusive). Intuitively, then, minimising P(f1 = f2) is a matter of making the probability distribution of f1 as even as possible. (This can be proven mathematically but it's not really important to the question.) Ideally, P(f1 = i) and P(f1 = j) should be the same for all longs i and j.

Now let's look at some different possibilities for the nature of x and y.

First the general case, where x and y are uniformly distributed over the range of a long. (In other words, x is equally likely to be anything a long can be. So is y.) In this case, we can let f(x, y) = x+y, or f(x,y) = x-y, or f(x,y) = x XOR y, or even f(x,y) = x and (assuming normal integer overflow) we find that we have a uniformly distributed f, too, which means these functions are all "optimal".

But the f(x,y) = x example shows you that there's really not that much you can gain here.

However, in practice, your x and y will probably not be uniformly distributed. For instance, if x and y both drawn randomly from the range [0, 9999], then using f(x,y) = x + y * 10000 will always produce different output for different input.

If, in each (x, y) pair, x and y are very likely to be near each other, for instance (1240,1249), (1,3), (-159720,-159721), then f(x,y) = x is actually quite a good candidate function.

If x and y are "probably not huge", then you should combine the 16 low bits of x with the 16 low bits of y, i.e. f(x,y) = ((x&0xFFFF) << 16) | (y&0xFFFF), because the lower bits will be more evenly distributed than the upper bits.

This works very well if x and y are never negative. But if they are, the sign bit (that tells whether the number is positive or negative) might be more evenly distributed than some of the 16 low bits. So you may want to use it instead. E.g.

f(x,y) = ((x&0x7FFF) << 17) | ((y&0x7FFF) << 2) | ((x>>30) & 2) | (y>>31)

As the "probably not huge" case is quite common, I think this function would actually work quite well in general.

Artelius 2009-09-12 00:07:02

Nice. You might also consider to change the multiplier to an odd integer, e.g. f(x,y) = x + 10001*y. This has the advantage that for any y1 != y2 regardless of their size, we have y1*10001 != y2*10001. Hence f(x,y) never leads to a collision just by changing only x or only y.

Accipitridae 2009-09-12 09:48:34

@Accipitridae: Yes, exactly! Great point.

Artelius 2009-09-12 11:37:08

I'm using Java for this program, where longs have 64 bits. I guess all of the constants would in that function will double huh?

Penchant 2009-09-12 14:54:37

Artelius 2009-09-12 23:19:41

@Artelius: I think there is a typo in the formula above. First the bits 31..62 of x do not influence the result. Also at least some of the OR operators should probably be replaced by XORs or addition, since right now there is a bias towards resutls with many bits set to 1.

Accipitridae 2009-09-14 06:49:07

Bits 31..62 of x and y aren't *meant* to influence the result. I'm throwing them away, because I'm assuming that they're almost certainly zero. And if they're nonzero, the lower bits could be anything.

Artelius 2009-09-14 13:47:52

You're right. I missunderstood your proposal.

Accipitridae 2009-09-14 17:59:48

Answer 3

+5 A:

I like the answer and anlysis by Artelius. Especially the proposal to use

f(x,y) = x + y*K

for some constant K is interesting and I'd like to just add a few more thoughts. What I'm doing here is not new, but very closely related to the Fibonacci hashing, which I think has been proposed by Knuth.

If we are using 64-bit integers then a collision f(x1, y1) = f(x2, y2) means

0 = (dx + dy * K) mod 2⁶⁴,

where dx = x1 - x2 and dy = y1 - y2. This is the same as

K = -dx*dy^-1 mod 2⁶⁴,

where dy^-1 is the modular inverse modulo 2⁶⁴. If we want to choose the K such that f(x1, y1) != f(x2, y2) whenever the differences dx and dy are both small then we have to choose K such that

K = -dx*dy^-1 mod 2⁶⁴,

has no solution such that both dx and dy are small. This can be achieved for example by choosing K close to phi * 2⁶⁴, where phi = (sqrt(5)-1)/2 is the golden ratio. The golden ratio has a very special continued fraction expansion, i.e. in a certain sense it is a number that is hard to approximate well with a fraction.

Hence, for 64-bit unsigned integers the following functions could be used

f(x,y) = x + y * 11400714819323198485;

or equivalently when using signed 64-bit integers

f(x,y) = x - y * 7046029254386353131;

Accipitridae 2009-09-12 18:34:30

So you think this is the best possible function for this problem? It is a bit hard to see the math, as I'm still in high school.

Penchant 2009-09-13 02:24:44

But, still, thanks a lot for the effort that you've put into this.

Penchant 2009-09-13 02:27:57

Charles Bretana 2009-09-13 17:06:07

What the "best" solution is depends on how "best" is defined. Modern CPUs have very fast multiplicators, hence it is much less important whether a multiplication can be replaced by shift operations or not. Artelius has a definition for "best": i.e. f(x,y) should be uniformly distributed if the inputs x, and y are uniformly distributed. My proposal has the same property and additionally tries to minimize the number of collisions of inputs when the differences between the inputs are small. But if your requirements change, then of course the solution might as well.

Accipitridae 2009-09-14 06:48:28

ansaurus

tags:

views:

answers:

SOLVED: Which binary operator or function results in the least collisions?

related questions