I am interested in getting the remainder of the Euclidean division, that is, for a pair of integers (i, n), finding r such that:

i = k * n + r,    0 <= r < |n|

The simple solution is:

int euc(int i, int n)
{
    int r;

    r = i % n;
    if ( r < 0) {
        r += n;
    }
    return r;
}

But since I need to execute this tens of millions of times (it is used inside an iterator for multidimensional arrays), I would like to avoid the branching if possible. Requirements:

  • Branching but faster is also desirable.
  • A solution which works only for positive n is acceptable (but it has to work for negative i).
  • n is not known in advance, and can be any value > 0 and < INT_MAX

Edit

It is actually quite easy to get the result wrong, so here is an example of the expected results:

  • euc(0, 3) = 0
  • euc(1, 3) = 1
  • euc(2, 3) = 2
  • euc(3, 3) = 0
  • euc(-1, 3) = 2
  • euc(-2, 3) = 1
  • euc(-3, 3) = 0

Some people also worry that it does not make sense to optimize this. I need this for a multi-dimensional iterator where out-of-bounds items are replaced by items from a 'virtual array' which repeats the original array. So if my array x is [..., 1, 2, 3, 4, ...], the virtual array is [..., 1, 2, 3, 4, 1, 2, 3, 4, 1, 2, 3, 4, 1, 2, 3, 4, ...], and, for example, x[-2] is x[2], etc.

For an nd-array of dimension d, I need d Euclidean divisions for every point. If I need to do a correlation between an n^d array and an m^d kernel, I need n^d * m^d * d Euclidean divisions. For a 3d image of 100x100x100 points and a kernel of 5x5x5 points, that's already ~400 million Euclidean divisions.
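To make the intended use concrete, here is a minimal sketch (mine, not from the original post) of the wrapped lookup:

/* Hypothetical illustration: out-of-bounds indices wrap around as if
 * the array repeated forever in both directions. */
#include <stdio.h>

static int euc(int i, int n)
{
    int r = i % n;
    if (r < 0)
        r += n;
    return r;
}

int main(void)
{
    int x[4] = {1, 2, 3, 4};

    /* virtual x[-2] is real x[euc(-2, 4)] == x[2] == 3 */
    for (int i = -4; i < 8; ++i)
        printf("virtual x[%d] = %d\n", i, x[euc(i, 4)]);
    return 0;
}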

+6  A: 

Edit: No multiplication or branches, woot.

int euc(int i, int n)
{
    int r;

    r = i % n;
    r += n & (-(r < 0));

    return r;
}

Here is the generated code. According to the MSVC++ instrumenting profiler (my testing) and the OP's testing, they perform nearly the same.

; Original post
00401000  cdq              
00401001  idiv        eax,ecx 
00401003  mov         eax,edx 
00401005  test        eax,eax 
00401007  jge         euc+0Bh (40100Bh) 
00401009  add         eax,ecx 
0040100B  ret              

; Mine
00401020  cdq              
00401021  idiv        eax,ecx 
00401023  xor         eax,eax 
00401025  test        edx,edx 
00401027  setl        al   
0040102A  neg         eax  
0040102C  and         eax,ecx 
0040102E  add         eax,edx 
00401030  ret
280Z28
There is a good chance the compiler will actually use a branch to compute (r < 0), and multiplications are expensive.
Nixuz
Unfortunately, this method does not work either. For example, euc(-3, 3) returns 3 (if n == 3, the returned value should be in [0, 2], and -3 = -1 * 3 + 0).
David Cournapeau
oops, that happened because I changed `r` to `i`. Fixed.
280Z28
On a core 2 duo, both are basically the same, using a simple loop with rdtsc for testing. Looks like getting it significantly faster is quite difficult.
David Cournapeau
I got similar results here in the MSVC++ instrumenting profiler.
280Z28
+1  A: 
int euc(int i, int n)
{
    return (i % n) + (((i % n) < 0) * n);
}
Jason
A: 

"Premature optimization is the root of all evil"

piotr
good thing it's mature
p00ya
why the downvotes?
piotr
A: 

If you can also guarantee that i is never less than -n, you can simply put the optional addition before the modulo. That way, you don't need the branch, and the modulo cuts out what you added if you didn't need to.

int euc(int i, int n)
{
    return (i + n) % n;
}

If i is less than -n, you can still use this method. In a case like this, you probably know exactly what range your values will be in. Therefore, instead of adding n to i, you can add x*n to i where x is any integer that gives you a sufficient range. For added speed (on processors that don't have single-cycle multiply), you could left-shift instead of multiplying.
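For concreteness, here is a sketch of that scaled-offset variant (my illustration, not arke's code; it assumes i never drops below -16*n and that 16*n does not overflow an int):

/* Sketch of the scaled-offset variant: add a larger multiple of n up
 * front so (i + 16*n) is non-negative, then let the modulo strip it
 * back out. Only valid if i >= -16*n and 16*n fits in an int. */
int euc_shift(int i, int n)
{
    return (i + (n << 4)) % n;   /* n << 4 == 16 * n */
}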

arke
This doesn't work for `(i + n) < 0`
280Z28
Yup. Neither does the OP's code.
arke
That's a simple and nice solution. It is again not really faster on the Core Duo, but it is noticeably faster (~15%) than the other methods on the Pentium 4.
David Cournapeau
@arke: I believe my code is correct for any i: your code returns -2 for f(-5, 3), but mine returns 1, as expected.
David Cournapeau
Oops, you're right. Still, you can always scale n before you add it :). Now that I think about it, I'd left-shift instead of multiplying. Should still be faster on the P4.
arke
+2  A: 

Integer multiplication is much faster than division. For a large number of calls with a known N, you can replace division by N by multiplication by a pseudo-inverse of N.

I will illustrate this with an example. Take N=29. Then compute, once, a pseudo-inverse 2^16/N: K=2259 (truncated from 2259.86...). I assume I is positive and I*K fits in 32 bits.

Quo = (I*K)>>16;   // replaces the division, Quo <= I/N
Mod = I - Quo*N;   // Mod >= I%N
while (Mod >= N) Mod -= N;  // compensate for the approximation

In my example, taking I=753, we get Quo=25 and Mod=28 (no compensation needed).
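Put together as a function, the idea might look like this sketch (my wording, assuming N fixed at 29, I >= 0, and I*K fitting in 32 bits, as stated above):

/* Sketch of the fixed-divisor trick for N = 29.
 * Assumes I >= 0 and that I * K fits in 32 bits. */
static int mod29(int I)
{
    const int N = 29;
    const int K = 65536 / N;     /* truncated pseudo-inverse: 2259 */

    int Quo = (I * K) >> 16;     /* approximate quotient, Quo <= I / N */
    int Mod = I - Quo * N;       /* approximate remainder, Mod >= I % N */
    while (Mod >= N)             /* compensate for the truncation */
        Mod -= N;
    return Mod;
}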

EDIT.

In your 3D convolution example, most of the calls to i%n will be with i in 0..n-1, so in most cases, a first line like

if (i>=0 && i<n) return i;

will bypass the costly and here useless idiv.

Also, if you have enough RAM, just align all dimensions to powers of 2 and use bit manipulations (shift,and) instead of divisions.

EDIT 2.

I actually tried it on 10^9 calls. i%n: 2.93s, my code: 1.38s. Just keep in mind it implies a bound on I (I*K must fit in 32 bits).

One other thought: if your values are x+dx, with x in 0..n-1 and dx small, then the following will cover all the cases:

if (i<0) return i+n; else if (i>=n) return i-n;
return i;
Eric Bainville
You're right that my analysis was very coarse, and indeed, in the iterator, I do exactly this test before handling the out-of-bounds situation. I cannot assume anything about memory (and if memory is not a problem, FFT-based convolution/correlation is much faster for most cases anyway).
David Cournapeau
+1  A: 

I timed everyone's proposals in gcc -O3 using TSC (except the one for constant N), and they all took the same amount of time (within 1%).

My thought was that either ((i%n)+n)%n (no branching), or (i+(n<<16))%n (obviously fails for large n or extremely negative i) would be faster, but they all took the same time.

wrang-wrang
I'll just add that this is still true as of now - nobody has a faster implementation (there are huge differences if you compile without optimization; none with).
wrang-wrang
A: 

For tens of millions this is not worth thinking about. Maybe for tens of billions, but then do proper benchmarking.

starblue
+1  A: 

I really like the expression:

r = ((i%n)+n)%n;

The disassembly is very short:

r = ((i%n)+n)%n;

004135AC  mov         eax,dword ptr [i] 
004135AF  cdq              
004135B0  idiv        eax,dword ptr [n] 
004135B3  add         edx,dword ptr [n] 
004135B6  mov         eax,edx 
004135B8  cdq              
004135B9  idiv        eax,dword ptr [n] 
004135BC  mov         dword ptr [r],edx

It has no jumps (2 idivs, which might be costly), and it can be completely inlined, avoiding the overhead of a function call.

What do you think?

abelenky
It's longer than mine, which also doesn't have any jumps and can be inlined? It is another clever one though. :)
280Z28
btw: that is the un-optimized assembly. With optimization, the compiler should avoid the redundant load of [n], instead using a register.
abelenky
+1  A: 

If the range is low enough, create a look-up table (a two-dimensional array). Also, you can make the function inline, and make sure it is by looking at the produced code.
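For illustration only, here is one way such a table might be set up (sizes and names are hypothetical; it only works when n and |i| stay within the precomputed bounds):

/* Hypothetical look-up table sketch: only usable if n and the index
 * range are small and known up front. Bounds are illustrative. */
#define MAX_N     8
#define MAX_ABS_I 64

static int euc_table[MAX_N + 1][2 * MAX_ABS_I + 1];

static void init_euc_table(void)
{
    for (int n = 1; n <= MAX_N; ++n)
        for (int i = -MAX_ABS_I; i <= MAX_ABS_I; ++i)
            euc_table[n][i + MAX_ABS_I] = ((i % n) + n) % n;
}

/* Caller must guarantee 1 <= n <= MAX_N and |i| <= MAX_ABS_I. */
static inline int euc_lut(int i, int n)
{
    return euc_table[n][i + MAX_ABS_I];
}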

Liran Orevi
+4  A: 

I think 280Z28 and Christopher have the assembler golf covered better than I would, and that deals with random access.

What you're actually doing, though, seems to be processing entire arrays. Obviously for reasons of memory caching you already want to be doing this in order if possible, since avoiding a cache miss is a many, many times better optimisation than avoiding a small branch.

In that case, with a suitable bounds check first you can do the inner loop in what I will call "dashes". Check that the next k increments don't result in an overflow in the smallest dimension on either array, and then "dash" k steps using a new even-more-inner loop which just increments the "physical" index by 1 each time instead of doing another idiv. You or the compiler can unroll this loop, use Duff's device, etc.

If the kernel is small, and especially if it is fixed size, then that (or a multiple of it with suitable unrolling to occasionally subtract instead of adding), is probably the value to use for the length of the "dash". Compile-time constant dash length is probably best, since then you (or the compiler) can fully unroll the dash loop and elide the continuation condition. As long as this doesn't make the code too big to be fast, it essentially replaces the entire positive-modulo operation with an integer increment.

If the kernel is not fixed size, but is often very small in its last dimension, consider having different versions of the comparison function for the most common sizes, with the dash loop fully unrolled in each.

Another possibility is to calculate the next point at which an overflow will occur (in either array), and then dash to that value. You still have a continuation condition in the dash loop, but it goes as long as possible using only increments.
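As a rough one-dimensional sketch of the dash idea (identifiers are mine, not from the answer):

/* Rough sketch of a "dash" over one dimension of length n: compute
 * the wrapped physical index once, then advance it with a plain
 * increment, wrapping with a single compare instead of an idiv per
 * element. */
static long sum_wrapped(const int *x, int n, int start, int count)
{
    long sum = 0;
    int phys = ((start % n) + n) % n;   /* one Euclidean mod per dash */

    for (int k = 0; k < count; ++k) {
        sum += x[phys];                 /* per-element work goes here */
        if (++phys == n)                /* cheap wrap, no division */
            phys = 0;
    }
    return sum;
}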

Alternatively, if the operation you're doing is numeric equality or some other simple operation (I don't know what a "correlation" is), you could look at SIMD instructions or whatever, in which case the dash length should be a multiple of the widest single-instruction compare (or appropriate SIMD op) on your architecture. This isn't something I have experience with, though.

Steve Jessop
I don't think the original version of the question mentioned the application, but this is the only useful answer so far. Nobody has a faster implementation of euc, but it's definitely possible to speed up the higher level computation.
wrang-wrang
The original question only hinted, saying it was used millions of times. I don't think I posted this until after an edit said that we had (at least sometimes) sequential access. And to be fair, useful != answering the question: OP asked for a branchless implementation, not a fast one, and said "branching but faster is also desirable" ;-) Even with -O3 the OP's code has a branch on GCC 3 (all I have on this machine right now).
Steve Jessop
+3  A: 

Without a branch, but with a bit of bit-fiddling:

int euc2(int i, int n)
{
    int r;
    r = i % n;
    r += (((unsigned int)r) >> 31) * n;
    return r;
}

Without a multiply:

int euc2(int i, int n)
{
    int r;
    r = i % n;
    r += (r >> 31) & n;
    return r;
}

This gives :

; _i$ = eax
; _n$ = ecx

cdq
idiv   ecx
mov eax, edx
sar eax, 31
and eax, ecx
add eax, edx
Christopher
the second version only works if signed right shifts are arithmetic
Christoph
It also assumes 32 bit integers, so the source would have to come with porting guidelines (or use stdint.h). You can force a logical shift by casting to unsigned, and then manipulate: ~((((unsigned)r)>>31)-1). Or something.
Steve Jessop
You can easily substitute the 31, e.g. (sizeof(int) * 8 - 1). When no arithmetic shift exists, you have to abandon this and use something more appropriate. ~((((unsigned)r)>>31)-1) looks good at first glance.
Christopher
Actually, if you want to be really portable, there is no guarantee that a char is 8 bits. That's solved with CHAR_BIT. However, there is also no guarantee for int that all bits of the storage representation participate in the value; IIRC only char has that guarantee. A conformant implementation could have a 9-bit byte with 32-bit integers (in which case `sizeof(int)*8` works but `sizeof(int)*CHAR_BIT` doesn't), or a 9-bit byte with 36-bit integers (in which case `sizeof(int)*CHAR_BIT` works but `sizeof(int)*8` doesn't). I think that's why C++ has `numeric_limits::digits`.
Steve Jessop
... in summary, I think you just require the porter to do something platform-specific, or settle for some slightly less good portable version if they don't want to or can't beat it. On some platforms the porter will want to do it in asm anyway. This is pretty much what static inline functions are for, after all :-)
Steve Jessop
@onebyone.livejournal.com: it would still work for 9-bit bytes and 32-bit integers: for 32-bit integers, you can shift arithmetically by any value >= 31 to get all 1's for negative values and 0 for non-negative values
Christoph
No, a shift by too many bits is undefined. From this draft standard (http://www.vmunix.com/~gabor/c/draft.html#101): "If the value of the right operand is negative or is greater than or equal to the number of value and sign bits in the object representation of the promoted left-operand, the behavior is undefined". Note "value and sign bits", which is the size of the integer (32), not necessarily the full number of bits in the bytes used for storage.
Steve Jessop
If you find yourself in a place where the algorithm fails, fall back to the original algorithm; it is not very hard to check. This is more or less a question about a low-level optimization, which introduces a trade-off. BTW: hand-generated assembler won't really help here. This is as simple as it gets. Only optimizations on a higher level (or special instruction sets) will make this faster, or am I missing something?
Christopher
This is also not any faster than the original version with gcc -O3
wrang-wrang
No, you haven't missed anything, I'm just speculating that there exists an architecture with a positive-modulus instruction. Not sure whether to call that instruction "special" or not. The compiler may not be able to figure out that this multi-line trickery is mathematically equivalent to positive-mod, so it's not "as simple as it gets" for that instruction set. Hence for optimal performance on that platform, the porter would want to asm it.
Steve Jessop
A: 

You say in your answer to Eric Bainville that most of the time 0 <= i < n and that you have

if (i>=0 && i<n) return i;

as the first line of your euc() anyway.

As you are making the comparisons anyway, you might as well use them:

int euc(int i, int n)
{
    if (n <= i)            return i % n;
    else if (i < 0)        return ((i + 1) % n) + n - 1;
    else /* 0 <= i < n */  return i;  // fastest possible response for common case
}
Dave Hinton
doesn't work for `i = -n` as it will return `n` instead of `0`
Christoph
Good spot. Fixed.
Dave Hinton
A: 

If you can guarantee that the dimensions of your array are always powers of two, then you can do this:

r = (i & (n - 1));

If you can further guarantee that your dimensions will be from a given subset, you can do:

template<int n>
int euc(int i) {
    return (i & (n - 1));
}

int euc(int i, int n) {
    switch (n) {
        case 2: return euc<2>(i);
        case 4: return euc<4>(i);
        default: return ((i % n) + n) % n;  // fallback so every path returns a value
    }
}
Eric
A: 

Here's Christopher's version with a fallback to Jason's if right shifting is not arithmetic.

#include <limits.h>
static inline int euc(int i, int n)
{
    // check for arithmetic shift
    #if (-1 >> 1) == -1
        #define OFFSET ((i % n >> (sizeof(int) * CHAR_BIT - 1)) & n)
    #else
        #define OFFSET ((i % n < 0) * n)
    #endif

    return i % n + OFFSET;
}

The fallback version should be slower as it uses imul instead of and.

Christoph
As I mentioned earlier, these are all the same speed (even the original code) with g++ -O3.
wrang-wrang