views:

1174

answers:

20

Hi,

I have a code in which i am computing x % 25. x always takes a positive value but its dynamic range is large.

I found out that this particular code piece of computing a x % 25 is taking large cycles. I need to optimize it.

Pre-computed lookup table is ruled out due to the possible large memory size of the table.

As second approach i coded a fragment below(C code) -

mod(a, b)
{   
    int r = a;  
    while(r >= b)
    {      
     r = r - b;
    }   
    return r;
}

1.) How can i optimize this code further for cycles(squeeze it to max)?

2.) Is there any entirely different optimized way to achieve x % 25( i know its not a common operation, but still, looking for clever inputs people might have used in their experience which might nelp me.).

Thank you.

-AD

EDIT:

I think using a native modulo operator % in C , internally uses a division operation (/) which is costly on the processor i am using.(No div instruction). hence trying to see if custom implemetation can beat the inherent computation using % operator.

-AD

A: 

Why can't you just use the operator %? If this is C code, and the numbers are ordinary "native" int:s, then that should be the fastest way, by far.

unwind
the second argument is fixed; as a specific algorithm can outperform a generic one, hand-optimized code might outperform the compiler
Christoph
A: 

If you don't like % operator:

int mod(int a, int b) {
    int integral = a / b;
    return a - (b*integral);
}
qrdl
Why downvote? What's wrong with the code?
qrdl
I guess since the OP tries to avoid using division, an algorithm with a division in it won't help much (i didn't downvote). OP only clarified this *after* your answer though
Ben Schwehn
Exactly. OP didn't mention initially that division should be avoided as well.
qrdl
Well, he said say the CPU doesn't have a div operator. Surely that would have been a clue? :-)
paxdiablo
He didn't say it initially - it appeared only after edit.
qrdl
A: 

Is there a reason why you cant use C's built in modulus operator?

int a = x % 25;

Following your edit;

If your rpocessor does not have built in modulo support then I would still use the % operator for the simple reason that your compiler will know that the processor in question doesnt have a native % function, and will likely produce asm code to optimally emulate it.

Put it this way - I'd be fascinated if you can come up with a genarl algorithm that outperforms whatevr the compiler produces from using the built in operator, notwithsatanding specific cases (such as simply taking the 2 lowest digits for modulo 100 etc)

Visage
"will likely produce asm code to optimally emulate it" - I doubt that the compiler has optimised code for every constant value. I suspect that it would just use a standard divide/modulus algorithm (except for powers of two) - I doubt that this would be optimal for known constants.
Dipstick
Actually ... it has :)
You can call me Chuck
A: 

I find it pretty strange that the operation x % 25 takes such long time (if you are using the built-in % operator, that is). Most modern processors should do this in a single instruction. I'd look for other reasons that this code takes so long.

EDIT: Here's an algorithm that might at least give some ideas:

256 = 6 (mod 25)

This means that if we write a number x as bytes x3 x2 x1 x0 we have that x = 6^3*x3 + 6^2*x2 + 6*x1 + x0 (mod 25)

This gives an algorithm for reducing the size of x:

int x0 = x & 0xFF, x1 = (x>>8) & 0xFF, x2 = (x>>16) & 0xFF, x3 = (x>>24) & 0xFF;

int y = x4;
y = (y << 2) + (y << 1) + x3;
y = (y << 2) + (y << 1) + x2;
y = (y << 2) + (y << 1) + x1;
y = (y << 2) + (y << 1) + x0;

(here (y << 2) + (y << 1) = 4*y + 2*y = 6*y)

After this y will have the same remainder as x mod 25. Iterating this 1, 2 or 3 times will make y a 17, 11, or 9 bit number, respectively. One of these sizes might be small enough to make a lookup table of.

I SERIOUSLY doubt that this would be faster than the builtin % operator, though.

CAdaker
On many smaller platforms, division (and, by extension, modulo) is very slow compared to other operations. Remember, your target platform might be a toaster! This was true for a long time even for desktop platforms - IIRC 26 cycles for a DIV on 486 or early Pentiums, compared to 1 cycle for addition and a few for multiplication.
peterchen
Yes, but it was not clear from the original question whether the problem really was a lack of DIV instruction or not. And I still think that if your code is slow, the first thought should not be to replace the C compilers built-in arithmetic operators.
CAdaker
div instruction is often (even on x86) implemented in microcode, and thus, slow.
kotlinski
A: 

If you are only considering the number 25 you can use the fact that 25 divies an integer if and only if the last two digits of the integer are 00, 25, 50 or 75. So to get the modulo you consider the last two digits and then subtract the nearest of 00, 25, 50 or 75.

Bessi
And you would get the last two digits how? Modulo-100? :-)
paxdiablo
Numbers are represented in binary form so it is not that easy to work with decimal digits. And how do you find nearest proper divisor? It is obvious for human but there is no such CPU instruction.
qrdl
Maybe his processor has a BCD mode. :-)
Nosredna
It does ofcourse depend on the context. For instance the data might initially be coming from a text file. Although it's clear now from his edit that this is likely not what he was looking for it is a way nontheless.
Bessi
A: 

If you know that b will be a power of 2, you could use bitwise AND instead of the modulo operator. However, the wikipedia page for modulo seems to indicate that any C compiler would notice this and optimize out the modulo anyway.

wkf
25 is not a power of two, hth.
Cirno de Bergerac
Meh, was just offering the only optimization I could think of; maybe it will help someone else.
wkf
+2  A: 

If your C compiler is targeting a CPU with no divide instruction, you can modify your code as follows:

mod(a, b) {
    int s = b + b + b + b;
    int r = a;
    while(r >= s) {
        r -= s;
    }
    while(r >= b) {
        r -= b;
    }
    return r;
}

This works by subtracting the values in chunks of four rather than one, right up until the last one then it switches to subtracting chunks of one.

This should make your code run about four times as fast (assuming 4*b isn't outside the range of your integers). You could even insert more loops (say an 8*b one) before the 4*b one for even more speed.

Other than that, hand-coding assembler may help but I think you'll find quite a boost from the above code without it.

If you know more detail on the way you'll be using the mod call, you can optimize it for your particular cases. For example, if you only want to know modulo 25 of a 16-bit integer, the following code will be much faster than a simplistic loop with variable denominator.

int mod25 (int a) {                // a has maximum value of 2^15-1 = 32767
    while (a >= 15625) a-= 15625;  // at most 2 times.
    while (a >= 625) a-= 625;      // at most 24 times.
    while (a >= 25) a-= 25;        // at most 24 times.
    return a;
}

Running a test, I find that you have to do 10 million iterations before a noticeable difference appears between that modulo code and the use of the % operator (2 seconds vs. 0 seconds). Up until that point, they were both 0 seconds, although that was run on a fast machine (better for mod25) and with a div instruction (better for % operator) so you'd need to benchmark it on your own hardware.

This is about as fast as you're likely to get without making your code unreadable (although even that shouldn't stop you if you're willing to add lots of comments explaining how it works).

A more general solution for any denominator is to first double the denominator (with bit shifts for speed) as far as possible so that the ensuing subtractions are minimized. Then, as the numerator reduces below the increased denominator, halve the denominator and keep going (until the denominator is back at the start).

int mod (int n, int d) {
    /* dx is the adjusted denom, don't let it overflow though. */
    int dx = d;
    while (((dx << 1) >>1) == dx)
        dx <<= 1;

    /* This loop processes the dx values until they get too small. */
    while (dx >= d) {
        /* This loop subtracts the large dx value. */
        while (n >= dx)
            n -= dx;
        dx >>= 1;
    }
    return n;
}

This actually performs on par with the optimized version of mod25 above while providing a more general solution.

paxdiablo
Considering your large numbers you might want to have s = b*16 rather than 4. You can do that with a 4 bit shift left shift instruction.
Tom Leys
+3  A: 

The problem with your loop is that it's O(n) - it'll be very slow for large values of r. I'd suggest something like this:

for (int s = MAX_SHIFT; s>=0; s--)
  if (r > (b<<s)) r -= (b<<s);

But I doubt that your compiler is doing anything much more expensive than that.

Niki
I assume you need to set MAX_SHIFT dynamically to ensure that (b<<s) doesn't overflow, yes?
paxdiablo
+6  A: 

Since you want the modulus by a constant, you can probably beat it by using reciprocal multiplication. This paper shows how you can divide by a constant in such a manner, and towards the end, how to get the remainder from it.

Cirno de Bergerac
Before optimizing anything, always check the disassembly. I recently discovered the reciprocal trick with code like : int a = x % 3; int b = x / 3; This code ended up as a single multiplication and a shift.
You can call me Chuck
A: 

How about:

int y = 0, x = (x & 0x7f); 
while (x > 25) { x -= 25; y++; }

Update: it's pretty wrong :) But the idea is there.

leppie
+7  A: 

Oh my <deity of choice>. I can't believe some of these answers.

First thing, repeated subtraction, even Pax's version, will never, ever be optimal. Consider, the following:

20 % 25

that's easy and fast using repeated subtraction, but:

65535 % 25

will be horribly slow, 600+ iterations. That's an average of 300 iterations for 16 bit numbers. As for 32 bit number, well, just don't even go there.

The fastest way to do this is to use long division. See Niki's answer.

But, this is what the compiler will be generating anyway, at least, one would hope it is what the compiler is generating. It's always best to check if you're using a compiler for a niche processor.

The best way to speed this up is to not do the modulus in the first place. Why do you need to get the modulus and can you re-factor the code / algorithm to avoid the modulus, or at least, make the modulus trivial.

Skizz

Skizz
+2  A: 

I was inspired by Pax's answer and made a more general purpose algorithm.

int mod(int a, int b) {
    int s = b;
    while (s <= a) {
        s <<= 1;
    }
    int r = a;
    while (r >= b) {
     s >>= 1;
     if (s <= r) {    
      r -= s;
     }
    }
    return r;
}

This subtracts power of two multiples of b from a until the result is found.

EDIT: added the if condition to make it work properly.

As an example, if this is doing 100 % 7, it first works out that 7 * 2 * 2 * 2 * 2 = 112. Then it divides 112 (s) by 2 and subtracts that from 100 (r) (when s <= r) and continually does this until the modulo is found. Therefore,

s = 112 / 2 = 56, r = 100 - 56 = 44
s = 56 / 2 = 28, r = 44 - 28 = 16
s = 28 / 2 = 14, r = 16 - 14 = 2

therefore, 100 % 7 = 2

David Johnstone
I'm leaving this comment as a bookmark so I can remember to check back and try to formally prove this later ;)
Paul Fisher
+1  A: 

Possibly not the fastest but reasonably efficient. I haven't got time to test, but use a look up table of (powers of 2) * 25 up to the maximum range/2. Then do a loop. E.g. range up to 3199 needs 7 iterations.

static int pow[] = {25, 50, 100, 200, 400, 800, 1600};

int mod25(int x)
{    
    int i = sizeof pow /sizeof pow[0];

    while (i--)
    {
        if (x >= pow[i])
            x -= pow[i];    
    }    
    return x;
}

If you have a very large range but low values are more common then it might be worthwhile usng a binary chop to find the starting point.

Dipstick
+2  A: 

On many processors, integer multiplication is faster than integer division. This blog post shows how to replace a constant integer division with a constant integer multiplication. By rearranging the maths a bit you can get the remainder instead of the quotient. Note, however, that if you are using a moderately sophisticated compiler, then this is already done for you. You just write x % 25 and the compiler works out the rest. You should check the generated assembly code for your code, verifying that the compiler has not done this already, before doing this optimisation in C. Also, you should measure (profile) the performance before and after to ensure that you really are making things faster.

Looping will be far slower than doing the division using the native instruction for reasonably large operands.

Edit: see also this paper.

Doug
+15  A: 

I suggest reading Hacker's Delight. It describes very fast remainder algorithms for constant divisors. They would almost certainly beat a general algorithm.

Update: Here is some example code... It can probably be reworked to avoid the temporary long long.

unsigned mod25(unsigned n)
{
    unsigned reciprocal = 1374389535; // 2^35 / 25
    unsigned div25 = ((unsigned long long)n * reciprocal) >> 35;
    return n - div25 * 25;
}
kotlinski
Yeah there's a fast division by 5 there. Do that twice and you're set if you have a fast multiply. All depends on details of his processor (word size, instructions) and the desired range of inputs. There's also a cool modulo 5, which would probably help.
Nosredna
GCC on x86 will use this algorithm for computing `% 25` - if you check the disassembly, you'll find the magic number, a `mull` and a `shrl` instruction (the shift will only be by 3 and not 35 because of value placement in the registers)
Christoph
+5  A: 

Here's the best I could come up with:

int mod25(int x)
{
    while((x = (x & 31) + 7 * (x >> 5)) >= 25)
     x -= 25;

    return x;
}

It approximates x % 25 with x % 32 + 7 * (x/32). The value will overshoot by a multiple of 25, which allows for recursion.

Performance seems to be adequate: A value of x = 2147483647 (aka INT_MAX) needs 11 iterations.

Christoph
A: 

If you kept your numbers in BCD or a byte array of digits, this would be pretty easy. Unfortunately, I have no idea what else you're doing in your program with these numbers. Sometimes it pays to look at how you represent your data rather than just bang away at algorithms.

Nosredna
+1  A: 
int mod25(int x) {
  static int divisors[] = {2147483625, 244140625, 9765625, 390625, 15625, 625, 25};
  int i;
  for (i = 0; i < sizeof(divisors)/sizeof(int); i++) {
    int divisor = divisors[i];
    while (x >= divisor) {
      x -= divisor;
    }
  }
  return x;
}

How it works: We want to decrement x by large multiples of 25 to reduce the value as fast as possible. When the divisor is too big we switch to a smaller multiple of 25. If the divisor is already down to 25 then we're done.

You could try experimenting with different divisors. You just want to make sure that:

  • they're descending
  • they're all multiples of 25
  • the last value is 25

In the code above I used the largest signed-32-bit multiple of 25 plus the powers of 25, which seems reasonable, though I have to admit that I'm not sure that it's optimal.

(BTW: if your compiler doesn't do constant folding -- which would be very surprising -- then you might want to replace the upper-limit of i with a hard-coded constant.)

Laurence Gonsalves
That is very similar to my initial answer but after a little thought I decided that you could end up looping up to 24 times in the inner loop for each outer loop.
Dipstick
Yes, that's true. I posted another answer which does exactly 27 iterations, and when unrolled does 27 comparisons and up to 27 subtractions, and works for all non-negative (signed 32-bit) inputs.
Laurence Gonsalves
+4  A: 

Here's another solution I came up with:

int mod25(int x){
  /* 25 * (all powers of 2 <= INT_MAX), descending */
  if (x >= 1677721600) x -= 1677721600;
  if (x >=  838860800) x -=  838860800;
  if (x >=  419430400) x -=  419430400;
  if (x >=  209715200) x -=  209715200;
  if (x >=  104857600) x -=  104857600;
  if (x >=   52428800) x -=   52428800;
  if (x >=   26214400) x -=   26214400;
  if (x >=   13107200) x -=   13107200;
  if (x >=    6553600) x -=    6553600;
  if (x >=    3276800) x -=    3276800;
  if (x >=    1638400) x -=    1638400;
  if (x >=     819200) x -=     819200;
  if (x >=     409600) x -=     409600;
  if (x >=     204800) x -=     204800;
  if (x >=     102400) x -=     102400;
  if (x >=      51200) x -=      51200;
  if (x >=      25600) x -=      25600;
  if (x >=      12800) x -=      12800;
  if (x >=       6400) x -=       6400;
  if (x >=       3200) x -=       3200;
  if (x >=       1600) x -=       1600;
  if (x >=        800) x -=        800;
  if (x >=        400) x -=        400;
  if (x >=        200) x -=        200;
  if (x >=        100) x -=        100;
  if (x >=         50) x -=         50;
  if (x >=         25) x -=         25;
  return x;
}

This doesn't use divides or multiplys, just 27 comparisons and a maximum of 27 subtractions.

It's a little hard to convince yourself that this works, but it does (at least for non-negative values of x).

The above code is really an unrolled version of this:

int mod25(int x){
  int divisor;
  for(int divisor = 1677721600; divisor >= 25; divisor >>= 1) {
    if (x >= divisor) x -= divisor;
  }
  return x;
}

By unrolling it we avoid doing the loop comparison and also the shifts at the expense of larger code. You could even partially unroll it using Duff's device if you felt so inclined, but with only 27 iterations total, and such a tiny bit of code per-iteration, I'd be inclined to just unroll it all the way.

Here's how it works: Every non-negative integer x can be expressed as (n * 25) + k where n is a non-negative integer and k is an integer from 0 to 24. k also happens to be the result we want, so if we could compute x - (n * 25) we'd get our answer. We want to be able to do this without knowing n up-front, though.

Think about n in binary. If we could turn off each of the 1 bits we'd get 0. One way to do this is to start at large powers of 2 and work our way down, subtracting each power of 2 only if the current value of n is greater than or equal to that power of 2.

Since we're dealing with (n * 25) we actually need descending powers of 2 times 25. Since k is strictly less than 25, and the smallest divisor we ever consider is 25, this works even when we're dealing with (n * 25) + k.

So each comparison + subtraction is zeroing out one bit of n, and at the end we're left with k, the remainder.

Laurence Gonsalves
A: 

Heres an Idea

static int table0[256];
static int table1[256];
static int table2[256];
static int table3[256];

// ran just once to initialize the tables
void initialMod25Tables() {
    for (int i = 0; i < 256; ++i) {
        table0[i] = i % 25;
    }
    for (int i = 0; i < 256; ++i) {
        table1[i] = (i << 8) % 25;
    }
    for (int i = 0; i < 256; ++i) {
        table2[i] = (i << 16) % 25;
    }
    for (int i = 0; i < 256; ++i) {
        table3[i] = (i << 24) % 25;
    }
}

int mod25(int x) {
    int y = table0[x & 0xFF];
    x >>= 8;
    y += table1[x & 0xFF];
    x >>= 8;
    y += table2[x & 0xFF];
    x >>= 8;
    y += table3[x & 0xFF];
    y = table0[y];
    return y;
}
clinux