// Brute force: return the first n >= 1 that both a and b divide.
int lcm_old(int a, int b) {
    int n;
    for (n = 1; ; n++)
        if (n % a == 0 && n % b == 0)
            return n;
}

int lcm(int a, int b) {
    int n = 0;
    __asm {
lstart:
        inc n               ; try the next candidate
        mov eax, n
        mov edx, 0
        idiv a              ; edx = n % a
        mov eax, 0
        cmp eax, edx
        jne lstart          ; remainder nonzero: not a multiple of a
        mov eax, n
        mov edx, 0
        idiv b              ; edx = n % b
        mov eax, 0
        cmp eax, edx
        jnz lstart          ; remainder nonzero: not a multiple of b
    }
    return n;
}

I'm trying to beat (or at least match) the compiler's code for the top function with my own hand-written routine (bottom). Have you got any ideas how I can optimize my routine?

PS. This is just for fun.

+13  A: 

I would optimize by using a different algorithm. Searching linearly, as you are doing, is extremely slow. The least common multiple of two natural numbers is their product divided by their greatest common divisor, and you can compute the greatest common divisor quickly using the Euclidean algorithm.

Thus:

int lcm(int a, int b) {
    int p = a * b;
    return p / gcd(a, b);
}

where you need to implement gcd(int, int). Since the Euclidean algorithm takes O(log min(a, b)) steps, this beats the naive linear search hands down.
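
For reference, here is one way gcd(int, int) might look -- a minimal iterative sketch of the Euclidean algorithm, not code from the original answer:

int gcd(int a, int b) {
    /* Euclidean algorithm: repeatedly replace (a, b) with
     * (b, a mod b) until the remainder is zero. */
    while (b != 0) {
        int r = a % b;
        a = b;
        b = r;
    }
    return a;
}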

There are other approaches to this problem. If you had an algorithm that could factor integers quickly (say, on a quantum computer), you could also solve the problem as follows. Write each of a and b in its canonical prime factorization:

a = p_a0^e_a0 * p_a1^e_a1 * ... * p_am^e_am
b = p_b0^e_b0 * p_b1^e_b1 * ... * p_bn^e_bn

then the least common multiple of a and b is obtained by taking every prime factor that appears in at least one of the two factorizations, raised to the maximum exponent with which it appears in either. For example:

28 = 2^2 * 7
312 = 2^3 * 3 * 13

so that

lcm(28, 312) = 2^3 * 3 * 7 * 13 = 2184
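
To make the idea concrete, here is an illustrative sketch using naive trial division (the helper name lcm_factor is mine; this is only sensible for small inputs, which is exactly the point about needing a fast factoring method):

/* Illustrative only: lcm via prime factorizations, using naive
 * trial division. Each prime p contributes max(e_a, e_b) factors,
 * which the inner loop accumulates one power at a time. */
int lcm_factor(int a, int b) {
    int result = 1;
    for (int p = 2; a > 1 || b > 1; p++) {
        while (a % p == 0 || b % p == 0) {
            result *= p;            /* one more power of p in the lcm */
            if (a % p == 0) a /= p; /* strip p from a if present */
            if (b % p == 0) b /= p; /* strip p from b if present */
        }
    }
    return result;  /* e.g. lcm_factor(28, 312) == 2184 */
}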

All of this is to point out that naive approaches are admirable in their simplicity but you can spend all day optimizing every last nanosecond out of them and still not beat a superior algorithm.

Jason
+1 And in fact, once you implement the Euclidean algorithm, it's very, very unlikely that hand-written assembly will beat what the compiler generates, unless you compile with optimisations turned off. :-P
Chris Jester-Young
Always good to see a high-level optimization beat a low-level one hands down. (Which isn't to say that low-level optimizations aren't good fun.)
Toad
+1 Nice answer!
John Gietzen
@chris: It's a myth that compilers produce better assembly. Humans will always win, especially for specialized little functions. Of course, it takes them much longer to create it.
Toad
@reinier: If the little functions map very cleanly into a higher-level language, there is no reason why the compiler can't make optimal code. It's where the mapping is imperfect that humans have a chance with custom assembly. :-P
Chris Jester-Young
@reinier: depending on the circumstances, one could equate "it takes much longer to create it" with "it is very, very unlikely".
RaphaelSP
@raphael: with 'it takes longer' I meant: when I click the compile button, the compiler creates and optimizes tons of code in the blink of an eye. Manually creating hand-optimized assembly for just one routine might take minutes if not hours. But in some cases it's worth it, if every cycle counts and the function is used a lot (in game development this happens often).
Toad
@RaphaelSP: If you're saying it's unlikely that anybody's going to bother with hand-optimizing most routines, you're absolutely right. If you're saying it's very unlikely a person could produce better code than a compiler for most routines, then you're absolutely wrong.
Jerry Coffin
Better than the current solution, and more common, is int lcm(int a, int b) { return (a / gcd(a, b)) * b; }, since this avoids overflow problems when a*b is larger than INT_MAX but the result is not.
Accipitridae
@reinier and @jerry: obviously, I meant Jerry's #1. Does it happen a lot in game development? I'll take your word for it. In high-performance image processing? I'll take mine: we've switched to compiler intrinsics and rarely do we regret it.
RaphaelSP
@RaphaelSP: Image processing is a very good example where certain hand-coded assembly outperforms a C or C++ routine. The best candidates for assembly are small loops that run so many times that any small improvement yields a measurable overall speedup. The investment of rolling your own assembly makes sense then.
Toad
@RaphaelSP: For image processing, I imagine you should take a look at GPU libraries as well, as they can increase performance by a factor of ten (without resorting to assembly).
Toad
@reinier: this discussion is getting deliciously out of scope... Heh. Image processing is a very good example where a hobbyist or a student can take much time to hand-craft bits of assembly for a specific platform. It is also an excellent example where someone who does it for a living learns instead to write C code that will compile as closely as possible to the optimal assembly sequence, as s/he may not have the resources to spend the time needed to switch to assembly. As for GPU libraries (by the way, our shop also develops one), they are very efficient...
RaphaelSP
... for a specific class of problems. However, not all image processing is embarrassingly parallel (recursive filters are what I'm thinking of right now), and neither are all GPUs able to outperform a CPU in every situation (my ATI FireGL, for instance, is not exactly what I'd call a race horse).
RaphaelSP
Oh well, I guess that my whole point is that in the optimization world, until someone finds a way to compress time, we'll always end up saying "this is good enough" (especially when we want the code to be readable a few months later).
RaphaelSP
+4  A: 

I'm going to assume you want to keep the same algorithm. This should at least be a slightly more efficient implementation of it. The main difference is that the code in the loop only uses registers, not memory.

int lcm(int a, int b) {
    __asm {
        xor ecx, ecx        ; ecx = candidate value, starts at 0
        mov esi, a
        mov edi, b
lstart:
        inc ecx
        mov eax, ecx
        xor edx, edx        ; clear high half for the division
        idiv esi            ; edx = candidate % a
        test edx, edx
        jne lstart          ; remainder nonzero: not a multiple of a
        mov eax, ecx        ; edx is already 0 here (remainder was 0)
        idiv edi            ; edx = candidate % b
        test edx, edx
        jnz lstart          ; remainder nonzero: not a multiple of b
        mov eax, ecx        ; result returned in eax
        leave
        ret
    }
}

As Jason pointed out, however, this really isn't a very efficient algorithm -- multiplying, finding the GCD, and dividing will normally be faster (unless a and b are quite small).

Edit: there is another algorithm that's almost as simple to understand, and should also be a lot faster (than the original -- not than multiplying and then dividing by the GCD). Instead of generating consecutive integers until you find one that is divisible by both a and b, generate consecutive multiples of one of them (preferably the larger) until you find one that is also divisible by the other:

int lcm2(int a, int b) {
    __asm {
        xor ecx, ecx        ; ecx = current multiple of a, starts at 0
        mov esi, a
        mov edi, b
    lstart:
        add ecx, esi        ; advance to the next multiple of a
        mov eax, ecx
        xor edx, edx        ; clear high half for the division
        idiv edi            ; edx = multiple % b
        test edx, edx
        jnz lstart          ; remainder nonzero: not a multiple of b
        mov eax, ecx        ; result returned in eax
        leave
        ret
    }
}

This remains dead simple to understand, but should give a considerable improvement over the original.
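
For comparison, the same search might be written in plain C like so (my sketch, not part of the original answer; the name lcm2_c is hypothetical, and a decent compiler's output for it should be close to the assembly above):

/* Step through multiples of a until one is also divisible by b.
 * Assumes positive inputs; illustrative sketch only. */
int lcm2_c(int a, int b) {
    int m = a;
    while (m % b != 0)
        m += a;     /* advance to the next multiple of a */
    return m;
}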

Jerry Coffin