+14  Q: 

Self numbers in C++

Hey, my friends and I are trying to beat each other's runtimes for generating "Self Numbers" between 1 and a million. I've written mine in C++ and I'm still trying to shave off precious time.

Here's what I have so far,

#include <iostream>

using namespace std;

bool v[1000064]; // headroom: i + its digit sum can land just past 999999
int main(void) {
  long non_self = 0;
  for(long i = 1; i < 1000000; ++i) {
    if(!(v[i])) std::cout << i << '\n';
    non_self = i + (i%10) + (i/10)%10 + (i/100)%10 + (i/1000)%10 + (i/10000)%10 +(i/100000)%10;
    v[non_self] = 1;
  }
  std::cout << "1000000" << '\n';
  return 0;
}

The code works fine now, I just want to optimize it. Any tips? Thanks.

+4  A: 

If you want your output to be fast, it may be worth replacing iostream output with plain old printf() - whether this matters depends on the rules for winning the competition.

anon
+1  A: 

For such a simple task, the best option is to think of alternative algorithms that produce the same result. % 10 is not usually considered a fast operation.

kauppi
I don't know what you work on, but the modulo operator is absolutely basic and works fast. 4M modulo operations + 4M assignments complete within 50 ms, and 4M assignments alone within 30 ms, on an 1800XP processor here.
Pasi Savolainen
+1  A: 

Why not use the recurrence relation given on the Wikipedia page instead? That should be blazingly fast.

EDIT: Ignore this .. the recurrence relation generates some, but not all, of the self numbers - in fact only very few of them. That's not particularly clear from the Wikipedia page, though :(

Michael Anderson
Because it doesn't generate all the self numbers?
High Performance Mark
The recurrence does not generate *all* self numbers. All the numbers it generates are self-numbers.
Alok
Yeah, a quick check shows that C1 = 9 but C2 = 97, and the gap increases every cycle, so it misses all the self numbers between 9 and 97, etc. No joy.
Alex Brown
+12  A: 

Generate the numbers once, copy the output into your code as a gigantic string. Print the string.

Jimmy
This is also the way to cheat the Project Euler "2 minute rule". Write a program which takes a week to run, but outputs a program that itself takes a fraction of a second to print the same result. After all, Project Euler has no rule that your toolchain must take less than 2 minutes to run, just your final program.
Steve Jessop
Maybe it's changed, but when I was active in Project Euler, it was a 1-minute rule, and a mere suggestion that if your program took more than that, then you were taking a bad approach to solving it.
Wallacoloo
Sorry, yes, one minute, it's a while since I did any problems either. And nobody at Project Euler actually cares or asks to look at your code, which probably isn't the case in this question. My point is just that in both cases it's obviously smart-alec cheating, which isn't to say that's not a good approach if you think you can blag it :-)
Steve Jessop
+12  A: 

Those mods (%) look expensive. If you are allowed to move to base 16 (or even base 2), then you can probably code this a lot faster. If you have to stay in decimal, try creating an array of digits for each place (units, tens, hundreds) and build some rollover code. That will make summing the digits far easier.


Alternatively, you could recognise the behaviour of the core self function (let's call it s):

s = n + f(b,n)

where f(b,n) is the sum of the digits of the number n in base b.

For base 10, it's clear that as the ones (least significant) digit moves through 0,1,2,...,9, n and f(b,n) proceed in lockstep as you move from n to n+1; it's only in the 10% of cases where 9 rolls over to 0 that they don't, so:

f(b,n+1) = f(b,n) + 1  // 90% of the time

thus the core self function s advances as

n+1 + f(b,n+1) = n + 1 + f(b,n) + 1 = n + f(b,n) + 2

s(n+1) = s(n) + 2 // again, 90% of the time

In the remaining (and easily identifiable) 10% of the time, the 9 rolls back to zero and adds one to the next digit, which in the simplest case subtracts (9-1) from the running total, but a cascade up through a series of 9s subtracts (18-1), (27-1), etc., as each extra trailing 9 takes another 9 off.

So the first optimisation can remove most of the work from 90% of your cycles!

if ((n % 10) != 0) 
{
  n + f(b,n) = n-1 + f(b,n-1) + 2;
}

or

if ((n % 10) != 0)
{
  s = old_s + 2;
}

That should be enough to substantially increase your performance without really changing your algorithm.

If you want more, then work out a simple algorithm for the change between iterations for the remaining 10%.

Alex Brown
+1, good work there!
Carl Smotricz
A: 

I wonder if multi-threading would help. This algorithm looks like it would lend itself well to multi-threading. (Poor-man's test of this: Create two copies of the program and run them at the same time. If it runs in less than 200% of the time, multi-threading may help).

Brian
This algorithm is ripe for a data parallel implementation. See my CUDA implementation in this post.
John Dibling
A: 

Maybe try just computing the recurrence relation defined below?

http://en.wikipedia.org/wiki/Self_number

miko
fwiw, this solution won't generate *all* self numbers, it will simply generate *valid* self numbers. _I didn't downvote you_, but you might consider doing more than linking in the future.
Mark E
+2  A: 

Multithread (use different arrays/ranges for every thread). Also, don't use more threads than your number of CPU cores =)

Viktor Sehr
each range partially depends on the flag being set from the ranges before it though. You'd have to do a bit of intelligent overlapping.
Jimmy
In the OP's algorithm, yes - and I must admit I never understood how that worked. I managed to do away with that range propagation requirement, so my solution would indeed be amenable to multithreading.
Carl Smotricz
+2  A: 

cout or printf within a loop will be slow. If you can move any prints out of a loop, you will see a significant performance increase.

Brian T Hannan
+23  A: 

I built an alternate C solution that doesn't require any modulo or division operations:

#include <stdio.h>
#include <string.h>

int main(int argc, char *argv[]) {
   int v[1100000];
   int j1, j2, j3, j4, j5, j6, s, n=0;
   memset(v, 0, sizeof(v));
   for (j6=0; j6<10; j6++) {
      for (j5=0; j5<10; j5++) {
         for (j4=0; j4<10; j4++) {
            for (j3=0; j3<10; j3++) {
               for (j2=0; j2<10; j2++) {
                  for (j1=0; j1<10; j1++) {
                     s = j6 + j5 + j4 + j3 + j2 + j1;
                     v[n + s] = 1;
                     n++;
                  }
               }
            }
         }
      }
   }
   for (n=1; n<=1000000; n++) {
      if (!v[n]) printf("%6d\n", n);
   }
}

It generates 97786 self numbers including 1 and 1000000.
With output, it takes

real        0m1.419s
user        0m0.060s
sys         0m0.152s

When I redirect output to /dev/null, it takes

real     0m0.030s
user     0m0.024s
sys      0m0.004s

on my 3 Ghz quad core rig.

For comparison, your version produces the same number of numbers, so I assume we're either both correct or equally wrong; but your version chews up

real    0m0.064s
user    0m0.060s
sys     0m0.000s

under the same conditions, or about 2x as much.

That may be down to the algorithm, or to the fact that you're using longs, which is unnecessary on my machine: here, int goes up to 2 billion. Maybe you should check INT_MAX on yours?

Update

I had a hunch that it may be better to calculate the sum piecewise. Here's my new code:

#include <stdio.h>
#include <string.h>

int main(int argc, char *argv[]) {
   char v[1100000];
   int j1, j2, j3, j4, j5, j6, s, n=0;
   int s1, s2, s3, s4, s5;
   memset(v, 0, sizeof(v));
   for (j6=0; j6<10; j6++) {
      for (j5=0; j5<10; j5++) {
         s5 = j6 + j5;
         for (j4=0; j4<10; j4++) {
            s4 = s5 + j4;
            for (j3=0; j3<10; j3++) {
               s3 = s4 + j3;
               for (j2=0; j2<10; j2++) {
                  s2 = s3 + j2;
                  for (j1=0; j1<10; j1++) {
                     v[s2 + j1 + n++] = 1;
                  }
               }
            }
         }
      }
   }
   for (n=1; n<=1000000; n++) {
      if (!v[n]) printf("%d\n", n);
   }
}

...and what do you know, that brought down the time for the top loop from 12 ms to 4 ms. Or maybe 8; my clock seems to be getting a bit jittery way down there.

State of affairs, Summary

The actual finding of self numbers up to 1M is now taking roughly 4 ms, and I'm having trouble measuring any further improvements. On the other hand, as long as output is to the console, it will continue to take about 1.4 seconds, my best efforts to leverage buffering notwithstanding. The I/O time so drastically dwarfs computation time that any further optimization would be essentially futile. Thus, although inspired by further comments, I've decided to leave well enough alone.

All times cited are on my (pretty fast) machine and are for comparison purposes with each other only. Your mileage may vary.

Carl Smotricz
Very nicely done.
John Dibling
long and int are both 32 bit on my machine, but you had the same idea I had. I also thought that it may or may not be quicker to build up a single huge string and then print it in one go, rather than doing all those little 7-or-less byte writes, which may or may not be efficiently buffered to stdout.
Steve Jessop
Oh yes, and if significant time is being spent in the first phase, then adding together 6 numbers in the innermost loop might be expensive. You'd have to look at the disassembly, but e.g. x86 is not swimming in registers. Might be quicker to initialise s to 0, increment it by 1 in the innermost loop, and subtract 9 from it at the end of each outer loop. Also, if I wasn't about to go to bed I'd try fully unrolling the innermost loop, so get rid of j1 and do `int total = n + s;` followed by 10 copies of `v[total] = 1; total += 2;`
Steve Jessop
@Steve Jessop: I commented out the bottom loop and user time dropped to about 50% from 24 ms to 12.
Carl Smotricz
So about 50-50 calculating and printing (to /dev/null), before the additional optimisations took off 4-8ms. Next step is to look at the output, then. If the challenge is wall clock time to print the results, then I/O is the only thing that ever mattered, and the way to win is to find the fastest terminal (yawn). If the challenge is CPU time to build the array, then you're finished since you've hit the granularity of your timer :-) So at this point the question has to be nailed down a bit more.
Steve Jessop
Another thought - by looping downwards instead of upwards, you can avoid the initial memset: set v[n] to 0 and v[n+s] to 1. Then the calculation is only making a single pass over the array, which might help the cache a bit.
Steve Jessop
At least on my Linux system, default buffering for stdout is line buffered (= IO on each \n). So I had high hopes on `setvbuf()` with a 1 MB buffer. But the effect was negligible, alas. I can slow the program down a lot, though, if I set it to "no buffering" (unsurprisingly; I was just checking if I was having *any* effect).
Carl Smotricz
At the cost of perhaps getting incorrect results, I commented `memset` out. No discernible difference. But you gave me an idea: I changed the type of `v` to `char` and now user time won't budge from 4 ms. Even if I reintroduce `memset`.
Carl Smotricz
Keep the memset. You can get even more speed from GCC 4 and later by doing a build with -fprofile-generate, run the program and then compile with -fprofile-use. The for-pay versions of Microsoft C++ have the same ability.
Zan Lynx
So: I can't speed up IO (the real bottleneck) but I brought the computation down to the granularity of my timer. I think I should call it a night now.
Carl Smotricz
@Zan Lynx: I'm not interested in speeding up the actual execution, as I'm trying to tune the algorithms. All timing is on a level field, same machine, no compiler options, so I can see my improvements. In fact, I almost wish I could slow the sucker down so I could get more meaningful timing results!
Carl Smotricz
On my Itanium2 g++ -O3 -fprofile-use unrolled a huge amount of code and used almost 30 registers.
Zan Lynx
@Carl: If you need it to run longer for timing reasons give it an outer loop to rerun the calculation several times. I'd make the outer loop depend on a command-line argument so it does not get optimized away.
Zan Lynx
@Zan Lynx: The "mission" is to get this program to run faster on the OP's system than his competitors run theirs. So we could advise him to turn on the tuning options for whatever compiler he may have. In any event, I was about to call it a night, he's already quite gotten his money's worth of assistance in cheating his buddies :)
Carl Smotricz
+1, nice solution. Has the same useful effect as mine, without all the hand waving - the solution looks like the problem.
Alex Brown
Thank you! I must admit that, as you started to refine your solution I stopped understanding it. I'm glad I was able to arrive at a useful solution by intuition, trial and error (with help from commenters, thanks!).
Carl Smotricz
that's a shame, I was refining it to make it easier to understand. FAIL.
Alex Brown
The failure is all mine; I sometimes have trouble concentrating on too much detail at once, and my eyes glaze over.
Carl Smotricz
Don't forget `std::vector<bool>`. Reducing that buffer by 8x might help the cache to work better. You can even practically find the numbers up to a billion. Edit: or its lesser-known cousin `std::bitset<N>`
Potatoswatter
… Tried it out. After 26 seconds, found these self numbers: 999999904, 999999915, 999999926, 999999937, 999999948, 999999959, 999999970, 999999972, 999999983, 999999994
Potatoswatter
No need to resort to C++ for that, even. I'd set up an array of 200M `short`s, unroll the inner loop and `OR` in constants from 2^0 to 2^9 into the element "responsible" for a set of 10 numbers. While being a little less efficient with memory, this would save me the re-allocations a dynamic data structure would undergo, plus all of its slightly more expensive bit-twiddling logic.
Carl Smotricz
+1  A: 

This may help speed up C++ iostreams output:

std::cin.tie(0);
std::ios::sync_with_stdio(false);

Put them in main before you start writing to cout.

Zan Lynx
+2  A: 

Since the range is limited (1 to 1000000) the maximum sum of the digits does not exceed 9*6 = 54. This means that to implement the sieve a circular buffer of 54 elements should be perfectly sufficient (and the size of the sieve grows very slowly as the range increases).

You already have a sieve-based solution, but it is based on pre-building the full-length buffer (sieve of 1000000 elements), which is rather inelegant (if not completely unacceptable). The performance of your solution also suffers from non-locality of memory access.

For example, this is a possible very simple implementation

#define N 1000000U

void print_self_numbers(void)
{
  #define NMARKS 64U /* make it 64 just in case (and to make division work faster :) */

  unsigned char marks[NMARKS] = { 0 };
  unsigned i, imark;

  for (i = 1, imark = i; i <= N; ++i, imark = (imark + 1) % NMARKS)
  {
    unsigned digits, sum;

    if (!marks[imark])
      printf("%u ", i);
    else
      marks[imark] = 0;

    sum = i;
    for (digits = i; digits > 0; digits /= 10)
      sum += digits % 10;

    marks[sum % NMARKS] = 1;
  }
}

(I'm not going for the best possible performance in terms of CPU clocks here, just illustrating the key idea with the circular buffer.)

Of course, the range can be easily turned into a parameter of the function, while the size of the circular buffer can be easily calculated at run-time from the range.

As for "optimizations"... There's no point in trying to optimize code that contains I/O operations; you won't achieve anything that way. If you want to analyze the performance of the algorithm itself, you'll have to put the generated numbers into an output array and print them later.

AndreyT
A: 

I was actually surprised that the code below was faster than any other posted here. I probably measured it wrong, but maybe it helps; or at least it's interesting.

#include <iostream>
#include <algorithm>
#include <cstring>
#include <boost/progress.hpp>

class SelfCalc
{
private:
    bool    array[1000000];
    int     non_self;

public:
    SelfCalc()
    {
        memset(&array, 0, sizeof(array));
    }

    void operator()(const int i)
    {
        if (!(array[i]))
            std::cout << i << '\n';

        non_self = i + (i%10) + (i/10)%10 + (i/100)%10 + (i/1000)%10 + (i/10000)%10 +(i/100000)%10;
        array[non_self] = true;
    }
};

class IntIterator
{
private:
    int value;

public: 
    IntIterator(const int _value):value(_value){}

    int operator*(){ return value; } 
    bool operator!=(const IntIterator &v){ return value != v.value; }
    int operator++(){ return ++value; }
};

int main()
{
    boost::progress_timer t;

    SelfCalc selfCalc;
    IntIterator i(1), end(100000);

    std::for_each(i, end, selfCalc);

    std::cout << 100000 << std::endl;
    return 0;
}
Gianni
A: 

Fun problem. The problem as stated does not specify what base it must be in. I fiddled around with it some and wrote a base-2 version. It generates an extra few thousand entries because the termination point of 1,000,000 is not as natural with base-2. This pre-counts the number of bits in a byte for a table lookup. The generation of the result set (without the I/O) took 2.4 ms.

One interesting thing (assuming I wrote it correctly) is that the base-2 version has about 250,000 "self numbers" up to 1,000,000 while there are just under 100,000 base-10 self numbers in that range.

#include <windows.h>
#include <stdio.h>
#include <string.h>

void StartTimer( _int64 *pt1 )
{
   QueryPerformanceCounter( (LARGE_INTEGER*)pt1 );
}

double StopTimer( _int64 t1 )
{
   _int64 t2, ldFreq;

   QueryPerformanceCounter( (LARGE_INTEGER*)&t2 );
   QueryPerformanceFrequency( (LARGE_INTEGER*)&ldFreq );
   return ((double)( t2 - t1 ) / (double)ldFreq) * 1000.0;
}

#define RANGE 1000000

char sn[0x100000 + 32];

int bitCount[256];

 // precompute bitcounts for each byte
void PreCountBits()
{
    int i;

    // generate count of bits in each byte
    memset( bitCount, 0, sizeof( bitCount ));
    for ( i = 0; i < 256; i++ )
        {
        int tmp = i;

        while ( tmp )
            {
            if ( tmp & 0x01 )
                bitCount[i]++;
            tmp >>= 1;
            }
        }
}


void GenBase2( )
{
    int i;
    int *b1, *b2, *b3;
    int b1sum, b2sum, b3sum;

    i = 0;
    for ( b1 = bitCount; b1 < bitCount + 256; b1++ )
        {
        b1sum = *b1;
        for ( b2 = bitCount; b2 < bitCount + 256; b2++ )
            {
            b2sum = b1sum + *b2;
            for ( b3 = bitCount; b3 < bitCount + 256; b3++ )
                {
                sn[i++ + *b3 + b2sum] = 1;
                }
            }

        // 1000000 does not provide a great termination number for base 2.  So check
        // here.  Overshoots the target some but avoids repeated checks
        if ( i > RANGE )
            return;
        }
}


int main( int argc, char* argv[] )
{
    int i = 0;
    __int64 t1;


    memset( sn, 0, sizeof( sn ));
    StartTimer( &t1 );
    PreCountBits();
    GenBase2();
    printf( "Generation time = %.3f\n", StopTimer( t1 ));

    #if 1
    for ( i = 1; i <= RANGE; i++ )
        if ( !sn[i] ) printf( "%d\n", i );
    #endif
    return 0;
}
Mark Wilkins
+1  A: 

I created a CUDA-based solution based on Carl Smotricz's second algorithm. The code to identify Self Numbers itself is extremely fast -- on my machine it executes in ~45 nanoseconds; this is about 150 x faster than Carl Smotricz's algorithm, which ran in 7 milliseconds on my machine.

There is a bottleneck, however, and that seems to be the PCIe interface. It took my code a whopping 43 milliseconds to move the computed data from the graphics card back to RAM. This might be optimizable, and I will look into this.

Still, 45 nanoseconds is pretty darn fast. Scary fast, actually, so I added code to my program which runs Carl Smotricz's algorithm and compares the results for accuracy. The results are accurate. Here is the program output (compiled in VS2008 64-bit, Windows 7):

UPDATE

I recompiled this code in release mode with full optimization and using static runtime libraries, with significant results. The optimizer seems to have done very well with Carl's algorithm, reducing the runtime from 7 ms to 1 ms. The CUDA implementation sped up as well, from 35 us to 20 us. The memory copy from video card to RAM was unaffected.

Program Output:

Running on device: 'Quadro NVS 295'
Reference Implementation Ran In 15603 ticks (7 ms)
Kernel Executed in 40 ms -- Breakdown:
  [kernel] : 35 us (0.09%)
  [memcpy] : 40 ms (99.91%)
CUDA Implementation Ran In 111889 ticks (51 ms)
Compute Slots: 1000448 (1954 blocks X 512 threads)
Number of Errors: 0

The code is as follows:

file : main.h

#pragma once

#include <cstdlib>
#include <cstdio>
#include <cmath>
#include <string>
#include <utility>
#include <functional>

typedef std::pair<int*, size_t> sized_ptr;
static sized_ptr make_sized_ptr(int* ptr, size_t size)
{
    return std::make_pair(ptr, size);
}

__host__ void ComputeSelfNumbers(sized_ptr hostMem, sized_ptr deviceMemory, unsigned const blocks, unsigned const threads);

inline std::string format_elapsed(double d) 
{
    char buf[256] = {0};

    if( d < 0.00000001 )
    {
        // show in ps with 4 digits
        sprintf(buf, "%0.4f ps", d * 1000000000000.0);
    }
    else if( d < 0.00001 )
    {
        // show in ns
        sprintf(buf, "%0.0f ns", d * 1000000000.0);
    }
    else if( d < 0.001 )
    {
        // show in us
        sprintf(buf, "%0.0f us", d * 1000000.0);
    }
    else if( d < 0.1 )
    {
        // show in ms
        sprintf(buf, "%0.0f ms", d * 1000.0);
    }
    else if( d <= 60.0 )
    {
        // show in seconds
        sprintf(buf, "%0.2f s", d);
    }
    else if( d < 3600.0 )
    {
        // show in min:sec
        sprintf(buf, "%01.0f:%02.2f", floor(d/60.0), fmod(d,60.0));
    }
    // show in h:min:sec
    else 
        sprintf(buf, "%01.0f:%02.0f:%02.2f", floor(d/3600.0), floor(fmod(d,3600.0)/60.0), fmod(d,60.0));

    return buf;
}

inline std::string format_pct(double d)
{
    char buf[256] = {0};
    sprintf(buf, "%.2f", 100.0 * d);
    return buf;
}

file: main.cpp

#define _CRT_SECURE_NO_WARNINGS 

#include <windows.h>
#include "C:\CUDA\include\cuda_runtime.h"
#include <cstdlib>
#include <iostream>
#include <string>
using namespace std;
#include <cmath>
#include <map>
#include <algorithm>
#include <list>

#include "main.h"

int main()
{
    unsigned numVals = 1000000;
    int* gold = new int[numVals];
    memset(gold, 0, sizeof(int)*numVals);

    LARGE_INTEGER li = {0}, li2 = {0};
    QueryPerformanceFrequency(&li);
    __int64 freq = li.QuadPart;

    // get cuda properties...
    cudaDeviceProp cdp = {0};
    cudaError_t err = cudaGetDeviceProperties(&cdp, 0);
    cout << "Running on device: '" << cdp.name << "'" << endl;

    // first run the reference implementation
    QueryPerformanceCounter(&li);
    for( int j6=0, n = 0; j6<10; j6++ ) 
    {
        for( int j5=0; j5<10; j5++ ) 
        {
            for( int j4=0; j4<10; j4++ ) 
            {
                for( int j3=0; j3<10; j3++ ) 
                {
                    for( int j2=0; j2<10; j2++ ) 
                    {
                        for( int j1=0; j1<10; j1++ )  
                        {
                            int s = j6 + j5 + j4 + j3 + j2 + j1;
                            gold[n + s] = 1;
                            n++;
                        }
                    }
                }
            }
        }
    }
    QueryPerformanceCounter(&li2);
    __int64 ticks = li2.QuadPart-li.QuadPart;
    cout << "Reference Implementation Ran In " << ticks << " ticks" << " (" << format_elapsed((double)ticks/(double)freq) << ")" << endl;

    // now run the cuda version...
    unsigned threads = cdp.maxThreadsPerBlock;
    unsigned blocks = numVals/threads;
    if( numVals%threads ) ++blocks;
    unsigned computeSlots = blocks * threads;   // this may be != the number of vals since we want 32-thread warps

    // allocate device memory for test
    int* deviceTest = 0;
    err = cudaMalloc(&deviceTest, sizeof(int)*computeSlots);
    err = cudaMemset(deviceTest, 0, sizeof(int)*computeSlots);

    int* hostTest = new int[numVals];   // the repository for the resulting data on the host
    memset(hostTest, 0, sizeof(int)*numVals);

    // run the CUDA code...
    LARGE_INTEGER li3 = {0}, li4={0};
    QueryPerformanceCounter(&li3);
    ComputeSelfNumbers(make_sized_ptr(hostTest, numVals), make_sized_ptr(deviceTest, computeSlots), blocks, threads);
    QueryPerformanceCounter(&li4);

    __int64 ticksCuda = li4.QuadPart-li3.QuadPart;
    cout << "CUDA Implementation Ran In " << ticksCuda << " ticks" << " (" << format_elapsed((double)ticksCuda/(double)freq) << ")" << endl;
    cout << "Compute Slots: " << computeSlots << " (" << blocks << " blocks X " << threads << " threads)" << endl;


    unsigned errorCount = 0;
    for( size_t i = 0; i < numVals; ++i )
    {
        if( gold[i] != hostTest[i] )
        {
            ++errorCount;
        }
    }

    cout << "Number of Errors: " << errorCount << endl;

    return 0;
}

file: self.cu

#pragma warning( disable : 4231)
#include <windows.h>
#include <cstdlib>
#include <vector>
#include <iostream>
#include <string>
#include <iomanip>
using namespace std;
#include "main.h"

__global__ void SelfNum(int * slots)
{
    __shared__ int N;
    N = (blockIdx.x * blockDim.x) + threadIdx.x;

    const int numDigits = 10;

    __shared__ int digits[numDigits];
    for( int i = 0, temp = N; i < numDigits; ++i, temp /= 10 )
    {
        digits[numDigits-i-1] = temp - 10 * (temp/10)      /*temp % 10*/;
    }

    __shared__ int s;
    s = 0;
    for( int i = 0; i < numDigits; ++i )
        s += digits[i];

    slots[N+s] = 1;
}

__host__ void ComputeSelfNumbers(sized_ptr hostMem, sized_ptr deviceMem, const unsigned  blocks, const unsigned threads)
{
    LARGE_INTEGER li = {0};
    QueryPerformanceFrequency(&li);
    double freq = (double)li.QuadPart;

    LARGE_INTEGER liStart = {0};
    QueryPerformanceCounter(&liStart);

    // run the kernel
    SelfNum<<<blocks, threads>>>(deviceMem.first);
    LARGE_INTEGER liKernel = {0};
    QueryPerformanceCounter(&liKernel);

    cudaMemcpy(hostMem.first, deviceMem.first, hostMem.second*sizeof(int), cudaMemcpyDeviceToHost); // dont copy the overflow - just throw it away
    LARGE_INTEGER liMemcpy = {0};
    QueryPerformanceCounter(&liMemcpy);

    // display performance stats
    double e = double(liMemcpy.QuadPart - liStart.QuadPart)/freq,
        eKernel = double(liKernel.QuadPart - liStart.QuadPart)/freq,
        eMemcpy = double(liMemcpy.QuadPart - liKernel.QuadPart)/freq;

    double pKernel = eKernel/e,
        pMemcpy = eMemcpy/e;

    cout << "Kernel Executed in " << format_elapsed(e) << " -- Breakdown: " << endl
        << "  [kernel] : " << format_elapsed(eKernel) << " (" << format_pct(pKernel) << "%)" << endl
        << "  [memcpy] : " << format_elapsed(eMemcpy) << " (" << format_pct(pMemcpy) << "%)" << endl;



}

UPDATE2:

I refactored my CUDA implementation to try to speed it up a bit. I did this by unrolling loops manually, fixing some questionable use of __shared__ memory which might have been an error, and getting rid of some redundancy.

The output of my new kernel is:

Reference Implementation Ran In 69610 ticks (5 ms)
Kernel Executed in 2 ms -- Breakdown:
  [kernel] : 39 us (1.57%)
  [memcpy] : 2 ms (98.43%)
CUDA Implementation Ran In 62970 ticks (4 ms)
Compute Slots: 1000448 (1954 blocks X 512 threads)
Number of Errors: 0

The only code I changed is the kernel itself, so that's all I will post here:

__global__ void SelfNum(int * slots)
{
    int N = (blockIdx.x * blockDim.x) + threadIdx.x;

    int s = 0;

    int temp = N;
    s += temp - 10 * (temp/10)      /*temp % 10*/;
    s += temp - 10 * ((temp/=10)/10)      /*temp % 10*/;
    s += temp - 10 * ((temp/=10)/10)      /*temp % 10*/;
    s += temp - 10 * ((temp/=10)/10)      /*temp % 10*/;
    s += temp - 10 * ((temp/=10)/10)      /*temp % 10*/;
    s += temp - 10 * ((temp/=10)/10)      /*temp % 10*/;
    s += temp - 10 * ((temp/=10)/10)      /*temp % 10*/;
    s += temp - 10 * ((temp/=10)/10)      /*temp % 10*/;
    s += temp - 10 * ((temp/=10)/10)      /*temp % 10*/;
    s += temp - 10 * ((temp/=10)/10)      /*temp % 10*/;

    slots[N+s] = 1;
}
John Dibling