views: 2180
answers: 16

I recently wrote a short program to calculate happy numbers in Python. The program allows you to pick an upper bound, and it determines all the happy numbers below it. For a speed comparison, I made the most direct translation of the algorithm I could from Python to C++.

Surprisingly, the C++ version runs significantly slower than the Python version. Timing the discovery of the first 10,000 happy numbers shows the Python program running in 0.59 seconds on average and the C++ version in 8.5 seconds on average.

I would attribute the speed difference to the fact that in the C++ version I had to write helper functions for parts of the calculation (for example, determining whether an element is in a list/array/vector) that are already built into Python.

Firstly, is this the true reason for such an absurd speed difference, and secondly, how can I change the C++ version so that it executes more quickly than the Python version (the way it should be, in my opinion)?

The two pieces of code, with speed testing, are below: Python Version, C++ Version. Thanks for the help.

#include <iostream>
#include <vector>
#include <string>
#include <ctime>
#include <windows.h>

using namespace std;

bool inVector(int inQuestion, vector<int> known);
int sum(vector<int> given);
int pow(int given, int power);
void calcMain(int upperBound);

int main()
{
    while(true)
    {
     int upperBound;
     cout << "Pick an upper bound: ";
     cin >> upperBound;
     long start, end;
     start = GetTickCount();
     calcMain(upperBound);
     end = GetTickCount();
     double seconds = (double)(end-start) / 1000.0;
     cout << seconds << " seconds." << endl << endl;
    }
    return 0;
}

void calcMain(int upperBound)
{
    vector<int> known;
    for(int i = 0; i <= upperBound; i++)
    {
     bool next = false;
     int current = i;
     vector<int> history;
     while(!next)
     {
      char* buffer = new char[10];
      itoa(current, buffer, 10);
      string digits = buffer;
      delete[] buffer;
      vector<int> squares;
      for(int j = 0; j < digits.size(); j++)
      {
       char charDigit = digits[j];
       int digit = atoi(&charDigit);
       int square = pow(digit, 2);
       squares.push_back(square);
      }
      int squaresum = sum(squares);
      current = squaresum;
      if(inVector(current, history))
      {
       next = true;
       if(current == 1)
       {
        known.push_back(i);
        //cout << i << "\t";
       }
      }
      history.push_back(current);
     }
    }
    //cout << "\n\n";
}

bool inVector(int inQuestion, vector<int> known)
{
    for(vector<int>::iterator it = known.begin(); it != known.end(); it++)
     if(*it == inQuestion)
      return true;
    return false;
}

int sum(vector<int> given)
{
    int sum = 0;
    for(vector<int>::iterator it = given.begin(); it != given.end(); it++)
     sum += *it;
    return sum;
}

int pow(int given, int power)
{
    int original = given;
    int current = given;
    for(int i = 0; i < power-1; i++)
     current *= original;
    return current;
}


#!/usr/bin/env python

import timeit

upperBound = 0

def calcMain():
    known = []
    for i in range(0,upperBound+1):
        next = False
        current = i
        history = []
        while not next:
            digits = str(current)
            squares = [pow(int(digit), 2) for digit in digits]
            squaresum = sum(squares)
            current = squaresum
            if current in history:
                next = True
                if current == 1:
                    known.append(i)
                    ##print i, "\t",
            history.append(current)
    ##print "\nend"

while True:    
    upperBound = input("Pick an upper bound: ")
    result = timeit.Timer(calcMain).timeit(1)
    print result, "seconds.\n"
+1  A: 

I am not an expert at C++ optimization, but I believe the speed difference may be due to the fact that Python lists preallocate extra space up front, while your C++ vectors must reallocate, and possibly copy, every time they grow.

As for GMan's comment about find, I believe that the Python "in" operator is also a linear search and is the same speed.

Edit

Also, I just noticed that you rolled your own pow function. There is no need to do that; the standard library version is likely faster.
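On the reallocation point above: if that really were the bottleneck, reserving capacity up front would be the cheap fix. A minimal sketch (the capacity of 1000 is an arbitrary value for illustration, not something measured against the question's workload):

#include <vector>

int main()
{
    std::vector<int> history;
    history.reserve(1000); // one allocation up front; push_back will not
                           // reallocate until this capacity is exceeded
    for (int i = 0; i < 1000; ++i)
        history.push_back(i * i);
    return 0;
}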

Unknown
I believe vector sizes grow by powers of two when they reallocate.
David
He probably rolled his own to keep it in integers; the standard pow() returns a double.
@David: std::vector optimizations are left to the implementation. The standard doesn't specify any performance guarantees on resizing (and powers of two is not usually the optimization used - it's usually a multiple of the platform page size).
Nick Bastin
@Nick: Platform page sizes are usually a power of two, not to say they are always a power of two. Regarding pow returning double: int i = (int)pow(...) solves the problem. Even converting int to double and back to int is probably still faster, as pow is done in hardware.
Mecki
+15  A: 
John
Note that he doesn't need those functions at all.
Potatoswatter
He was probably trying to make the Python and C++ code look as similar as possible.
Could be...which means it's useless for a performance comparison. It's not surprising that Python could be a better Python than C++ is.
David Thornley
+6  A: 

I can see that you have quite a few heap allocations that are unnecessary

For example:

while(!next)
     {
      char* buffer = new char[10];

This doesn't look very optimized. You probably want to pre-allocate the array and reuse it inside your loop. This is a basic optimization technique that is easy to spot and to do. It can turn into a mess, though, so be careful with it.

You are also using the atoi() function, which I don't know to be particularly optimized. Taking the number modulo 10 to get each digit might be better (you have to measure, though; I didn't test this).

The fact that you have a linear search (inVector) might be bad. Replacing the vector data structure with a std::set might speed things up. A hash_set could do the trick too.

But I think that the worst problem is the string handling and the heap allocation inside that loop. That doesn't look good. I would try those places first.
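To illustrate those last two points, here is a rough sketch of the digit loop with the heap allocation and the string/atoi round trip removed (squareOfDigitsSum is a name made up for the sketch; in the question's loop it would replace everything from the new char[10] down to the sum() call):

// Sum of the squares of the decimal digits of n, with no heap allocation
// and no string conversion.
int squareOfDigitsSum(int n)
{
    int squaresum = 0;
    for (int temp = n; temp > 0; temp /= 10)
    {
        int digit = temp % 10;      // last decimal digit
        squaresum += digit * digit; // no new[], no itoa(), no atoi(), no pow()
    }
    return squaresum;
}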

Edison Gustavo Muenz
For what it's worth, if you're going to allocate a 10-byte array every time, why not just declare it as a 10-byte array?
Chris Lutz
I agree with Chris. "char buffer[10];" avoids all allocation in the first place.
Mecki
A: 

Here's some food for thought: if given the choice of running a 1979 algorithm for finding prime numbers on a 2009 computer, or a 2009 algorithm on a 1979 computer, which would you choose?

The new algorithm on ancient hardware would be the better choice by a huge margin. Have a look at your "helper" functions.

Kristoffon
This may be true in certain circumstances, but it's certainly not true in the case you listed. The Sieve of Eratosthenes is still a relatively optimal algorithm (well known in 1979) - depending on what you mean by "finding prime numbers", anyhow - and 1979 hardware lacked the addressing and registers to optimally support truly modern algorithms. Keep in mind that many modern algorithms are made so by the fact that they leverage impressive improvements in CPU instruction sets and addressing widths.
Nick Bastin
+108  A: 

For 100000 elements, the Python code took 6.9 seconds while the C++ originally took over 37 seconds.

I did some basic optimizations on your code and managed to get the C++ version more than 100 times faster than the Python implementation. It now does 100000 elements in 0.06 seconds. That is 617 times faster than the original C++ code.

The most important thing is to compile in Release mode, with all optimizations. This code is literally orders of magnitude slower in Debug mode.

Next, I will explain the optimizations I did.

  • Moved all vector declarations outside of the loop and replaced the per-iteration construction with a clear() call, which is much faster than constructing and destroying the vector each time.
  • Replaced the call to pow(value, 2) by a multiplication : value * value.
  • Instead of having a squares vector and calling sum on it, I sum the values in-place using just an integer.
  • Avoided all string operations, which are very slow compared to integer operations. For instance, it is possible to compute the squares of each digit by repeatedly taking the value modulo 10 to get the last digit and then dividing by 10, instead of converting the value to a string and then each character back to int.
  • Avoided all vector copies, first by replacing passing by value with passing by reference, and finally by eliminating the helper functions completely.
  • Eliminated a few temporary variables.
  • And probably many small details I forgot. Compare your code and mine side-by-side to see exactly what I did.

It may be possible to optimize the code even more by using pre-allocated arrays instead of vectors, but this would be a bit more work and I'll leave it as an exercise to the reader. :P

Here's the optimized code :

#include <iostream>
#include <vector>
#include <string>
#include <ctime>
#include <algorithm>
#include <windows.h>

using namespace std;

void calcMain(int upperBound, vector<int>& known);

int main()
{
    while(true)
    {
        vector<int> results;
        int upperBound;
        cout << "Pick an upper bound: ";
        cin >> upperBound;
        long start, end;
        start = GetTickCount();
        calcMain(upperBound, results);
        end = GetTickCount();
        for (size_t i = 0; i < results.size(); ++i) {
            cout << results[i] << ", ";
        }
        cout << endl;
        double seconds = (double)(end-start) / 1000.0;
        cout << seconds << " seconds." << endl << endl;
    }
    return 0;
}

void calcMain(int upperBound, vector<int>& known)
{
    vector<int> history;
    for(int i = 0; i <= upperBound; i++)
    {
        int current = i;
        history.clear();
        while(true)
        {
                int temp = current;
                int sum = 0;
                while (temp > 0) {
                    sum += (temp % 10) * (temp % 10);
                    temp /= 10;
                }
                current = sum;
                if(find(history.begin(), history.end(), current) != history.end())
                {
                        if(current == 1)
                        {
                                known.push_back(i);
                        }
                        break;
                }
                history.push_back(current);
        }
    }
}
Dr_Asik
+1 That's what I call a good answer.
Gareth Davis
Thanks! You're my first positive comment on SO. :)
Dr_Asik
It would probably be better to explain the changes with comments, so it's more clear what is happening.
Wow. That's THE answer.
ilya n.
You should compute temp%10 only once (maybe the compiler saw it, who knows...). And if memory is not an issue, we should be able to cache all results thanks to a vector of tribools, which would provide O(1) access to the cached results.
Luc Hermitte
OK, my algorithm isn't faster, and I think I know why -- my caching only helps with small values :)
ilya n.
+1 - perfect answer :)
Mecki
Stop spamming, please. If people read your answer, they will upvote it. Don't shove it down their throats.
Hooked
Oh, yeah, +1 for pass-by-reference. Some people INSIST that passing by reference does not reduce the overhead of copying.
Hooked
@Hooked: Well, I'm mostly trying to correct my earlier mistake - and I think I corrected it :)
ilya n.
@Hooked: pass by value or reference...a reference is a pointer: a 32 or 64-bit object. If the value is an integer type there's no point in passing by reference (unless the value is being modified). It will actually be slower because of the extra indirection.
Zan Lynx
@Zan: Unless everything I know in this world is false, most compilers will be able to optimize an integer anyways. Since he is passing a non-primitive object (a vector instance), it will speed things up.
Hooked
This code still looks slow to me. Remove all that <vector> and iterator stuff and let's use REAL vectors that generate proper assembly code. Good job on beating Python ! :)
toto
For my pleasure, I did a much faster version using only one array and a few basic instructions, but for the purposes of answering this question, I kept the general structure used by the author so he could easily compare between his version and mine.
Dr_Asik
@toto: Since when did vectors and iterators become inefficient?
jalf
+3  A: 

Well, I also gave it a once-over. I didn't test or even compile, though.

General rules for numerical programs:

  • Never process numbers as text. That's what makes lesser languages than Python slow, so if you do it in C, the program will be slower than Python.

  • Don't use data structures if you can avoid them. You were building an array just to add the numbers up. Better keep a running total.

  • Keep a copy of the STL reference open so you can use it rather than writing your own functions.


void calcMain(int upperBound)
{
    vector<int> known;
    for(int i = 0; i <= upperBound; i++)
    {
     int current = i;
     vector<int> history;
     do
     {
      int squaresum = 0;
      for ( ; current; current /= 10 )
      {
       int digit = current % 10;
       squaresum += digit * digit;
      }
      current = squaresum;
      history.push_back(current);
     } while ( ! count(history.begin(), history.end() - 1, current) );

     if(current == 1)
     {
      known.push_back(i);
      //cout << i << "\t";
     }

    }
    //cout << "\n\n";
}
Potatoswatter
I like the running total idea.
hughdbrown
Potatoswatter
"lesser languages like Python". What? First of all, Python is a high level language (higher than C, although that qualifies as "high level" as well). That's it - that's the only comparison you can make that has any Computer Science relevance. Also, Python doesn't treat numbers as text, so I don't know where you're even getting off there.
Nick Bastin
Fat Elvis
@Nick: My sentence doesn't even make sense if you change it to that. Read closer next time ;v) .
Potatoswatter
@Elvis: I was just thinking about it again, and although excluding 0 from candidacy would be a fine optimization, it would also work very well to check for <= 4 and then test for == 1 afterward.
Potatoswatter
-1 for "Never process numbers as text. That's what makes lesser languages than Python slow" and "I didn't test or even compile"...
dbr
OK people, "lesser languages THAN Python" means languages which aren't as good as Python. I'm not impressed with the reading comprehension around here…
Potatoswatter
A: 

There are quite a few optimizations possible:

(1) Use const references

bool inVector(int inQuestion, const vector<int>& known)
{
    for(vector<int>::const_iterator it = known.begin(); it != known.end(); ++it)
     if(*it == inQuestion)
      return true;
    return false;
}

int sum(const vector<int>& given)
{
    int sum = 0;
    for(vector<int>::const_iterator it = given.begin(); it != given.end(); ++it)
     sum += *it;
    return sum;
}

(2) Use counting down loops

int pow(int given, int power)
{
    int current = 1;
    while(power--)
     current *= given;
    return current;
}

Or, as others have said, use the standard library code.

(3) Don't allocate buffers where not required

  vector<int> squares;
  for (int temp = current; temp != 0; temp /= 10)
  {
   squares.push_back(pow(temp % 10, 2));
  }
hughdbrown
All I would say here is that const does not affect performance (mainly because const can lie), and using counting-down loops is no more efficient than counting-up loops; the compiler will generate optimal code either way.
Skizz
Not sure what has changed on count-down loops; the compiler can take advantage of instructions that compare to 0 in a single operation. That said, if I were writing this code again, I wish I'd noticed that the only call to pow() always squares the number, or that the only use of sum() is to add up numbers created in the loop on the line above. I mean, how obvious is that? I should have caught that and eliminated pow() *and* sum(), or noticed that the inVector() function is trivially replaced by an STL find() call.
hughdbrown
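For reference, a sketch of the STL find() replacement mentioned in the comment above; it is a drop-in for the question's inVector(), also taking the vector by const reference so it is no longer copied on every call:

#include <algorithm>
#include <vector>

bool inVector(int inQuestion, const std::vector<int>& known)
{
    // std::find performs the same linear scan the hand-written loop did.
    return std::find(known.begin(), known.end(), inQuestion) != known.end();
}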
A: 

With optimizations similar to Potatoswatter's, I got the time for 10000 numbers down from 1.063 seconds to 0.062 seconds (except that I replaced itoa with standard sprintf in the original).

With all the memory optimizations (don't pass containers by value - in C++ you have to explicitly decide whether you want a copy or a reference; move operations that allocate memory out of inner loops; if you already have the number in a char buffer, what's the point of copying it to a std::string, etc.) I got it down to 0.532 seconds.

The rest of the improvement came from using %10 to access the digits, rather than converting numbers to strings.

I suppose there might be another algorithm-level optimization (numbers that you encounter while finding a happy number are themselves also happy numbers, aren't they?), but I don't know how much that gains (there aren't that many happy numbers in the first place), and this optimization is not in the Python version either.

By the way, by not using string conversion and a list to square the digits, I got the Python version from 0.825 seconds down to 0.33 seconds too.

UncleBens
+15  A: 

There's a new, radically faster version as a separate answer, so this answer is deprecated.


I rewrote your algorithm, making it cache results whenever it finds a number to be happy or unhappy. I also tried to make it as Pythonic as I could, for example by creating separate digits() and happy() functions. Sorry for using Python 3, but I get to show off a couple of useful things from it as well.

This version is much faster, though since I'm in the middle of some other work and my MacBook is busy, I didn't take exact measurements. It runs in 1.7s, which is 10 times faster than your original program's 18s (well, my MacBook is quite old and slow :) )

(FYI, I've been rewriting this thing for a while and this is like the 5th version :) )

#!/usr/bin/env python3

from timeit import Timer
from itertools import count

print_numbers = False
upperBound = 10**5  # Default value, can be overridden by user.


def digits(x:'nonnegative number') -> "yields number's digits":
    if not (x >= 0): raise ValueError('Number should be nonnegative')
    while x:
        yield x % 10
        x //= 10


def happy(number, known = {1}, happies = {1}) -> 'True/None':
    '''This function tells if the number is happy or not, caching results.

    It uses two static variables, parameters known and happies; the
    first one contains known happy and unhappy numbers; the second 
    contains only happy ones.

    If you want, you can pass your own known and happies arguments. If
    you do, you should keep the assumption commented out on the 1 line.

    '''

#        assert 1 in known and happies <= known  # <= is expensive

    if number in known:
        return number in happies

    history = set()
    while True:
        history.add(number)
        number = sum(x**2 for x in digits(number))
        if number in known or number in history:
            break

    known.update(history)
    if number in happies:
        happies.update(history)
        return True


def calcMain():
    happies = {x for x in range(upperBound) if happy(x) }
    if print_numbers:
        print(happies)


if __name__ == '__main__':
    upperBound = eval(
            input("Pick an upper bound [default {0}]: "
                    .format(upperBound)).strip()
            or repr(upperBound))
    result = Timer(calcMain).timeit(1)
    print ('This computation took {0} seconds'.format(result))
ilya n.
Just benchmarked your solution. On the first run, it is indeed about 10 times faster. On the second run, the code has already been compiled, and it jumps to 200+ times faster. Nice job! It's almost on par with the C++ code I proposed (which runs 600+ times faster than the original Python implementation). Your code seems correct, but just to nitpick, you are missing the upper bound, probably due to an off-by-one error. For instance, if I input 1000, 1000 should be among the happy numbers, but the result list stops at 998.
Dr_Asik
It depends on how you define upperBound :) In Python, most functions that accept a `stop` parameter return elements up to `stop-1`, e.g. `range(N)` or `a[:N]`.
ilya n.
Note that you don't need the history list. Following Wikipedia's happy numbers page, there is only one repeating sequence of numbers besides 1,1,1,1 and 0,0,0. Therefore, you can test number <= 4 and end the loop, and it's a happy number if number == 1.
Potatoswatter
I use history in this function for caching: at the end the whole thing is added into known and, if found happy, into happies. Without caching, this function would run slower. See my other answer for a radically different optimization.
ilya n.
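For completeness, a sketch of the shortcut Potatoswatter describes in the comment above, relying on the property that every unhappy trajectory eventually falls into the cycle containing 4, so no history container is needed at all (isHappy is a name made up for the sketch; 0 is treated as unhappy):

bool isHappy(int n)
{
    if (n <= 0)
        return false;
    while (n != 1 && n != 4) // unhappy trajectories always reach 4
    {
        int sum = 0;
        for (int temp = n; temp > 0; temp /= 10)
        {
            int digit = temp % 10;
            sum += digit * digit;
        }
        n = sum;
    }
    return n == 1;
}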
A: 

Here's a C# version:

using System;
using System.Collections.Generic;
using System.Text;

namespace CSharp
{
  class Program
  {
    static void Main (string [] args)
    {
      while (true)
      {
        Console.Write ("Pick an upper bound: ");

        String
          input = Console.ReadLine ();

        uint
          upper_bound;

        if (uint.TryParse (input, out upper_bound))
        {
          DateTime
            start = DateTime.Now;

          CalcHappyNumbers (upper_bound);

          DateTime
            end = DateTime.Now;

          TimeSpan
            span = end - start;

          Console.WriteLine ("Time taken = " + span.TotalSeconds + " seconds.");
        }
        else
        {
          Console.WriteLine ("Error in input, unable to parse '" + input + "'.");
        }
      }
    }

    enum State
    {
      Happy,
      Sad,
      Unknown
    }

    static void CalcHappyNumbers (uint upper_bound)
    {
      SortedDictionary<uint, State>
        happy = new SortedDictionary<uint, State> ();

      SortedDictionary<uint, bool>
        happy_numbers = new SortedDictionary<uint, bool> ();

      happy [1] = State.Happy;
      happy_numbers [1] = true;

      for (uint current = 2 ; current < upper_bound ; ++current)
      {
        FindState (ref happy, ref happy_numbers, current);
      }

      //foreach (KeyValuePair<uint, bool> pair in happy_numbers)
      //{
      //  Console.Write (pair.Key.ToString () + ", ");
      //}

      //Console.WriteLine ("");
    }

    static State FindState (ref SortedDictionary<uint, State> happy, ref SortedDictionary<uint,bool> happy_numbers, uint value)
    {
      State
        current_state;

      if (happy.TryGetValue (value, out current_state))
      {
        if (current_state == State.Unknown)
        {
          happy [value] = State.Sad;
        }
      }
      else
      {
        happy [value] = current_state = State.Unknown;

        uint
          new_value = 0;

        for (uint i = value ; i != 0 ; i /= 10)
        {
          uint
            lsd = i % 10;

          new_value += lsd * lsd;
        }

        if (new_value == 1)
        {
          current_state = State.Happy;
        }
        else
        {
          current_state = FindState (ref happy, ref happy_numbers, new_value);
        }

        if (current_state == State.Happy)
        {
          happy_numbers [value] = true;
        }

        happy [value] = current_state;
      }

      return current_state;
    }
  }
}

I compared it against Dr_Asik's C++ code. For an upper bound of 100000, the C++ version ran in about 2.9 seconds and the C# version in 0.35 seconds. Both were compiled with Dev Studio 2005 using default release build options, and both were executed from a command prompt.

Skizz

I get similar times for your C# code, about 0.3 seconds, but for the same amount of data on the same computer, my code runs in 0.047 seconds. That is, of course, the duration of the calcMain routine, as reported by the program, not the total running time of the program, which, in its current form, spends 99% of its time just printing the results to the screen, and printing 100000 numbers does take a few seconds.
Dr_Asik
On my machine, with VS2008 default release mode for both and the printing section commented out as Dr_Asik said: C# = 0.167, C++ = 0.031.
AraK
I think I've beaten both you and Dr_Asik--I've got a C# version that on a warm JIT runs in ~0.05 seconds, according to .NET's StopWatch class (without printing anything). You can find the code at http://pastebin.com/f5616ddda
B.R.
Dr_Asik, I was using the code from the first edit to your post and I see you've re-factored that version into the current, much faster, version whilst I worked on the C# version above.
Skizz
+4  A: 

This is my second answer, which caches things like the sum of squares for values <= 10**6:

        happy_list[sq_list[x%happy_base] + sq_list[x//happy_base]]

That is,

  • the number is split into 3 digits + 3 digits
  • the precomputed table is used to get sum of squares for both parts
  • these two results are added
  • the precomputed table is consulted to get the happiness of the number.

I don't think the Python version can be made much faster than that (OK, if you throw away the fallback to the old version, that is, the try: overhead, it's 10% faster).

I think this is an excellent question which shows that, indeed,

  • things that have to be fast should be written in C
  • however, you usually don't need things to be fast (even if you needed the program to run for a day, it would be less than the combined time the programmers would spend optimizing it)
  • it's easier and faster to write programs in Python
  • but for some problems, especially computational ones, a C++ solution, like the ones above, is actually more readable and more beautiful than an attempt to optimize the Python program.


Ok, here it goes (2nd version now...):

#!/usr/bin/env python3
'''Provides slower and faster versions of a function to compute happy numbers.

slow_happy() implements the algorithm as in the definition of happy
numbers (but also caches the results).

happy() uses the precomputed lists of sums of squares and happy numbers
to return result in just 3 list lookups and 3 arithmetic operations for
numbers less than 10**6; it falls back to slow_happy() for big numbers.

Utilities: digits() generator, my_timeit() context manager.

'''


from time import time  # For my_timeit.
from random import randint # For example with random number.

upperBound = 10**5  # Default value, can be overridden by user.


class my_timeit:
    '''Very simple timing context manager.'''

    def __init__(self, message):
        self.message = message
        self.start = time()

    def __enter__(self):
        return self

    def __exit__(self, *data):
        print(self.message.format(time() - self.start))


def digits(x:'nonnegative number') -> "yields number's digits":
    if not (x >= 0): raise ValueError('Number should be nonnegative')
    while x:
        yield x % 10
        x //= 10


def slow_happy(number, known = {1}, happies = {1}) -> 'True/None':
    '''Tell if the number is happy or not, caching results.

    It uses two static variables, parameters known and happies; the
    first one contains known happy and unhappy numbers; the second 
    contains only happy ones.

    If you want, you can pass your own known and happies arguments. If
    you do, you should keep the assumption commented out on the 1 line.

    '''
    # This is commented out because <= is expensive.
    # assert {1} <= happies <= known 

    if number in known:
        return number in happies

    history = set()
    while True:
        history.add(number)
        number = sum(x**2 for x in digits(number))
        if number in known or number in history:
            break

    known.update(history)
    if number in happies:
        happies.update(history)
        return True


# This will define new happy() to be much faster ------------------------.

with my_timeit('Preparation time was {0} seconds.\n'):

    LogAbsoluteUpperBound = 6 # The maximum possible number is 10**this.
    happy_list = [slow_happy(x)
                  for x in range(81*LogAbsoluteUpperBound + 1)]
    happy_base = 10**((LogAbsoluteUpperBound + 1)//2)
    sq_list = [sum(d**2 for d in digits(x))
               for x in range(happy_base + 1)]

    def happy(x):
        '''Tell if the number is happy, optimized for smaller numbers.

        This function works fast for numbers <= 10**LogAbsoluteUpperBound.

        '''
        try:
            return happy_list[sq_list[x%happy_base] + sq_list[x//happy_base]]
        except IndexError:
            return slow_happy(x)

# End of happy()'s redefinition -----------------------------------------.


def calcMain(print_numbers, upper_bound):
    happies = [x for x in range(upper_bound + 1) if happy(x)]
    if print_numbers:
        print(happies)


if __name__ == '__main__':
    while True:

        upperBound = eval(input(
            "Pick an upper bound [{0} default, 0 ends, negative number prints]: "
            .format(upperBound)).strip() or repr(upperBound))
        if not upperBound:
            break

        with my_timeit('This computation took {0} seconds.'):
            calcMain(upperBound < 0, abs(upperBound))

        single = 0
        while not happy(single):
            single = randint(1, 10**12)
        print('FYI, {0} is {1}.\n'.format(single,
                    'happy' if happy(single) else 'unhappy')) 

    print('Nice to see you, goodbye!')
ilya n.
I love your examples of fast Python code, and in general I agree with the points you make here. In this particular case, however, I find the C++ code to be not only much faster but also, surprisingly, more straightforward. Just for my personal pleasure (I won't update my post above anymore), I've written an even faster C++ implementation that uses only 14 basic lines of code, with no caching and no STL functions other than push_back. It does 1 million elements in 0.109s. C++ can be beautiful at times, and mathematical algorithms like this are a good candidate for that.
Dr_Asik
When you put it this way, I think you're completely right. I'm actually not happy about the style of my "fast Python version", which just gets closer and closer to C as I try to optimize it -- so there obviously should be a call to an external C function. A problem like this is indeed the kind of thing that should be done in languages that map to the hardware. One thing I like about Python, though, is that I inadvertently created several useful tools without even trying, like `with time_it(...)`, which goes directly into my standard library!
ilya n.
+1  A: 

Here is another way that relies on memoizing all the numbers already explored. I obtain a factor of 4-5x, which is oddly stable against Dr_Asik's code for 1000 and 1000000; I expected the cache to be more efficient the more numbers we explored. Otherwise, the same kind of classic optimizations have been applied. BTW, if the compiler supports NRVO (named return value optimization) or rvalue references, we wouldn't need to pass the vector as an out parameter.

NB: micro-optimizations are still possible IMHO, and moreover the caching is naive as it allocates much more memory than really needed.

#include <cstddef>
#include <iostream>
#include <vector>

enum Status {
    never_seen,
    being_explored,
    happy,
    unhappy
};

char const* toString[] = { "never_seen", "being_explored", "happy", "unhappy" };


inline size_t sum_squares(size_t i) {
    size_t s = 0;
    while (i) {
        const size_t digit = i%10;
        s += digit * digit;
        i /= 10;
    }
    return s ;
}

struct Cache {
    Cache(size_t dim) : m_cache(dim, never_seen) {}
    void set(size_t n, Status status) {
        if (m_cache.size() <= n) {
            m_cache.resize(n+1, never_seen);
        }
        m_cache[n] = status;
        // std::cout << "(c[" << n << "]<-"<<toString[status] << ")";
    }
    Status operator[](size_t n) const {
        if (m_cache.size() <= n) {
            return never_seen;
        } else {
            return m_cache[n];
        }
    }

private:
    std::vector<Status> m_cache;
};

void search_happy_lh(size_t upper_bound, std::vector<size_t> & happy_numbers)
{
    happy_numbers.clear();
    happy_numbers.reserve(upper_bound); // it doesn't improve much the performances

    Cache cache(upper_bound+1);
    std::vector<size_t> current_stack;

    cache.set(1,happy);
    happy_numbers.push_back(1);
    for (size_t i = 2; i<=upper_bound ; ++i) {
        // std::cout << "\r" << i << std::flush;
        current_stack.clear();
        size_t s= i;
        while ( s != 1 && cache[s]==never_seen)
        {
            current_stack.push_back(s);
            cache.set(s, being_explored);
            s = sum_squares(s);
            // std::cout << " - " << s << std::flush;
        }
        const Status update_with = (cache[s]==being_explored ||cache[s]==unhappy) ? unhappy : happy;
        // std::cout << " => " << s << ":" << toString[update_with] << std::endl;
        for (size_t j=0; j!=current_stack.size(); ++j) {
            cache.set(current_stack[j], update_with);
        }
        if (cache[i] == happy) {
            happy_numbers.push_back(i);
        }
    }
}
Luc Hermitte
I came up with a similar answer with vector<bool>s, but using a native-sized int beat the pants off it (by 4x on my laptop).
MSN
The reason it doesn't become significantly more efficient the more numbers you explore is probably explained by the fact that a number above 1000 becomes smaller the first time it is summed. I don't think it's necessary to cache anything above 1000.
Jay Keegan
Indeed, I think it is tied to the fact that numbers up to 10^n are reduced to numbers smaller than 81*n, which is quite small. What surprised me is that with upper bounds of 10^3, 10^6 and 10^8 there was an almost linear speedup, with a factor between 4.6x and 5.2x. When I have the time, I'll check with bigger numbers.
Luc Hermitte
+2  A: 

Just to get a little more closure on this issue by seeing how fast I could truly find these numbers, I wrote a multithreaded C++ implementation of Dr_Asik's algorithm. There are two important things to realize about the fact that this implementation is multithreaded.

  1. More threads do not necessarily lead to better execution times; there is a happy medium for every situation, depending on the volume of numbers you want to calculate.

  2. If you compare the times between this version running with one thread and the original version, the only factors that could cause a difference in time are the overhead from starting the thread and variable system performance issues. Otherwise, the algorithm is the same.

The code for this implementation (all credit for the algorithm goes to Dr_Asik) is here. Also, I ran some speed tests, with a double check for each test, to help back up these points.

Calculation of the first 100,000,000 happy numbers:

Original - 39.061 / 39.000 (Dr_Asik's original implementation)
1 Thread - 39.000 / 39.079
2 Threads - 19.750 / 19.890
10 Threads - 11.872 / 11.888
30 Threads - 10.764 / 10.827
50 Threads - 10.624 / 10.561 <--
100 Threads - 11.060 / 11.216
500 Threads - 13.385 / 12.527

From these results it looks like our happy medium is about 50 threads, plus or minus ten or so.
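The linked code is not reproduced here, but the general shape of splitting the range across workers looks roughly like the sketch below (an illustration with C++11 std::thread, not Jay's actual implementation; the per-number test uses the 4-cycle shortcut discussed elsewhere in this thread, and the thread count of 4 is arbitrary):

#include <iostream>
#include <thread>
#include <vector>

// Happiness test: unhappy trajectories always end in the cycle containing 4.
static bool isHappy(int n)
{
    if (n <= 0)
        return false;
    while (n != 1 && n != 4)
    {
        int sum = 0;
        for (int temp = n; temp > 0; temp /= 10)
            sum += (temp % 10) * (temp % 10);
        n = sum;
    }
    return n == 1;
}

int main()
{
    const int upperBound = 100000;
    const int numThreads = 4;
    std::vector<std::vector<int>> results(numThreads); // one result list per thread, so no locking
    std::vector<std::thread> workers;

    for (int t = 0; t < numThreads; ++t)
    {
        // Each worker gets a contiguous slice [begin, end) of 1..upperBound.
        int begin = t * (upperBound / numThreads) + 1;
        int end = (t == numThreads - 1) ? upperBound + 1
                                        : (t + 1) * (upperBound / numThreads) + 1;
        workers.emplace_back([begin, end, t, &results]() {
            for (int i = begin; i < end; ++i)
                if (isHappy(i))
                    results[t].push_back(i);
        });
    }
    for (std::thread& w : workers)
        w.join();

    std::size_t total = 0;
    for (const std::vector<int>& r : results)
        total += r.size();
    std::cout << total << " happy numbers up to " << upperBound << std::endl;
    return 0;
}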

Jay Keegan
What do you get if you remove the Sleep(1000)?
MSN
I regret that mistake, which stemmed from earlier debugging, and I am fixing it, along with some wording in the post, right now.
Jay Keegan
I get a better speedup (5x) just by caching the numbers explored (see my post). More threads are not necessarily better; there is no need to have more threads than cores/processors. Moreover, there should be heavy concurrent access on the first 81*log10(upperbound) (= 9*81) numbers. For the following numbers, multiple threads could be used without any lock! (See Herb Sutter's columns on concurrency.) BTW, the new vector<> is a useless slowdown of the code.
Luc Hermitte
+1  A: 

Other optimizations: by using plain arrays indexed directly rather than searching in a vector, and by caching prior sums, the following code (inspired by Dr_Asik's answer but probably not optimized at all) runs 2445 times faster than the original C++ code and about 400 times faster than the Python code.

#include <iostream>
#include <windows.h>
#include <vector>

void calcMain(int upperBound, std::vector<int>& known)
{
    int tempDigitCounter = upperBound;
    int numDigits = 0;
    while (tempDigitCounter > 0)
    {
     numDigits++;
     tempDigitCounter /= 10;
    }
    int maxSlots = numDigits * 9 * 9;
    int* history = new int[maxSlots + 1];
    for (int jj = 0; jj <= maxSlots; jj++)
    {
     history[jj] = -1; // mark every slot as not yet visited in any iteration
    }

    int* cache = new int[upperBound+1];
    for (int jj = 0; jj <= upperBound; jj++)
    {
     cache[jj] = 0;
    }

    int current, sum, temp;
    for(int i = 0; i <= upperBound; i++)
    {
        current = i;
     while(true)
        {
      sum = 0;
      temp = current;

      bool inRange = temp <= upperBound;
      if (inRange)
      {
       int cached = cache[temp];
       if (cached)
       {
        sum = cached;
       }
      }

      if (sum == 0)
      {
       while (temp > 0)
       {
        int tempMod = temp % 10;
        sum += tempMod * tempMod;
        temp /= 10;
       }
       if (inRange)
       {
        cache[current] = sum;
       }
      }
      current = sum;
         if(history[current] == i)
         {
          if(current == 1)
          {
        known.push_back(i);
          }
          break;
      }
         history[current] = i;
        }
    }
    delete[] history; // release the scratch arrays allocated above
    delete[] cache;
}

int main()
{
    while(true)
    {
        int upperBound;
     std::vector<int> known;
     std::cout << "Pick an upper bound: ";
     std::cin >> upperBound;
     long start, end;
     start = GetTickCount();
        calcMain(upperBound, known);
     end = GetTickCount();
     for (size_t i = 0; i < known.size(); ++i) {
      std::cout << known[i] << ", ";
        }      
     double seconds = (double)(end-start) / 1000.0;
     std::cout << std::endl << seconds << " seconds." << std::endl << std::endl;
    }
    return 0;
}
JRL
A: 

#!/usr/bin/env python

import timeit

upperBound = 0

def calcMain():
    known = set()
    for i in xrange(0,upperBound+1):
        next = False
        current = i
        history = set()
        while not next:
            squaresum=0
            while current > 0:
                current, digit = divmod(current, 10)
                squaresum += digit * digit
            current = squaresum
            if current in history:
                next = True
                if current == 1:
                    known.add(i)
            history.add(current)

while True:
    upperBound = input("Pick an upper bound: ")
    result = timeit.Timer(calcMain).timeit(1)
    print result, "seconds.\n"

I made a couple of minor changes to your original Python code example that give roughly a 3x improvement in performance. The changes took the 100,000 case from about 9.64 seconds to about 3.38 seconds.

The major change was doing the mod-10 digit extraction and accumulation in a while loop instead of going through a string. I made a couple of other changes that improved execution time by only fractions of hundredths of seconds. The first minor change was switching the main for loop from a range() list to an xrange() iterator. The second minor change was substituting the set class for the list class for both the known and history variables. I also experimented with iterator comprehensions and precalculating the squares, but they both had negative effects on the efficiency. I seem to be running a slower version of Python or a slower processor than some of the other contributors. I would be interested in the results of someone else's timing comparison of my Python code against one of the optimized C++ versions of the same algorithm. I also tried using the Python -O and -OO optimizations, but they had the reverse of the intended effect.

freegnu
A: 

Why is everyone using a vector in the C++ version? Lookup time is O(N).

Even though it's not as efficient as the Python set, use std::set. Lookup time is O(log N).
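A sketch of what that looks like for the per-number history check: the membership test becomes O(log N) instead of the O(N) scan that inVector() performs (inHistory is a name made up for the sketch; in C++11, std::unordered_set would give average O(1), closer to Python's set):

#include <set>

bool inHistory(int current, const std::set<int>& history)
{
    // std::set::count is a logarithmic-time lookup; inserting with
    // history.insert(current) is also O(log N).
    return history.count(current) != 0;
}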

robev