NOTE: This is a solution for Project Euler problem 14. If you still want to solve it yourself then don't read on.

The problem was to find the starting number below one million that produces the longest Collatz sequence (the sequence repeatedly halves even numbers and maps an odd n to 3n + 1 until it reaches 1). My initial code was as follows:

# Hashtable mapping a starting number to its chain length
$r = @{}

for($i = 1; $i -lt 1000000; $i++) {
    $n = 0
    $a = $i
    while ($a -gt 1) {
        # Cache hit: the rest of the chain length is already known
        if ($r[$a]) {
            $n += $r[$a]
            break
        }
        if ($a % 2 -eq 0) {
            $a /= 2
        } else {
            $a = $a * 3 + 1
        }
        $n++
    }
    $r[$i] = $n
}

# Sort by chain length; the last entry is the maximum
$r.GetEnumerator() | sort Value | select -l 1 | %{$_.Key}

which tries to use a hashtable as a cache for sub-sequence lengths already encountered, to save time. I was quite surprised when that script had a runtime of over eight minutes on my machine. Re-creating the same code in C#:

using System;
using System.Collections.Generic;

class Problem14
{
    public static void Main()
    {
        var ht = new Dictionary<long, int>();

        for (int i = 1; i < 1000000; i++)
        {
            int count = 0;
            long j = i;

            while (j > 1)
            {
                if (ht.ContainsKey(j))
                {
                    count += ht[j];
                    break;
                }
                if (j % 2 == 0)
                    j /= 2;
                else
                    j = 3 * j + 1;

                count++;
            }
            ht[i] = count;
        }

        KeyValuePair<long, int> max = new KeyValuePair<long, int>();
        foreach (var n in ht)
        {
            if (n.Value > max.Value)
                max = n;
        }
        Console.WriteLine(max.Key);
    }
}

had a runtime of just over one second. I know that speed of execution isn't a prime goal of PowerShell; it's an administration language, and for those tasks the ratio of PowerShell code to cmdlet invocations is likely very different from what I'm doing here.

Still, I don't know what exactly causes the slowdown.
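For anyone who wants to reproduce the timings, PowerShell's built-in Measure-Command cmdlet times a script block (the file name below is only a placeholder):

$elapsed = Measure-Command { .\problem14.ps1 }
"Runtime: {0:n1} seconds" -f $elapsed.TotalSeconds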

Suspecting the hash table, I replaced it with a plain array for caching. This resulted in an execution time of around 200 ms in C#, but about 32 minutes in PowerShell. The code was as follows:

# Pre-allocate an array of one million zeroes as the cache
$r = ,0*1000000

for($i = 1; $i -lt 1000000; $i++) {
    $n = 0
    $a = $i
    while ($a -gt 1) {
        # Cache hit: the rest of the chain length is already known
        if ($r[$a]) {
            $n += $r[$a]
            break
        }
        if ($a % 2 -eq 0) {
            $a /= 2
        } else {
            $a = $a * 3 + 1
        }
        $n++
    }
    # This guard is redundant ($i is always below one million here)
    if ($i -lt 1000000) {
        $r[$i] = $n
    }
}

$max = 0
for($i=1; $i -lt 1000000; $i++) {
    # -gt, not >: in PowerShell > is the redirection operator
    if ($r[$i] -gt $r[$max]) {
        $max = $i
    }
}
$max

and

using System;

class Problem14
{
    public static void Main()
    {
        var cache = new int[1000000];

        for (int i = 1; i < 1000000; i++)
        {
            int count = 0;
            long j = i;

            while (j > 1)
            {
                if (j < 1000000 && cache[j] != 0)
                {
                    count += cache[j];
                    break;
                }
                if (j % 2 == 0)
                    j /= 2;
                else
                    j = 3 * j + 1;

                count++;
            }
            cache[i] = count;
        }

        var max = 0;
        for (int i = 1; i < cache.Length; i++)
        {
            if (cache[i] > cache[max])
                max = i;
        }

        Console.WriteLine(max);
    }
}

A completely cacheless variant ran in about 1.2 seconds in C#. I haven't tried it in PowerShell yet.
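Presumably it would look something like the following (an untested sketch: the same loop with all caching removed):

$max = 0
$maxLen = 0
for($i = 1; $i -lt 1000000; $i++) {
    # Count the chain length for $i from scratch every time
    $n = 0
    $a = $i
    while ($a -gt 1) {
        if ($a % 2 -eq 0) { $a /= 2 } else { $a = $a * 3 + 1 }
        $n++
    }
    if ($n -gt $maxLen) {
        $maxLen = $n
        $max = $i
    }
}
$max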

Any ideas?

+2  A: 

First and foremost, PowerShell is an interpreted language (neither jitted nor compiled). That's always going to hurt. ;-)

Secondly, there are some language constructs you can use to avoid multiple interpreted steps. For example, instead of using a for(;;) statement, use a range (foo stands in for whatever the loop body does):

0..1000000 | foreach-object { foo $_ }

Might help a bit.
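Applied to the outer loop from the question, that might look like the following sketch (shown without the cache lookup for brevity):

$r = @{}
1..999999 | ForEach-Object {
    # Compute the Collatz chain length for each starting number
    $n = 0
    $a = $_
    while ($a -gt 1) {
        if ($a % 2 -eq 0) { $a /= 2 } else { $a = $a * 3 + 1 }
        $n++
    }
    $r[$_] = $n
}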

Most importantly, avoid excessive use of the break and continue keywords in loops; invert your logic if possible to avoid them. Internally, these keywords signal flow control using .NET exceptions, and throwing an exception is an expensive operation.
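As a sketch, the question's inner loop could be inverted by folding the cache check into the loop condition, so no break is needed:

$r = @{}
for($i = 1; $i -lt 1000000; $i++) {
    $n = 0
    $a = $i
    # Loop ends either at 1 or on a cache hit - no break required
    while ($a -gt 1 -and -not $r[$a]) {
        if ($a % 2 -eq 0) { $a /= 2 } else { $a = $a * 3 + 1 }
        $n++
    }
    if ($a -gt 1) { $n += $r[$a] }  # remainder of the chain is cached
    $r[$i] = $n
}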

Hope this helps your understanding somewhat.

EDIT: these keywords use .NET exceptions for signalling

From System.Management.Automation.FlowControlNode.Execute(...):

switch (this._tokenId)
{
    case TokenId.ExitToken:
    {
        int exitCode = this.GetExitCode(result);
        throw new ExitException(base.NodeToken, exitCode);
    }
    case TokenId.ReturnToken:
        throw new ReturnException(base.NodeToken, result);

    case TokenId.BreakToken:
        label = this.GetLabel(result, context);
        throw new BreakException(base.NodeToken, label);

    case TokenId.ContinueToken:
        label = this.GetLabel(result, context);
        throw new ContinueException(base.NodeToken, label);

    // ...
}
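A crude micro-benchmark makes that cost visible (a sketch; absolute numbers will vary by machine and PowerShell version):

# Compare a loop that breaks on every iteration with one that doesn't
$withBreak = Measure-Command {
    for ($i = 0; $i -lt 100000; $i++) {
        while ($true) { break }
    }
}
$noBreak = Measure-Command {
    for ($i = 0; $i -lt 100000; $i++) {
        $done = $false
        while (-not $done) { $done = $true }
    }
}
"with break: {0:n0} ms, without: {1:n0} ms" -f $withBreak.TotalMilliseconds, $noBreak.TotalMilliseconds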

-Oisin

x0n
At least in my benchmarks, the pipeline solution using ranges was always a little slower than an iterative one. A for loop doesn't involve creating a range and passing objects through the pipeline, for example :-)
Joey
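(A quick way to reproduce that comparison, as a sketch:)

$pipe = Measure-Command { 1..100000 | ForEach-Object { $_ * 2 } }
$loop = Measure-Command { for ($i = 1; $i -le 100000; $i++) { $i * 2 } }
"pipeline: {0:n0} ms, for loop: {1:n0} ms" -f $pipe.TotalMilliseconds, $loop.TotalMilliseconds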
Yeah, the pipeline will be slower. Anything that loops will be slower in PowerShell than in C#.
JasonMArcher
Oh, and may I ask for the source of the fact that break and continue are exceptions internally? Given how these otherwise fairly normal structured-programming keywords appear to behave, solving them with exceptions sounds a little awkward to me.
Joey
Edited the original post to show the code (from Reflector). Oh, and sorry about the pipeline/range hint; I got that backwards. You're right, the pipeline tends to be slower [in most cases].
x0n