I have a very large nested for loop in which some multiplications and additions are performed on floating point numbers.

for (int i = 0; i < length1; i++)
{
    // Sum the ten omega entries selected by the externally generated addresses.
    double aa = 0;
    for (int h = 0; h < 10; h++)
    {
        aa += omega[i][outsideGeneratedAddress[h]];
    }

    // Compute the rotation coefficients s and c.
    double alphaOld = alpha;
    alpha = Math.Sqrt(alpha * alpha + aa * aa);

    s = -aa / alpha;
    c = alphaOld / alpha;

    // Apply the rotation to u and to row i of omega.
    for (int j = 0; j <= i; j++)
    {
        double oldU = u[j];
        u[j] = c * oldU + s * omega[i][j];
        omega[i][j] = c * omega[i][j] - s * oldU;
    }
}

This loop is taking up the majority of my processing time and is a bottleneck.

Would I be likely to see any speed improvements if I rewrite this loop in C and interface to it from C#?

EDIT: I updated the code to show how s and c are generated. Also, the inner loop actually goes from 0 to i, though that probably doesn't make much difference to the question.

EDIT2: I implemented the algorithm in VC++, linked it into C# through a DLL, and saw a 28% speed boost over C# with all optimisations enabled. The flag that enables SSE2 works particularly well. Compiling with MinGW and gcc 4.4 only gave a 15% speed boost. I just tried the Intel compiler and saw a 49% speed boost for this code.
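
For reference, the build commands involved would look something like this (file name hypothetical): /O2, /arch:SSE2 and /LD are MSVC's flags for optimization, SSE2 code generation and DLL output, with -O3, -msse2 and -shared the rough gcc equivalents.

cl /O2 /arch:SSE2 /LD rotation.c
gcc -O3 -msse2 -shared -o rotation.dll rotation.c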

+1  A: 

Did you try parallel programming?

http://msdn.microsoft.com/en-us/library/system.threading.tasks.parallel.aspx
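
As an illustration, a minimal sketch of how the inner loop of the question could be parallelized (the outer iterations carry a dependency through alpha, but each j touches only u[j] and omega[i][j], so those iterations are independent; note the per-iteration work here may be too small for the threading overhead to pay off):

// Requires: using System.Threading.Tasks;
Parallel.For(0, i + 1, j =>
{
    double oldU = u[j];
    u[j] = c * oldU + s * omega[i][j];
    omega[i][j] = c * omega[i][j] - s * oldU;
});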

Raj Kaimal
I can't use parallel for loops unfortunately, because GetS and GetC depend on the generated values of omega.
Projectile Fish
Without knowing what GetS() and GetC() actually do, I'd be surprised if you could not partition the array processing to allow parallel computation. Each inner loop iteration only deals with the single values of i and j. You said that GetS() and GetC() have very low CPU usage, so I doubt they operate over the whole of the omega array. PP is where I'd put my effort.
Simon Chadwick
I just edited the code snippet to show how s and c are generated. Maybe it could be parallelized, since it's actually code based on a systolic array implementation, but I wouldn't know how to multithread that on a normal multicore CPU.
Projectile Fish
+3  A: 

It's highly unlikely that running this in native C/C++ would "automatically" speed things up. If you're good with SIMD (and length1 and length2 are large enough that the P/Invoke call is not significant) then maybe you could do something.

But the only way to know for sure would be to try it and profile.

Dean Harding
A: 

Highly doubt it. An inner loop that processes primitive types and doesn't allocate memory will be very damn efficient in C#. The native code is generated from the IL only once, so there shouldn't be much managed overhead.

Considering it's a pretty small function, you can profile both and see if there is any difference.

Igor Zevaka
+7  A: 

Use an unsafe block and pointers to index into your omega array. This removes the overhead of range checking and may be a significant win if you do enough accesses. A lot of time may also be spent in your GetS() and GetC() functions, for which you didn't provide source.
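
A minimal sketch of that approach for the inner loop, assuming u and each omega[i] are double[] as in the question (the arrays are fixed once, outside the j loop; see dtb's comment below):

unsafe
{
    fixed (double* pu = u, pRow = omega[i])
    {
        for (int j = 0; j <= i; j++)
        {
            double oldU = pu[j];              // raw accesses, no range checks
            pu[j] = c * oldU + s * pRow[j];
            pRow[j] = c * pRow[j] - s * oldU;
        }
    }
}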

Donnie
I profiled this code and GetS() and GetC() have very low CPU usage compared to the inner loop code. I'll try unsafe code, but in my experience unsafe blocks have tended to slow my code down. Maybe unsafe has an overhead I wasn't aware of that won't be as significant in a large loop.
Projectile Fish
You might want to post some of the unsafe code you wrote. There are few reasons that it would be slower.
Rusty
`fixed` statements tend to slow things down when put in a loop. The trick is to fix the array only once as outermost statement.
dtb
You probably want to avoid entering and exiting an unsafe context frequently since it probably has code-access security implications like a stack walk to verify permissions.
Josh Einstein
Would adding an unchecked{} block help?
Michael Stum
Isn't unchecked{} only useful if you have "check for arithmetic overflow/underflow" ticked in the build options?
Projectile Fish
+2  A: 

Simply using C or C++ will not give you much of a speed increase, if any. You also have the overhead of calling into the C routine, which is not a huge impact unless you are doing it many times in a loop.

Try some other things in C# first. If the variables are floats rather than doubles, that can slow down calculations. Also, as Raj said, parallel programming can give you a big speed boost.

Bill W
+1  A: 

For plain 64-bit arithmetic in Java, I saw around a 33% speedup (23 ns down to 16 ns) when porting it to C and fiddling with optimization flags (-fprofile-generate, -fprofile-use). It might be worth it.

The other thing is that omega[i][j] makes it look like omega is an array of arrays — you may get better performance with a two-dimensional array (I think the syntax is something like omega[i,j], but I forget how you allocate one).
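
For reference, the rectangular-array syntax being half-remembered here is (dimensions taken from the question):

double[,] omega = new double[length1, length2];  // allocation
double x = omega[i, j];                          // element access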

tc.
+8  A: 

Updated:

What happens if you write the inner loop to take account of locality of reference:

for (int i = 0; i < length1; i++) 
{ 
    s = GetS(i); 
    c = GetC(i); 
    double[] omegaTemp = omega[i]; 

    for(int j = 0; j < length2; j++) 
    { 
        double oldU = u[j]; 
        u[j] = c * oldU + s * omegaTemp[j]; 
        omegaTemp[j] = c * omegaTemp[j] - s * oldU; 
    } 
}

Mitch Wheat
If I put omegaTemp = omega[i] in the outer loop, then access omegaTemp[j] from the inner loop, I get a significant speedup.
Projectile Fish
That's the locality of reference I mentioned in the comments...
Mitch Wheat
Ah I see, thanks. Locality of reference works fairly well then. Why wouldn't the compiler see this automatically though?
Projectile Fish
If I recall correctly, the C# compiler can also optimize away bounds checking if your for loops are relatively simple. Mitch's suggestion may have also enabled this optimization. http://blogs.msdn.com/b/clrcodegeneration/archive/2009/08/13/array-bounds-check-elimination-in-the-clr.aspx
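
The canonical pattern the JIT recognizes is a loop bounded by the array's own Length property, e.g. (minimal sketch):

double sum = 0;
for (int j = 0; j < omegaTemp.Length; j++)
{
    sum += omegaTemp[j];  // the bounds check can be elided here
}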
Josh Einstein
That's an excellent article, Josh. Thx for posting.
Mitch Wheat
I've been fiddling with unsafe code for a couple of minutes now and I can't seem to get a better result than this surprisingly simple solution.
dtb
If the bounds check is indeed now being optimized away, I wouldn't expect much better perf with unsafe code.
Josh Einstein
Bah! Too late to edit. Before Eric Lippert rips me a new one I should note that I misspoke above... it's not a C# optimization, it's a JIT optimization. So it is not specific to C#.
Josh Einstein
@Josh: Good catch...and good to know. I think you have avoided raising the ire of the Lippert :)
Rusty
+3  A: 

You could try Mono.Simd to utilize the CPU more optimally.

http://tirania.org/blog/archive/2008/Nov-03.html

That being said, much can be gained in C# by manually hoisting duplicate expressions out of loops:

var outsideAddr0 = outsideGeneratedAddress[0];
var outsideAddr1 = outsideGeneratedAddress[1];
var outsideAddr2 = outsideGeneratedAddress[2];
var outsideAddr3 = outsideGeneratedAddress[3];
var outsideAddr4 = outsideGeneratedAddress[4];
var outsideAddr5 = outsideGeneratedAddress[5];
var outsideAddr6 = outsideGeneratedAddress[6];
var outsideAddr7 = outsideGeneratedAddress[7];
var outsideAddr8 = outsideGeneratedAddress[8];
var outsideAddr9 = outsideGeneratedAddress[9];
for (int i = 0; i < length1; i++)
{
  var omegaAtI = omega[i];
  double aa = 
   omegaAtI[outsideAddr0]
   + omegaAtI[outsideAddr1]
   + omegaAtI[outsideAddr2]
   + omegaAtI[outsideAddr3]
   + omegaAtI[outsideAddr4]
   + omegaAtI[outsideAddr5]
   + omegaAtI[outsideAddr6]
   + omegaAtI[outsideAddr7]
   + omegaAtI[outsideAddr8]
   + omegaAtI[outsideAddr9];

  double alphaOld = alpha;
  alpha = Math.Sqrt(alpha * alpha + aa * aa);

  var s = -aa / alpha;
  var c = alphaOld / alpha;

  for(int j = 0; j <= i; j++)
  {
    double oldU = u[j];
    var omegaAtIJ = omegaAtI[j];
    u[j] = c * oldU + s * omegaAtIJ;
    omegaAtI[j] = c * omegaAtIJ - s * oldU;
  }
}
Cine
A: 

Also consider the cost of marshalling data between managed and native code. C# execution is already pretty fast. You can also NGEN the assembly to generate a native image of it for faster execution.
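
For example (assembly name hypothetical):

ngen install MyMathLib.dll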

nitroxn
+2  A: 

.NET interop with unmanaged code is very slow, but you can still get the benefits of unmanaged memory by using the system API to allocate it.

You can call VirtualAlloc to allocate memory pages and then call VirtualLock to pin them in RAM so they are not swapped out.

This approach lets you perform calculations over large amounts of data at least three times faster than you could in managed memory.
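
A hedged P/Invoke sketch of that approach (the helper shown is illustrative, not a drop-in replacement for the question's arrays):

using System;
using System.Runtime.InteropServices;

static unsafe class NativeMemory
{
    const uint MEM_COMMIT = 0x1000, MEM_RESERVE = 0x2000, PAGE_READWRITE = 0x04;

    [DllImport("kernel32.dll", SetLastError = true)]
    static extern IntPtr VirtualAlloc(IntPtr lpAddress, UIntPtr dwSize,
                                      uint flAllocationType, uint flProtect);

    [DllImport("kernel32.dll", SetLastError = true)]
    static extern bool VirtualLock(IntPtr lpAddress, UIntPtr dwSize);

    // Allocates 'count' doubles in unmanaged memory and pins the pages in RAM.
    public static double* AllocLocked(int count)
    {
        var size = (UIntPtr)((ulong)count * sizeof(double));
        IntPtr p = VirtualAlloc(IntPtr.Zero, size,
                                MEM_COMMIT | MEM_RESERVE, PAGE_READWRITE);
        VirtualLock(p, size);
        return (double*)p;
    }
}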

Andrew_B
+1  A: 

While most of the other answers suggest that you look into C# solutions, most of them miss a point: C code for this method will be faster, provided you use a good optimizing compiler (I'd suggest Intel's, which works great for this kind of code).
The compiler will also save the JIT a bit of work and will yield much better compiled output (even the MSVC compiler can generate SSE2 instructions). Array bounds won't be checked by default, there will probably be some loop unrolling, and, all in all, you're likely to see a significant performance boost.
As has been properly pointed out, calling into native code has a bit of overhead; this should, however, be insignificant compared to the speedup if length1 is big enough.
You can of course keep this code in C#, but please remember that, compared to several C compilers, the CLR (like all other VMs I know of) does little to optimize the generated code.
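
For completeness, the C# side of such a port is just a P/Invoke declaration along these lines (DLL, function name and flattened-array layout are hypothetical; double[] is blittable, so the arrays are pinned for the duration of the call rather than copied):

using System.Runtime.InteropServices;

static class NativeRotation
{
    [DllImport("rotation.dll", CallingConvention = CallingConvention.Cdecl)]
    public static extern void RunRotation(double[] u, double[] omegaFlat,
                                          int[] outsideGeneratedAddress,
                                          int length1, int length2,
                                          ref double alpha);
}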

emaster70
A: 

I have no idea how practical this is, but did you think of trying to run this on a GPU? Perhaps using something like OpenCL or DirectCompute?

The dependencies and the square root might kill you, but GPUs have an order of magnitude more raw floating-point performance than CPUs these days.

Actually, I thought about this before and tried a quick experiment using Microsoft Accelerator, but I found it to be much slower. This was because I had to constantly transfer the u array between the GPU and the CPU, as I needed it directly after the loop. And since I run this entire algorithm many times per second, the cost of transferring was too high.
Projectile Fish