I have a very large nested for loop in which some multiplications and additions are performed on floating point numbers.

for (int i = 0; i < length1; i++)
{
    // Sum the ten omega entries selected by the externally generated addresses.
    double aa = 0;
    for (int h = 0; h < 10; h++)
    {
        aa += omega[i][outsideGeneratedAddress[h]];
    }

    // Compute the rotation coefficients s and c.
    double alphaOld = alpha;
    alpha = Math.Sqrt(alpha * alpha + aa * aa);

    s = -aa / alpha;
    c = alphaOld / alpha;

    // Apply the rotation to u and to row i of omega.
    for (int j = 0; j <= i; j++)
    {
        double oldU = u[j];
        u[j] = c * oldU + s * omega[i][j];
        omega[i][j] = c * omega[i][j] - s * oldU;
    }
}

This loop is taking up the majority of my processing time and is a bottleneck.

Would I be likely to see any speed improvements if I rewrite this loop in C and interface to it from C#?

EDIT: I updated the code to show how s and c are generated. Also, the inner loop actually goes from 0 to i, though that probably doesn't make much difference to the question.

EDIT2: I implemented the algorithm in VC++, linked it into C# through a DLL, and saw a 28% speed boost over C# with all optimisations enabled. The flag that enables SSE2 works particularly well. Compiling with MinGW and gcc 4.4 only gave a 15% speed boost. I just tried the Intel compiler and saw a 49% speed boost for this code.
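
For reference, the build commands involved would look something like this (file name hypothetical): /O2, /arch:SSE2 and /LD are MSVC's flags for optimization, SSE2 code generation and DLL output, with -O3, -msse2 and -shared the rough gcc equivalents.

cl /O2 /arch:SSE2 /LD rotation.c
gcc -O3 -msse2 -shared -o rotation.dll rotation.c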

+1  A: 

Did you try parallel programming?

http://msdn.microsoft.com/en-us/library/system.threading.tasks.parallel.aspx
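
As an illustration, a minimal sketch of how the inner loop of the question could be parallelized (the outer iterations carry a dependency through alpha, but each j touches only u[j] and omega[i][j], so those iterations are independent; note the per-iteration work here may be too small for the threading overhead to pay off):

// Requires: using System.Threading.Tasks;
Parallel.For(0, i + 1, j =>
{
    double oldU = u[j];
    u[j] = c * oldU + s * omega[i][j];
    omega[i][j] = c * omega[i][j] - s * oldU;
});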

Raj Kaimal
I can't use parallel for loops unfortunately, because GetS and GetC depend on the generated values of omega.
Projectile Fish
Without knowing what GetS() and GetC() actually do, I'd be surprised if you could not partition the array processing to allow parallel computation. Each inner loop iteration only deals with the single values of i and j. You said that GetS() and GetC() have very low CPU usage, so I doubt they operate over the whole of the omega array. PP is where I'd put my effort.
Simon Chadwick
I just edited the code snippet to show how s and c are generated. Maybe it could be parallelized, since it's actually code based on a systolic array implementation, but I wouldn't know how to multithread that on a normal multicore CPU.
Projectile Fish
+3  A: 

It's highly unlikely that running this in native C/C++ would "automatically" speed things up. If you're good with SIMD (and length1 and length2 are large enough that the P/Invoke call is not significant) then maybe you could do something.

But the only way to know for sure would be to try it and profile.

Dean Harding
A: 

Highly doubt it. An inner loop that processes primitive types and doesn't allocate memory will be very damn efficient in C#. The native code is generated from the IL only once, so there shouldn't be much managed overhead.

Considering it's a pretty small function, you can profile both and see if there is any difference.

Igor Zevaka
+7  A: 

Use an unsafe block and pointers to index into your omega array. This removes the overhead of range checking and may be a significant win if you do enough accesses. A lot of time may also be spent in your GetS() and GetC() functions, for which you didn't provide source.
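
A minimal sketch of that approach for the inner loop, assuming u and each omega[i] are double[] as in the question (the arrays are fixed once, outside the j loop; see dtb's comment below):

unsafe
{
    fixed (double* pu = u, pRow = omega[i])
    {
        for (int j = 0; j <= i; j++)
        {
            double oldU = pu[j];              // raw accesses, no range checks
            pu[j] = c * oldU + s * pRow[j];
            pRow[j] = c * pRow[j] - s * oldU;
        }
    }
}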

Donnie
I profiled this code and GetS() and GetC() have very low CPU usage compared to the inner loop code. I'll try unsafe code, but in my experience unsafe blocks have tended to slow my code down. Maybe unsafe has an overhead I wasn't aware of that won't be as significant in a large loop.
Projectile Fish
You might want to post some of the unsafe code you wrote. There are few reasons that it would be slower.
Rusty
`fixed` statements tend to slow things down when put in a loop. The trick is to fix the array only once as outermost statement.
dtb
You probably want to avoid entering and exiting an unsafe context frequently since it probably has code-access security implications like a stack walk to verify permissions.
Josh Einstein
Would adding an unchecked{} block help?
Michael Stum
Isn't unchecked{} only useful if you have "check for arithmetic overflow/underflow" ticked in the build options?
Projectile Fish
+2  A: 

Simply using C or C++ will not give you much of a speed increase, if any. You also have the overhead of calling into the C routine, which is not a huge impact unless you are doing it many times in a loop.

Try some other things in C# first. If the variables are floats rather than doubles, that can slow down calculations. Also, as Raj said, parallel programming can give you a big speed boost.

Bill W
+1  A: 

For plain 64-bit arithmetic in Java, I saw around a 33% speedup (23 ns down to 16 ns) when porting it to C and fiddling with optimization flags (-fprofile-generate, -fprofile-use). It might be worth it.

The other thing is that omega[i][j] makes it look like omega is an array of arrays — you may get better performance with a two-dimensional array (I think the syntax is something like omega[i,j], but I forget how you allocate one).
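
For reference, the rectangular-array syntax being half-remembered here is (dimensions taken from the question):

double[,] omega = new double[length1, length2];  // allocation
double x = omega[i, j];                          // element access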

tc.
+8  A: 

Updated:

What happens if you write the inner loop to take account of locality of reference:

for (int i = 0; i < length1; i++) 
{ 
    s = GetS(i); 
    c = GetC(i); 
    double[] omegaTemp = omega[i]; 

    for(int j = 0; j < length2; j++) 
    { 
        double oldU = u[j]; 
        u[j] = c * oldU + s * omegaTemp[j]; 
        omegaTemp[j] = c * omegaTemp[j] - s * oldU; 
    } 
}

Mitch Wheat
If I put omegaTemp = omega[i] in the outer loop, then access omegaTemp[j] from the inner loop, I get a significant speedup.
Projectile Fish
That's the locality of reference I mentioned in the comments...
Mitch Wheat
Ah I see, thanks. Locality of reference works fairly well then. Why wouldn't the compiler see this automatically though?
Projectile Fish
If I recall correctly, the C# compiler can also optimize away bounds checking if your for loops are relatively simple. Mitch's suggestion may have also enabled this optimization. http://blogs.msdn.com/b/clrcodegeneration/archive/2009/08/13/array-bounds-check-elimination-in-the-clr.aspx
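
The canonical pattern the JIT recognizes is a loop bounded by the array's own Length property, e.g. (minimal sketch):

double sum = 0;
for (int j = 0; j < omegaTemp.Length; j++)
{
    sum += omegaTemp[j];  // the bounds check can be elided here
}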
Josh Einstein
That's an excellent article, Josh. Thx for posting.
Mitch Wheat
I've been fiddling with unsafe code for a couple of minutes now and I can't seem to get a better result than this surprisingly simple solution.
dtb
If the bounds check is indeed now being optimized away, I wouldn't expect much better perf with unsafe code.
Josh Einstein
Bah! Too late to edit. Before Eric Lippert rips me a new one I should note that I misspoke above... it's not a C# optimization, it's a JIT optimization. So it is not specific to C#.
Josh Einstein
@Josh: Good catch...and good to know. I think you have avoided raising the ire of the Lippert :)
Rusty
+3  A: 

You could try Mono.Simd to utilize the CPU more optimally.

http://tirania.org/blog/archive/2008/Nov-03.html

That being said, much can be gained in C# by manually hoisting duplicate expressions out of loops:

var outsideAddr0 = outsideGeneratedAddress[0];
var outsideAddr1 = outsideGeneratedAddress[1];
var outsideAddr2 = outsideGeneratedAddress[2];
var outsideAddr3 = outsideGeneratedAddress[3];
var outsideAddr4 = outsideGeneratedAddress[4];
var outsideAddr5 = outsideGeneratedAddress[5];
var outsideAddr6 = outsideGeneratedAddress[6];
var outsideAddr7 = outsideGeneratedAddress[7];
var outsideAddr8 = outsideGeneratedAddress[8];
var outsideAddr9 = outsideGeneratedAddress[9];
for (int i = 0; i < length1; i++)
{
  var omegaAtI = omega[i];
  double aa = 
   omegaAtI[outsideAddr0]
   + omegaAtI[outsideAddr1]
   + omegaAtI[outsideAddr2]
   + omegaAtI[outsideAddr3]
   + omegaAtI[outsideAddr4]
   + omegaAtI[outsideAddr5]
   + omegaAtI[outsideAddr6]
   + omegaAtI[outsideAddr7]
   + omegaAtI[outsideAddr8]
   + omegaAtI[outsideAddr9];

  double alphaOld = alpha;
  alpha = Math.Sqrt(alpha * alpha + aa * aa);

  var s = -aa / alpha;
  var c = alphaOld / alpha;

  for(int j = 0; j <= i; j++)
  {
    double oldU = u[j];
    var omegaAtIJ = omegaAtI[j];
    u[j] = c * oldU + s * omegaAtIJ;
    omegaAtI[j] = c * omegaAtIJ - s * oldU;
  }
}
Cine
A: 

Also consider the cost of marshalling data between managed and native code. C# execution is already pretty fast. You can also NGEN the assembly to generate a native image of it for faster execution.
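
For example (assembly name hypothetical):

ngen install MyMathLib.dll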

nitroxn
+2  A: 

.NET interop with unmanaged code is very slow, but you can still get the benefits of unmanaged memory by using the system API to allocate it.

You can call VirtualAlloc to allocate memory pages and then call VirtualLock to pin them in RAM so they are not swapped out.

This approach lets you perform calculations over large amounts of data at least three times faster than you could in managed memory.
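
A hedged P/Invoke sketch of that approach (the helper shown is illustrative, not a drop-in replacement for the question's arrays):

using System;
using System.Runtime.InteropServices;

static unsafe class NativeMemory
{
    const uint MEM_COMMIT = 0x1000, MEM_RESERVE = 0x2000, PAGE_READWRITE = 0x04;

    [DllImport("kernel32.dll", SetLastError = true)]
    static extern IntPtr VirtualAlloc(IntPtr lpAddress, UIntPtr dwSize,
                                      uint flAllocationType, uint flProtect);

    [DllImport("kernel32.dll", SetLastError = true)]
    static extern bool VirtualLock(IntPtr lpAddress, UIntPtr dwSize);

    // Allocates 'count' doubles in unmanaged memory and pins the pages in RAM.
    public static double* AllocLocked(int count)
    {
        var size = (UIntPtr)((ulong)count * sizeof(double));
        IntPtr p = VirtualAlloc(IntPtr.Zero, size,
                                MEM_COMMIT | MEM_RESERVE, PAGE_READWRITE);
        VirtualLock(p, size);
        return (double*)p;
    }
}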

Andrew_B
+1  A: 

While most of the other answers suggest that you look into C# solutions, most of them miss a point: C code for this method will be faster, provided you use a good optimizing compiler (I'd suggest Intel's, which works great for this kind of code).
The compiler will also save the JIT a bit of work and will yield much better compiled output (even the MSVC compiler can generate SSE2 instructions). Array bounds won't be checked by default, there will probably be some loop unrolling, and, all in all, you're likely to see a significant performance boost.
As has been properly pointed out, calling into native code has a bit of overhead; this should, however, be insignificant compared to the speedup if length1 is big enough.
You can of course keep this code in C#, but please remember that, compared to several C compilers, the CLR (like all other VMs I know of) does little to optimize the generated code.
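
For completeness, the C# side of such a port is just a P/Invoke declaration along these lines (DLL, function name and flattened-array layout are hypothetical; double[] is blittable, so the arrays are pinned for the duration of the call rather than copied):

using System.Runtime.InteropServices;

static class NativeRotation
{
    [DllImport("rotation.dll", CallingConvention = CallingConvention.Cdecl)]
    public static extern void RunRotation(double[] u, double[] omegaFlat,
                                          int[] outsideGeneratedAddress,
                                          int length1, int length2,
                                          ref double alpha);
}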

emaster70
A: 

I have no idea how practical this is, but did you think of trying to run this on a GPU? Perhaps using something like OpenCL or DirectCompute?

The dependencies and the square root might kill you, but GPUs have an order of magnitude more raw floating-point performance than CPUs these days.

Actually, I thought about this before and tried a quick experiment using Microsoft Accelerator, but I found it to be much slower. This was because I had to constantly transfer the u array between the GPU and the CPU, as I needed it directly after the loop. And since I run this entire algorithm many times per second, the cost of transferring was too high.
Projectile Fish