I need to resample large sets of data (a few hundred spectra, each containing a few thousand points) using simple linear interpolation.

I have created an interpolation method in C#, but it seems to be really slow for huge datasets.

How can I improve the performance of this code?

public static List<double> interpolate(IList<double> xItems, IList<double> yItems, IList<double> breaks)
{
    double[] interpolated = new double[breaks.Count];
    int id = 1;
    int x = 0;

    // left border case - uphold the value
    while (breaks[x] < xItems[0])
    {
        interpolated[x] = yItems[0];
        x++;
    }

    double p, w;
    for (int i = x; i < breaks.Count; i++)
    {
        while (breaks[i] > xItems[id])
        {
            id++;
            if (id > xItems.Count - 1)
            {
                id = xItems.Count - 1;
                break;
            }
        }

        System.Diagnostics.Debug.WriteLine(string.Format("i: {0}, id {1}", i, id));

        if (id <= xItems.Count - 1)
        {
            if (id == xItems.Count - 1 && breaks[i] > xItems[id])
            {
                interpolated[i] = yItems[yItems.Count - 1];
            }
            else
            {
                w = xItems[id] - xItems[id - 1];
                p = (breaks[i] - xItems[id - 1]) / w;
                interpolated[i] = yItems[id - 1] + p * (yItems[id] - yItems[id - 1]);
            }
        }
        else // right border case - uphold the value
        {
            interpolated[i] = yItems[yItems.Count - 1];
        }
    }
    return interpolated.ToList();
}

Edit
Thanks, guys, for all your responses. What I wanted to achieve when I wrote this question was to get some general ideas about where I could improve the performance. I didn't expect any ready-made solutions, only some ideas. And you gave me what I wanted, thanks!

Before writing this question I thought about rewriting this code in C++, but after reading the comments on Will's answer it seems the gain could be smaller than I expected.

Also, the code is so simple that there are no mighty code tricks to use here. Thanks to Petar for his attempt to optimize the code.

It seems it all comes down to finding a good profiler, checking every line and subroutine, and trying to optimize them.

Thank you again for all the responses and for taking part in this discussion!

A: 

This is the kind of problem where you need to move over to native code.

Will
Native code? What do you mean?
Gacek
He means to do it in unmanaged code.
Sam Holder
For such pure arithmetic the CLR is no slower than native code - it will be compiled into very similar machine code in both cases. Managed code can even be faster due to faster memory allocation. SIMD may give a performance boost, but it requires special care; you won't get good SIMD utilization in generic C code.
ima
Thanks for the explanation, I will try to check it :)
Gacek
@Will: do you really think that operations on doubles and arrays are significantly faster in C (or any unmanaged language of your choice)? I wouldn't recommend switching to unmanaged code without profiling the existing code first.
Igor Korkhov
The accesses would not be bounds-checked, for example.
Will
...and there are a lot of virtual method calls, though I doubt that you would see more than a couple of ms for 10'000 iterations. Then again the OP asked how to improve performance and unmanaged code is definitely a possibility. Just watch out for those dreaded unmanaged-managed transitions or they'll eat up all the performance gained by the beautiful C code :-P
SealedSun
On bounds checking: even if it were noticeable, it's not related to managed or native code. You can have bounds checking in C++ and you can avoid it in C# with an unsafe block. (Besides, I believe the CLR can optimize away excessive bounds checking when arrays are accessed sequentially.)
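
A minimal sketch of what such an unsafe block could look like (the method name is made up; purely illustrative):

public static unsafe double SumUnchecked(double[] values)
{
    double sum = 0;
    fixed (double* p = values)          // pin the array and obtain a raw pointer
    {
        for (int i = 0; i < values.Length; i++)
            sum += p[i];                // pointer access: no per-element bounds check is emitted
    }
    return sum;                         // requires compiling with /unsafe
}
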
ima
+1  A: 

I'd say profile the code and see where it spends its time; then you have something to focus on.

ANTS is popular, but Equatec is free, I think.

Sam Holder
A: 
System.Diagnostics.Debug.WriteLine(string.Format("i: {0}, id {1}", i, id));

I hope it's a release build without DEBUG defined?

Other than that, it might depend on what exactly those IList parameters are. It may be useful to store the Count value in a local instead of accessing the property every time.
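
A quick illustrative sketch of both points - Debug.WriteLine disappearing from builds without the DEBUG symbol, and caching Count in a local (the helper name is made up):

static void LogBreaks(IList<double> breaks)
{
    int breakCount = breaks.Count;   // read the Count property once instead of on every iteration
    for (int i = 0; i < breakCount; i++)
    {
        // Debug.WriteLine is marked [Conditional("DEBUG")], so in a build without the DEBUG
        // symbol the whole call, including the string.Format argument, is compiled away.
        System.Diagnostics.Debug.WriteLine(string.Format("i: {0}, value {1}", i, breaks[i]));
    }
}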

ima
Yes, as I've commented somewhere... ;)
Gacek
+2  A: 
public static List<double> Interpolate(IList<double> xItems, IList<double> yItems, IList<double> breaks)
{
    var a = xItems.ToArray();
    var b = yItems.ToArray();

    var aLimit = a.Length - 1;
    var bLimit = b.Length - 1;

    var interpolated = new double[breaks.Count];

    var total = 0;
    var initialValue = a[0];
    while (breaks[total] < initialValue)
    {
        total++;
    }
    // left border case - hold the first value for every break before xItems[0]
    for (int j = 0; j < total; j++)
    {
        interpolated[j] = b[0];
    }

    int id = 1;
    for (int i = total; i < breaks.Count; i++)
    {
        var breakValue = breaks[i];

        while (breakValue > a[id])
        {
            id++;
            if (id > aLimit)
            {
                id = aLimit;
                break;
            }
        }

        double value = b[bLimit];

        if (id <= aLimit)
        {
            var currentValue = a[id];
            var previousValue = a[id - 1];
            if (id != aLimit || breakValue <= currentValue)
            {
                var w = currentValue - previousValue;
                var p = (breakValue - previousValue) / w;
                value = b[id - 1] + p * (b[id] - b[id - 1]);
            }
        }

        interpolated[i] = value;
    }
    return interpolated.ToList();
}

I've cached some (const) values in locals and pre-filled the left-border values, but I think these are micro-optimizations that are already made by the compiler in Release mode. However, you can try this version and see if it beats the original version of the code.

Petar Petrov
There is almost no difference, but thank you anyway, your code is a little bit cleaner :)
Gacek
+2  A: 

Instead of

interpolated.ToList()

which copies the whole array, you could compute the interpolated values directly in the final list (or simply return that array instead). Especially if the array/List is big enough to qualify for the large object heap.

Unlike the ordinary heap, the LOH is not compacted by the GC, which means that short-lived large objects are far more harmful than small ones.

Then again: 7000 doubles are approx. 56'000 bytes, which is below the large object threshold of 85'000 bytes.
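
A minimal sketch of that change, assuming the caller can accept a double[] (the method name is made up):

// Fill the result buffer in place and return it as-is instead of copying it with ToList().
public static double[] InterpolateToArray(IList<double> xItems, IList<double> yItems, IList<double> breaks)
{
    double[] interpolated = new double[breaks.Count];
    // ... same interpolation loop as in the question, writing into interpolated[i] ...
    return interpolated;   // no second allocation, no copy at the end
}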

SealedSun
+1  A: 

Looks to me like you've created an O(n^2) algorithm. You are searching for the interval, which is O(n), and you probably do it n times. You'll get a quick and cheap speed-up by taking advantage of the fact that the items are already ordered in the list. Use BinarySearch(); that's O(log(n)).

If still necessary, you should be able to do something speedier with the outer loop; whatever interval you found previously should make it easier to find the next one. But that code isn't in your snippet.
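
A minimal sketch of the interval lookup with Array.BinarySearch, assuming the x values have been copied into a sorted double[] first (the helper name is made up):

// Returns the index of a matching element when value is present; otherwise the index of
// the first element larger than value (or xs.Length when every element is smaller).
// xs must be sorted ascending; this is O(log n) per break instead of a linear scan.
static int FindInterval(double[] xs, double value)
{
    int idx = Array.BinarySearch(xs, value);
    return idx >= 0 ? idx : ~idx;   // ~idx is the insertion point when the value is not found
}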

Hans Passant
+1  A: 

Hi,

A few suggestions:

  • as others suggested, use a profiler to better understand where the time is spent.

  • the loop

while (breaks[x] < xItems[0])

could throw an exception if x grows larger than the number of items in the "breaks" list. You should use something like

while (x < breaks.Count && breaks[x] < xItems[0])

But you might not need that loop at all. Why treat the first item as a special case? Just start with id = 0 and handle the first point in the for(i) loop. I understand that id might start from 0 in this case, and [id-1] would be a negative index, but see if you can do something there (see the sketch at the end of this answer).

  • If you want to optimize for speed then you sacrifice memory, and vice versa; you usually cannot have both unless you come up with a really clever algorithm. In this case, it means calculating as much as you can outside the loops, storing those values in variables (extra memory) and using them later. For example, instead of always saying:

id = xItems.Count - 1;

You could say:

int lastXItemsIndex = xItems.Count-1;
...
id = lastXItemsIndex;

This is the same suggestion Petar Petrov made with aLimit, bLimit, ...

  • next point, your loop (or the one Petar Petrov suggested):
while (breaks[i] > xItems[id])
{
  id++;
  if (id > xItems.Count - 1)
  {
    id = xItems.Count - 1;
    break;
  }
}

could probably be reduced to:

double currentBreak = breaks[i];

while (id <= lastXItemsIndex && currentBreak > xItems[id]) id++;

  • and the last point I would add is to check whether there is some property of your samples that is special to your problem. For example, if xItems represents time and you are sampling at regular intervals, then

w = xItems[id] - xItems[id - 1];

is constant, and you do not have to calculate it every time in the loop.

This is probably not often the case, but maybe your problem has some other property which you could use to improve performance.

Another idea: maybe you do not need double precision; "float" is probably faster because it is smaller.

Good luck
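
A minimal sketch of the idea from the second bullet - folding the left and right border cases into the single main loop. Names and structure are illustrative only and assume, like the original code, that xItems, yItems and breaks are all sorted ascending:

public static double[] InterpolateMerged(IList<double> xItems, IList<double> yItems, IList<double> breaks)
{
    int lastXIndex = xItems.Count - 1;
    double[] interpolated = new double[breaks.Count];
    int id = 1;
    for (int i = 0; i < breaks.Count; i++)
    {
        double br = breaks[i];
        while (id < lastXIndex && br > xItems[id]) id++;   // bounded scan: id never leaves the array

        if (br <= xItems[0])
            interpolated[i] = yItems[0];                   // left border: hold the first value
        else if (br >= xItems[lastXIndex])
            interpolated[i] = yItems[lastXIndex];          // right border: hold the last value
        else
        {
            double w = xItems[id] - xItems[id - 1];
            double p = (br - xItems[id - 1]) / w;
            interpolated[i] = yItems[id - 1] + p * (yItems[id] - yItems[id - 1]);
        }
    }
    return interpolated;
}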

Thanks Zarko. One thing: `w` is not always constant. It is the distance between two consecutive samples, which is not always the same. That's why I am calculating it each time. And changing to floats could be a good idea, I will try it.
Gacek