views:

806

answers:

15

Hi, I'd like to optimize this piece of code:

public void PopulatePixelValueMatrices(GenericImage image, int Width, int Height)
{
    for (int x = 0; x < Width; x++)
    {
        for (int y = 0; y < Height; y++)
        {
            Byte pixelValue = image.GetPixel(x, y).B;
            this.sumOfPixelValues[x, y] += pixelValue;
            this.sumOfPixelValuesSquared[x, y] += pixelValue * pixelValue;
        }
    }
}

This is to be used for image processing, and we're currently running it on about 200 images. We've optimized the GetPixel method to use unsafe code, and we're not using image.Width or image.Height, as those properties were adding to our runtime costs.

However, we're still stuck at a low speed. The problem is that our images are 640x480, so the middle of the loop is being called about 640x480x200 times. I'd like to ask if there's a way to speed it up somehow, or convince me that it's fast enough as it is. Perhaps a way is through some fast Matrix Addition, or is Matrix Addition inherently an n^2 operation with no way to speed it up?

Perhaps doing array accesses via unsafe code would speed it up, but I'm not sure how to go about doing it, and whether it would be worth the time. Probably not. Thanks.

EDIT: Thank you for all your answers.

This is the GetPixel method we're using:

public Color GetPixel(int x, int y)
{
    int offsetFromOrigin = (y * this.stride) + (x * 3);
    unsafe
    {
        return Color.FromArgb(this.imagePtr[offsetFromOrigin + 2],
                              this.imagePtr[offsetFromOrigin + 1],
                              this.imagePtr[offsetFromOrigin]);
    }
}
A: 

Matrix addition is of course an n^2 operation, but you can speed it up by using unsafe code, or at least by using jagged arrays instead of multidimensional ones.
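
For illustration, a minimal sketch of the jagged-array layout (hypothetical, reusing the image/Width/Height names from the question):

// Jagged arrays ([][]) are often faster to index than rectangular ([,]) arrays,
// because each row is an ordinary 1-D array with cheaper bounds checks.
long[][] sumOfPixelValues = new long[Width][];
for (int x = 0; x < Width; x++)
{
    sumOfPixelValues[x] = new long[Height];
}

// The inner loop can then hoist the row reference out:
for (int x = 0; x < Width; x++)
{
    long[] row = sumOfPixelValues[x];
    for (int y = 0; y < Height; y++)
    {
        row[y] += image.GetPixel(x, y).B;
    }
}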

Henrik
No, matrix addition is linear to the number of elements, so O(n) or O(n*m) for a 2d (n x m) matrix.
kigurai
If you have two m x n matrices then m*n additions are needed. If n = m then n^2, right?
dfa
I was referring to n x n matrices. I thought this was obvious. Maybe it wasn't.
Henrik
@Henrik, @dfa: The time complexity is O(N), linear. It depends on the number of elements, not the number of rows. There is nothing in the problem formulation that says that the matrix is square.
kigurai
Yes, I should have written m*n instead of n^2. I took the n^2 from the OP. I'm sure nobody here thinks matrix addition is O(N^2), where N is the number of elements.
Henrik
Yes, I mean n^2 if m = n, obviously.
Jean Azzopardi
+2  A: 

I recommend that you profile this code and find out what's taking the most time.

You may find that it's the subscripting operation, in which case you might want to change your data structures from:

long[,] sumOfPixelValues = new long[n, m];
long[,] sumOfPixelValuesSquared = new long[n, m];

to

struct Sums
{
    public long sumOfPixelValues;
    public long sumOfPixelValuesSquared;
}

Sums[,] sums = new Sums[n, m];

This would depend on what you find once you profile the code.
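
For illustration, a sketch of how the update loop might look with the combined struct (hypothetical, reusing the names from the question):

for (int x = 0; x < Width; x++)
{
    for (int y = 0; y < Height; y++)
    {
        Byte pixelValue = image.GetPixel(x, y).B;
        // Both sums for a cell live in the same array element,
        // so the two values sit adjacent in memory.
        sums[x, y].sumOfPixelValues += pixelValue;
        sums[x, y].sumOfPixelValuesSquared += pixelValue * pixelValue;
    }
}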

John Saunders
+1  A: 

Where are the images stored? If each is on disk, then a bit of your processing time issue may be in fetching them from the disk. You might examine this to see if it is an issue, and if so, rewrite to pre-fetch the image data so that the array processing code does not have to wait for the data...

If the overall application logic will allow it (is each matrix addition independent, or dependent on the output of a previous matrix addition?), and they are independent, I'd examine executing them all on separate threads, or in parallel.
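
For instance, a minimal double-buffering sketch of the prefetch idea (hypothetical: LoadImage and paths are assumed helpers, and PopulatePixelValueMatrices is the method from the question; requires System.Threading):

// Load image n+1 on a worker thread while image n is being processed.
GenericImage next = LoadImage(paths[0]);
for (int i = 0; i < paths.Length; i++)
{
    GenericImage current = next;
    GenericImage prefetched = null;
    Thread loader = null;
    if (i + 1 < paths.Length)
    {
        int nextIndex = i + 1;
        loader = new Thread(() => prefetched = LoadImage(paths[nextIndex]));
        loader.Start();
    }

    PopulatePixelValueMatrices(current, Width, Height);

    if (loader != null)
    {
        loader.Join();          // wait for the prefetch to finish
        next = prefetched;
    }
}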

Charles Bretana
Actually, getting the images from disk proved to be pretty fast.
Jean Azzopardi
+1  A: 

The only possible way I can think of to speed it up would be to try to do some of the additions in parallel, which at your sizes might be worth the threading overhead.

Yuriy Faktorovich
+2  A: 

Code profiling is the best place to start.

Matrix addition is a highly parallel operation and can be sped up by parallelizing the operation with multiple threads.

I would recommend using Intel's IPP library, which contains a threaded, highly optimized API for this sort of operation. Perhaps surprisingly, it's only about $100, but it would add significant complexity to your project.

If you don't want to trouble yourself with mixed-language programming and IPP, you could try out CenterSpace's C# math libraries. The NMath API contains easy-to-use, forward-scaling matrix operations.

Paul
Thanks, we actually got around to optimizing this by profiling the code with EQATEC profiler, so it's very good advice indeed. +1
Jean Azzopardi
Great! I've had good luck w/ Ants in the past.
Paul
+17  A: 

Despite using unsafe code, GetPixel may well be the bottleneck here. Have you looked at ways of getting all the pixels in the image in one call rather than once per pixel? For instance, Bitmap.LockBits may be your friend...
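
For illustration, here's a minimal sketch of that approach (assuming a System.Drawing Bitmap in 24bpp format rather than your custom GenericImage; B is the first byte of each BGR triplet):

// Requires System.Drawing and System.Drawing.Imaging.
BitmapData data = bitmap.LockBits(new Rectangle(0, 0, bitmap.Width, bitmap.Height),
                                  ImageLockMode.ReadOnly, PixelFormat.Format24bppRgb);
try
{
    unsafe
    {
        byte* row = (byte*)data.Scan0;
        for (int y = 0; y < data.Height; y++)
        {
            byte* pixel = row;
            for (int x = 0; x < data.Width; x++)
            {
                byte blue = pixel[0];       // B comes first in 24bppRgb
                // ... accumulate sums here ...
                pixel += 3;
            }
            row += data.Stride;             // stride includes any row padding
        }
    }
}
finally
{
    bitmap.UnlockBits(data);
}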

On my netbook, a very simple loop iterating 640 * 480 * 200 times only takes about 100 milliseconds - so if you're finding it's all going slowly, you should take another look at the bit inside the loop.

Another optimisation you might want to look at: avoid multi-dimensional arrays. They're significantly slower than single-dimensional arrays.

In particular, you can have a single-dimensional array of size Width * Height and just keep an index:

int index = 0;
for (int x = 0; x < Width; x++)
{
    for (int y = 0; y < Height; y++)
    {
        Byte pixelValue = image.GetPixel(x, y).B;
        this.sumOfPixelValues[index] += pixelValue;
        this.sumOfPixelValuesSquared[index] += pixelValue * pixelValue;
        index++;
    }
}

Using the same simple test harness, adding a write to a 2-D rectangular array took the total time of looping over 200 * 640 * 480 up to around 850ms; using a 1-D rectangular array took it back down to around 340ms - so it's somewhat significant, and currently you've got two of those per loop iteration.

Jon Skeet
That rings a bell... I'm sure I have had to avoid GetPixel before.
GraemeF
GetPixel is quite an issue. It does checking to make sure that the pixels are in bounds, etc. Get the image information and do the value parsing on your own. This can make the process [for an image] go from ~4 min to less than 10 seconds. There is something really funny going on with GetPixel.
monksy
+1, especially for mentioning GetPixel. I've just run into the same issue myself.
MikeJ-UK
I would also add that you could put this code in an unsafe block to avoid bounds checking on the arrays. This is a significant overhead when using arrays in C#.
Paul
@Paul: Unsafe code still performs array bounds checking.
Jon Skeet
Thank you, Mr. Skeet. By using LockBits, we achieved a very good speed up indeed. Your high SO rep is well deserved.
Jean Azzopardi
@Jon Skeet: Are you sure that "unsafe code still performs array bounds checking"? It was my understanding that this was definitely not the case!
ChloeRadshaw
@Jon Skeet, Chloe : Bounds checking is performed if you index the array, not if you sweep through it using a pinning pointer. But bounds checking is also optimized away in safe code if the for loop is written to test against the Length member of the array being indexed.
Ben Voigt
It's not the change from 2-D to 1-D that makes the difference, it's the way you move through the array. Your 1-D code definitely moves through sequentially. You've effectively interleaved the data from row-major to column-major to make that happen (formerly index was y*Width + x, now it is x*Height + y). Improving locality of reference is a real win on CPU cache usage, but re-interleaving the result arrays is the wrong way to do it, you're still accessing the row-major source data badly. Interleave source and destination the same and reverse the loop nesting.
Ben Voigt
Note that I'm not saying that 2-D arrays are as fast as 1-D arrays, just that the speedup you encountered has another cause. Reversing the loop nesting AND conversion to a 1-D array may well be faster yet.
Ben Voigt
@Ben: I see what you're saying, but actually all of the speed-up from 850 to 340 was from going from 2D to 1D - it was already sweeping through appropriately. Look at the original code again - it was incrementing y in the inner loop, and accessing `array[x, y]` - it was already `x * Height + y`, basically. Changing it to be `array[y, x]` does indeed make it much slower though.
Jon Skeet
I agree with Ben, this is only faster because you're sequentially accessing the sumOf... arrays, and in the process, transposing them - (0,0) maps to (0,0); (0,1)->(1,0); (0,2)->(2,0) since incrementing through a 1D array is like doing: for y { for x { [x,y] } }. Swap the X and Y loops to untranspose them (and get another speed up).
Skizz
@Skizz: I tested this before posting the reply. If you change the loop order, it slows it down. The code in the OP's question is already accessing [0, 0], [0, 1], [0, 2] etc, which is the *fast* way.
Jon Skeet
(Note that I'm talking about the *array* addressing rather than pixel access. Reversing both the array sense *and* the iteration order may well help, but not just one of them.)
Jon Skeet
Thanks for correcting me on storage in 2-D arrays. Now I've got to fix my blog post to make source and destination use the same interleaving.
Ben Voigt
Yikes, I didn't realize I'd sparked a blog post, and some other questions too :D
Jean Azzopardi
I had my maths hat on there ([x,y] and [x,y+1] are one row apart) not my programming one (where offset = (a * b_size + b) * c_size + c ...). Doh!
Skizz
A: 

Matrix addition's complexity is O(n^2) in the number of additions (for an n x n matrix).

However, since there are no intermediate results, you can parallelize the additions using threads:

  1. it is easy to prove that the resulting algorithm will be lock-free
  2. you can tune the optimal number of threads to use
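
For example, a sketch splitting the rows across threads (hypothetical, reusing the names from the question; each thread writes a disjoint slice of the output arrays, so no locking is needed; requires System and System.Threading):

int threadCount = Environment.ProcessorCount;
Thread[] workers = new Thread[threadCount];
int rowsPerThread = Height / threadCount;

for (int t = 0; t < threadCount; t++)
{
    int startY = t * rowsPerThread;
    int endY = (t == threadCount - 1) ? Height : startY + rowsPerThread;
    workers[t] = new Thread(() =>
    {
        // Each thread handles rows [startY, endY), so no two threads
        // ever touch the same output cell.
        for (int y = startY; y < endY; y++)
        {
            for (int x = 0; x < Width; x++)
            {
                Byte pixelValue = image.GetPixel(x, y).B;
                sumOfPixelValues[x, y] += pixelValue;
                sumOfPixelValuesSquared[x, y] += pixelValue * pixelValue;
            }
        }
    });
    workers[t].Start();
}
foreach (Thread worker in workers)
{
    worker.Join();
}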
dfa
I always thought matrix addition was O(N). Adding 6x3 matrices takes twice as long as adding 3x3 matrices. The problem of adding matrices can be reduced to the problem of adding arrays which can be solved in linear time.
Martinho Fernandes
If you have two m x n matrices then m*n additions are needed. If n = m then n^2, right?
dfa
@dfa: What's a matrix with double the size of an MxN matrix? A (2*M)xN matrix or a (2*M)x(2*N) matrix?
Martinho Fernandes
a (2*m)x(2*n) matrix; when n=m you need (2*n)^2 additions
dfa
"a (2*m)x(2xn) matrix" That's where our opinions differ. To me a matrix twice as large as a mxn matrix is a (2*m)xn matrix. A (2*m)x(2*n) matrix is quadruple the size of an mxn matrix. So, by our reasoning you would say adding 3-dimensional matrices is runs in cubic time?
Martinho Fernandes
of course.. :)))
dfa
however math is not an opinion, you need an addition for each matrix element, no escape :)
dfa
@dfa: And if you double the number of elements, you ~double the running time. If you 3x the number of elements, you ~3x the running time. Linear. Is my reasoning wrong?
Martinho Fernandes
the number of additions is quadratic; if you are running timed tests your result is definitely an approximation since big-O notation is asymptotic
dfa
I know about asymptotic complexity. It's not about the number of operations but how that number *grows* with the size of the input. We disagree on the size of the input. I won't say you're wrong, but I can't see why I should be wrong.
Martinho Fernandes
No. To add B to B you need 2m*n additions. To add A to A you need m*n additions. Double the size, double the additions. Linear.
Martinho Fernandes
**ouff** when m = n.. you get 2*n^2.. it is still quadratic. m * n is linear only if m = 1 or n = 1
dfa
I'll stick with: it depends on your notion of the size of the input.
Martinho Fernandes
I've used your formula: 2m*n
dfa
We're both right. Here: http://stackoverflow.com/questions/1870336/what-is-the-complexity-of-matrix-addition
Martinho Fernandes
It's linear in the width AND linear in the height so for a square matrix it would be quadratic in the side length (which is both width and height). However the matrix under discussion isn't square.
Ben Voigt
A: 

About the only way to effectively speed up your matrix multiplication is to use the right algorithm; there are more efficient approaches than the naive one. Take a look at the Strassen and Coppersmith-Winograd algorithms. It is also noted [in the previous replies] that you can parallelize the code, which helps quite a bit.

monksy
Good luck with Strassen's algorithm. My educated guess is that the overhead in book-keeping it requires will cost more than any saving of flops on images as small as yours. If anyone cares to post hard data (and a C# implementation) to the contrary I'll eat my hat. Coppersmith-Winograd looks even less likely to be worth implementing.
High Performance Mark
agreed with @High-P
Yin Zhu
Yea.. I was just going to suggest that there is more than one way to do a matrix multiply.
monksy
A: 

I'm not sure if it's faster, but you could write something like:

public void PopulatePixelValueMatrices(GenericImage image, int Width, int Height)
{
    Byte pixelValue;
    for (int x = 0; x < Width; x++)
    {
        for (int y = 0; y < Height; y++)
        {
            pixelValue = image.GetPixel(x, y).B;
            this.sumOfPixelValues[x, y] += pixelValue;
            this.sumOfPixelValuesSquared[x, y] += pixelValue * pixelValue;
        }
    }
}
Silmaril
That won't be faster - just harder to read.
Jon Skeet
A: 

This is a classic case of micro-optimisation failing horribly. You're not going to get anything from looking at that loop. To get real speed benefits you need to start off by looking at the big picture:

  • Can you asynchronously preload image[n+1] whilst processing image[n]?
  • Can you load just the B channel from the image? This will decrease memory bandwidth.
  • Can you load the B value and update the sumOfPixelValues(Squared) arrays directly, i.e. read the file and update instead of read file, store, read, update? Again, this decreases memory bandwidth.
  • Can you use one dimensional arrays instead of two dimensional? Maybe create your own array class that works either way.
  • Perhaps you could look into using Mono and the SIMD extensions?
  • Can you process the image in chunks and assign them to idle CPUs in a multi-cpu environment?

EDIT:

Try having specialised image accessors so you're not wasting memory bandwidth:

public byte GetBPixel (int x, int y)
{
    int offsetFromOrigin = (y * this.stride) + (x * 3);
    unsafe
    {
        // B is the first byte of each BGR triplet (see the GetPixel in the question)
        return this.imagePtr [offsetFromOrigin];
    }
}

or, better still:

public byte GetBPixel (int offset)
{
    unsafe
    {
        return this.imagePtr [offset];
    }
}

and use the above in a loop like:

for (int start_offset = 0, y = 0 ; y < Height ; start_offset += stride, ++y)
{
   for (int x = 0, offset = start_offset ; x < Width ; offset += 3, ++x)
   {
      byte pixel = GetBPixel (offset);
      // do stuff
   }
}
Skizz
I disagree, GetPixel() is notorious for being slow and therefore it makes sense to avoid its use if possible.
Ian
@Ian: The code given has 'GenericImage image' as the parameter which AFAIK is not a standard .net framework class, so it's possibly a more efficient image processing class, perhaps a third party library, with a decent implementation of GetPixel.
Skizz
Yes, we did our own Image library (basically, building on what was already there.)
Jean Azzopardi
considering CPU cache effects may be a micro-optimization, but it's a very effective one. All three arrays (image data and both outputs) are about 1 megabyte in size, by accessing them randomly you're going to be hitting either L3 cache or main memory continuously. By iterating in the correct order, you'll use L1 cache which is fully an order of magnitude faster.
Ben Voigt
In addition to Ben's comment, the later Pentiums can detect sequentially accessed memory and prefetch the data, reducing the chance of cache misses. IIRC it can detect up to four streams of data.
Skizz
+1  A: 

If you only do matrix addition, you might want to consider using multiple threads to speed it up, taking advantage of multi-core processors. Also use a one-dimensional index instead of two.

If you want to do more complicated operations, you need to use a highly optimized math library, like NMath.Net, which uses native code rather than .NET.

Yin Zhu
+5  A: 

Read this article, which also has some code and mentions the slowness of GetPixel.


From the article, here is code to simply invert the bits. It shows the usage of LockBits as well.

It is important to note that unsafe code requires full trust, so it will not run in partially trusted contexts (for example, when the code is executed remotely).

public static bool Invert(Bitmap b)
{
    BitmapData bmData = b.LockBits(new Rectangle(0, 0, b.Width, b.Height),
                                   ImageLockMode.ReadWrite, PixelFormat.Format24bppRgb);

    int stride = bmData.Stride;
    System.IntPtr Scan0 = bmData.Scan0;
    unsafe
    {
        byte* p = (byte*)(void*)Scan0;
        int nOffset = stride - b.Width * 3;   // padding at the end of each scan line
        int nWidth = b.Width * 3;             // bytes of pixel data per row
        for (int y = 0; y < b.Height; ++y)
        {
            for (int x = 0; x < nWidth; ++x)
            {
                p[0] = (byte)(255 - p[0]);
                ++p;
            }
            p += nOffset;
        }
    }

    b.UnlockBits(bmData);

    return true;
}

anirudhgarg
Good find. I'm doing similar tricks in Java - massive speedup!
Otto Allmendinger
Thanks, this came in useful
Jean Azzopardi
A: 

Sometimes doing things in native C#, even unsafe calls, is just slower than using methods that have already been optimized.

No results guaranteed, but you may want to investigate the System.Windows.Media.Imaging namespace and look at your whole problem in a different way.
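
For instance, an untested sketch along those lines, using BitmapSource to pull a whole frame's pixels in a single call (assumes WPF references, and path is a hypothetical absolute path to an image file):

// Requires System, System.Windows.Media and System.Windows.Media.Imaging.
// Decode the file and normalize to Bgr24 so B is the first byte of each pixel.
BitmapImage decoded = new BitmapImage(new Uri(path));
FormatConvertedBitmap bgr = new FormatConvertedBitmap(decoded, PixelFormats.Bgr24, null, 0);

int stride = bgr.PixelWidth * 3;
byte[] pixels = new byte[stride * bgr.PixelHeight];
bgr.CopyPixels(pixels, stride, 0);   // one call instead of Width*Height GetPixel calls

for (int y = 0; y < bgr.PixelHeight; y++)
{
    for (int x = 0; x < bgr.PixelWidth; x++)
    {
        byte blue = pixels[y * stride + x * 3];
        // ... accumulate sums here ...
    }
}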

Bytemaster
A: 

Although it's a micro-optimization and thus may not add much, you might want to study the likelihood of getting a zero when you do

Byte  pixelValue = image.GetPixel(x, y).B;

Clearly, if pixelValue is 0 then there's no reason to do the summations, so your routine might become

public void PopulatePixelValueMatrices(GenericImage image, int Width, int Height)
{
    for (int x = 0; x < Width; x++)
    {
        for (int y = 0; y < Height; y++)
        {
            Byte pixelValue = image.GetPixel(x, y).B;

            if (pixelValue != 0)
            {
                this.sumOfPixelValues[x, y] += pixelValue;
                this.sumOfPixelValuesSquared[x, y] += pixelValue * pixelValue;
            }
        }
    }
}

However, the question is how often you're going to see pixelValue=0, and whether the saving on the compute-and-store will offset the cost of the test.

Bob Jarvis
+2  A: 

System.Drawing.Color is a structure, which on current versions of .NET kills most optimizations. Since you're only interested in the blue component anyway, use a method that only gets the data you need.

public byte GetPixelBlue(int x, int y)
{
    int offsetFromOrigin = (y * this.stride) + (x * 3);
    unsafe
    {
        return this.imagePtr[offsetFromOrigin];
    }
}

Now, exchange the order of iteration of x and y:

public void PopulatePixelValueMatrices(GenericImage image,int Width, int Height)
{            
    for (int y = 0; y < Height; y++)
    {
        for (int x = 0; x < Width; x++)
        {
            Byte  pixelValue = image.GetPixelBlue(x, y);
            this.sumOfPixelValues[y, x] += pixelValue;
            this.sumOfPixelValuesSquared[y, x] += pixelValue * pixelValue;
        }
    }
}

Now you're accessing all values within a scan line sequentially, which will make much better use of the CPU cache for all three matrices involved (image.imagePtr, sumOfPixelValues, and sumOfPixelValuesSquared). [Thanks to Jon for noticing that when I fixed access to image.imagePtr, I broke the other two. Now the output array indexing is swapped to keep it optimal.]

Next, get rid of the member references. Another thread could theoretically be setting sumOfPixelValues to another array midway through, which does horrible horrible things to optimizations.

public void PopulatePixelValueMatrices(GenericImage image,int Width, int Height)
{          
    uint [,] sums = this.sumOfPixelValues;
    uint [,] squares = this.sumOfPixelValuesSquared;
    for (int y = 0; y < Height; y++)
    {
        for (int x = 0; x < Width; x++)
        {
            Byte  pixelValue = image.GetPixelBlue(x, y);
            sums[y, x] += pixelValue;
            squares[y, x] += (uint)(pixelValue * pixelValue);
        }
    }
}

Now the compiler can generate optimal code for moving through the two output arrays, and after inlining and optimization, the inner loop can step through the image.imagePtr array with a stride of 3 instead of recalculating the offset all the time. Now an unsafe version for good measure, doing the optimizations that I think .NET ought to be smart enough to do but probably isn't:

unsafe public void PopulatePixelValueMatrices(GenericImage image, int Width, int Height)
{
    byte* scanline = image.imagePtr;
    fixed (uint* sumsBase = &this.sumOfPixelValues[0, 0])
    fixed (uint* squaresBase = &this.sumOfPixelValuesSquared[0, 0])
    {
        // Pointers declared by a fixed statement are read-only,
        // so use working copies that can be incremented.
        uint* sums = sumsBase;
        uint* squares = squaresBase;
        for (int y = 0; y < Height; y++)
        {
            byte* blue = scanline;
            for (int x = 0; x < Width; x++)
            {
                byte pixelValue = *blue;
                *sums += pixelValue;
                *squares += (uint)(pixelValue * pixelValue);
                blue += 3;
                sums++;
                squares++;
            }
            scanline += image.stride;
        }
    }
}
Ben Voigt
Thanks for your answer. In the end, we used LockBits and accessed the raw structure of the image directly instead of using methods such as GetPixel or GetPixelBlue.
Jean Azzopardi
ack! squares had better be a UInt64 array since the maximum value is 255*255 * 640*480 > 2^32. That makes it over 2 megabytes in size which pretty well assures random access means hits on main memory... exchanging the loops is absolutely critical to getting good speed.
Ben Voigt
@Jean: from your version of GetPixel, which used imagePtr, I thought you were already using LockBits... where else did you get imagePtr?
Ben Voigt
oops, no reason for squares to be UInt64, it's not 640*480 pixels being put in each but the same pixel in 200 images, so UInt32 is fine. Still pretty big compared to cache though.
Ben Voigt
@Ben Voigt, I mean, I replaced the GetPixel() method and just used LockBits on the image to get the pixels directly, i.e. without using methods.
Jean Azzopardi