views:

744

answers:

9

I am trying to write a function to determine whether two equal-size bitmaps are identical or not. The function I have right now simply compares a pixel at a time in each bitmap, returning false at the first non-equal pixel.

While this works, and works well for small bitmaps, in production I'm going to be using this in a tight loop and on larger images, so I need a better way. Does anyone have any recommendations?

The language I'm using is C# by the way - and yes, I am already using the .LockBits method. =)

Edit: I've coded up implementations of some of the suggestions given, and here are the benchmarks. The setup: two identical (worst-case) bitmaps, 100x100 in size, with 10,000 iterations each. Here are the results:

CompareByInts (Marc Gravell) :   1107ms
CompareByMD5  (Skilldrick)   :   4222ms
CompareByMask (GrayWizardX)  :    949ms

In CompareByInts and CompareByMask I'm using pointers to access the memory directly; in the MD5 method I'm using Marshal.Copy to retrieve a byte array and pass that as an argument to MD5.ComputeHash. CompareByMask is only slightly faster, but given the context I think any improvement is useful.

Thanks everyone. =)

Edit 2: Forgot to turn optimizations on - doing that gives GrayWizardX's answer even more of a boost:

CompareByInts   (Marc Gravell) :    944ms
CompareByMD5    (Skilldrick)   :   4275ms
CompareByMask   (GrayWizardX)  :    630ms
CompareByMemCmp (Erik)         :    105ms

Interesting that the MD5 method didn't improve at all.

Edit 3: Posted my answer (MemCmp) which blew the other methods out of the water. o.O

+4  A: 

Well, you're using .LockBits, so presumably you're using unsafe code. Rather than treating each row origin (Scan0 + y * Stride) as a byte*, consider treating it as an int*; int arithmetic is pretty quick, and you only have to do 1/4 as much work. And for images in ARGB you might still be talking in pixels, making the math simple.

Marc Gravell
I will be using ARGB images so this sounds like it might be the winner. I'll give it a shot and do some profiling. Well, some more profiling.
Erik Forbes
+4  A: 

Could you take a hash of each and compare? It would be slightly probabilistic, but practically not.

Thanks to Ram, here's a sample implementation of this technique.

Skilldrick
It won't fail fast, but it does work better if you have to compare 1 image to multiple candidates...
EricLaw -MSFT-
You could do a hybrid, and test a random sample of say 2% of pixels to see if they're the same, and if so move onto hashing...
Skilldrick
+1 for mentioning hash. Here is a sample implementation http://www.codeproject.com/KB/GDI-plus/comparingimages.aspx
ram
@Ram - Thanks, perfect example!
Skilldrick
+4  A: 

If you are trying to determine if they are 100% equal, you can invert one and add it to the other if its zero they are identical. Extending this using unsafe code, take 64 bits at a time as a long and do the math that way, any differences can cause an immediate fail.

If the images are not 100% identical (comparing png to jpeg), or if you are not looking for a 100% match then you have some more work ahead of you.

Good luck.

GrayWizardx
I am looking for 100% identical, so this method could work. Thanks =)
Erik Forbes
Isn't adding one pixel to the inverse of another and seeing if the result is zero equivalent or slower to comparing the two pixels and seeing if they're the same? Or am I missing something?
Skilldrick
He is using pointer arithmetic (i.e. unsafe code) and so he can treat the pointers as an array of fixed size types (i.e. longs) and do pure hardware math.
GrayWizardx
I see, makes sense.
Skilldrick
Also testing for zero is performs better than testing for any other constant, from what I recall.
Erik Forbes
You can't invert one and subtract it from the other. That would be like a - (-b). If a==b, then this will equal a*2. I have no problems understanding what you mean though, but you should get the method correct.
Lasse V. Karlsen
@Lasse, yes! Thanks for pointing that out. I have corrected it.
GrayWizardx
I think inversion is still OK, provided its all done with 8 bits.
ram
32-bits works, and is fastest in this context - so I've marked it as the answer. I'll be posting benchmark results shortly.
Erik Forbes
By the way - the relevant comparison is: `if ((~cursor1[0] ` - invert one, AND it to the other, and check the result. Alternatively I could XOR them together without inverting, and compare that result to zero. Either way it seems like the advantage this has over comparing for equality is that comparisons with zero are faster than comparing with any other constant.
Erik Forbes
A: 

If you can implement something like Duff's Device in your language, that might give you a significant speed boost over a simple loop. Usually it's used for copying data, but there's no reason it can't be used for comparing data instead.

Or, for that matter, you may just want to use some equivalent to memcmp().

rmeador
You can do loop unrolling in virtually any language (where it applies). You may not get such pretty and compact syntax as Duff's device, but the performance will be similar.
Martinho Fernandes
A: 

You could try to add them to a database "blob" then use the database engine to compare their binaries. This would only give you a yes or no answer to whether the binary data is the same. It would be very easy to make 2 images that produce the same graphic but have different binary though.

You could also select a few random pixels and compare them, then if they are the same continue with more until you've checked all the pixels. This would only return a faster negative match though, it still would take as long to find 100% positive matches

Drew
+1  A: 

If the original problem is just to find the exact duplicates among two bitmaps, then just a bit level comparison will have to do. I don't know C# but in C I would use the following function:

int areEqual (long size, long *a, long *b)
{
    long start = size / 2;
    long i;
    for (i = start; i != size; i++) { if (a[i] != b[i]) return 0 }
    for (i = 0; i != start; i++) { if (a[i] != b[i]) return 0 }
    return 1;
}

I would start looking in the middle because I suspect there is a much better chance of finding unequal bits near the middle of the image than the beginning; of course, this would really depend on the images you are deduping, selecting a random place to start may be best.

If you are trying to find the exact duplicates among hundreds of images then comparing all pairs of them is unnecessary. First compute the MD5 hash of each image and place it in a list of pairs (md5Hash, imageId); then sort the list by the m5Hash. Next, only do pairwise comparisons on the images that have the same md5Hash.

Jeff Kubina
+1 for starting at the middle
Stefan Monov
A: 

If these bitmaps are already on your graphics card then you can parallelize such a check by doing it on the graphics card using a language like CUDA or OpenCL.

I'll explain in terms of CUDA, since that's the one I know. Basically CUDA lets you write general purpose code to run in parallel across each node of your graphics card. You can access bitmaps that are in shared memory. Each invocation of the function is also given an index within the set of parallel runs. So, for a problem like this, you'd just run one of the above comparison functions for some subset of the bitmap - using parallelization to cover the entire bitmap. Then, just write a 1 to a certain memory location if the comparison fails (and write nothing if it succeeds).

If you don't already have the bitmaps on your graphics card, this probably isn't the way to go, since the costs for loading the two bitmaps on your card will easily eclipse the savings such parallelization will gain you.

Here's some (pretty bad) example code (it's been a little while since I programmed CUDA). There's better ways to access bitmaps that are already loaded as textures, but I didn't bother here.

// kernel to run on GPU, once per thread
__global__ void compare_bitmaps(long const * const A, long const * const B, char * const retValue, size_t const len)
{
 // divide the work equally among the threads (each thread is in a block, each block is in a grid)
 size_t const threads_per_block = blockDim.x * blockDim.y * blockDim.z;
 size_t const len_to_compare = len / (gridDim.x * gridDim.y * gridDim.z * threads_per_block);
# define offset3(idx3,dim3)  (idx3.x + dim3.x * (idx3.y + dim3.y * idx3.z))
 size_t const start_offset = len_to_compare * (offset3(threadIdx,blockDim) + threads_per_block * offset3(blockIdx,gridDim));
 size_t const stop_offset = start_offset + len_to_compare;
# undef offset3

 size_t i;
 for (i = start_offset; i < stop_offset; i++)
 {
  if (A[i] != B[i]) 
  {
   *retValue = 1;
   break;
  }
 }
 return;
}
rampion
+7  A: 

Reading this answer to a question regarding comparing byte arrays has yielded a MUCH FASTER method: using P/Invoke and the memcmp API call in msvcrt. Here's the code:

[DllImport("msvcrt.dll")]
private static extern int memcmp(IntPtr b1, IntPtr b2, long count);

public static bool CompareMemCmp(Bitmap b1, Bitmap b2)
{
    if ((b1 == null) != (b2 == null)) return false;
    if (b1.Size != b2.Size) return false;

    var bd1 = b1.LockBits(new Rectangle(new Point(0, 0), b1.Size), ImageLockMode.ReadOnly, PixelFormat.Format32bppArgb);
    var bd2 = b2.LockBits(new Rectangle(new Point(0, 0), b2.Size), ImageLockMode.ReadOnly, PixelFormat.Format32bppArgb);

    try
    {
        IntPtr bd1scan0 = bd1.Scan0;
        IntPtr bd2scan0 = bd2.Scan0;

        int stride = bd1.Stride;
        int len = stride * b1.Height;

        return memcmp(bd1scan0, bd2scan0, len) == 0;
    }
    finally
    {
        b1.UnlockBits(bd1);
        b2.UnlockBits(bd2);
    }
}
Erik Forbes
A: 

Bitwise exclusive OR the two bit strings and count the number of 1s (also know as Hamming Distance). Zero means they're identical. This has the added benefit (or not) of telling you how many pixels differ.

I don't know C#, so please consider the following perl snippet as pseudo-code:

sub bitmap_to_string {
    my ($file_name) = @_;

    open my $BITMAP, '<', $file_name or croak "Can't read file_name: $!";
    binmode $BITMAP;

    my ($buffer, $bit_string);
    while ( read $BITMAP, $buffer, 2**16
            and $bit_string .= $buffer)
        {}
    close $BITMAP or croak "Can't close $file_name: $!";

    return $bit_string;
}

sub _hamming_distance {
    my ($bit_string_1, $bit_string_2) = @_;
    return ($bit_string_1 ^ bit_string_2) =~ tr/\0//c;
}
Pedro Silva