views:

1682

answers:

4

I am trying to compare two long bytearrays in vb.net and have run into a snag. Comparing two 50 meg files takes almost two mins so I'm clearly doing something wrong. I'm on an x64 machine with tons of memory so there are no issues there. Here is the code that I'm using at the moment and would like to change.

_Bytes and item.Bytes ar ethe two different arrays to compare and are already the same length.

For Each B In item.Bytes
   If B <> _Bytes(I) Then
        Mismatch = True
        Exit For
   End If
   I += 1
Next

I need to be able to compare as fast as possible files that are potentially hundreds of megs and even possibly a gig or two. Any suggests or algorithms that would be able to do this faster?

[Edit]

Item.bytes is an object taken from the DB/filesystem that is returned to compare because it’s byte length matches the item that the user wants to add. By comparing the two arrays I can then determine if the user has added something new to the DB and if not then I can just map them to the other file and not waste HD space.

[Update]

So I converted the arrays to local variables of Byte() and then did the same comparison, same code and it ran in like 1 sec (I have to benchmark it still and compare it to others) but if you do the same thing with local variables and use a generic array it becomes massively slower. I’m not sure why but it raises a lot more questions for me about the use of arrays.

+3  A: 

If you don't need to know the byte, uses 64bit ints that gives you 8 at once. Actually, you can figure out the wrong byte, once you've isolated it to a set of 8.

Use BinaryReader

saveTime = binReader.ReadInt32()

or for arrays of ints.

Dim count As Integer = binReader.Read(testArray, 0, 3)

sfossen
Can you explain that further please?
Middletone
use arrays of int instead or arrays of bytes.
sfossen
So given that these are byte arrays from a file how do I get them into this blocked format that you speak of?
Middletone
Either read the file as binary one 64-bit int at a time, or after reading 8 bytes, place them into a 64-bit int, using bit-shifts and bitwise-or.
greyfade
@Middletone: check the link, and use the BinaryReader.
sfossen
+9  A: 

What is the _Bytes(I) call doing? It's not loading the file each time, is it? Even with buffering, that would be bad news!

There will be plenty of ways to micro-optimise this in terms of looking at longs at a time, potentially using unsafe code etc - but I'd just concentrate on getting reasonable performance first. Clearly there's something very odd going on.

I suggest you extract the comparison code into a separate function which takes two byte arrays. That way you know you won't be doing anything odd. I'd also use a simple For loop rather than For Each in this case - it'll be simpler. Oh, and check whether the lengths are correct first :)

EDIT: Here's the code (untested, but simple enough) that I'd use. It's in C# for the minute - I'll convert it in a sec:

public static bool Equals(byte[] first, byte[] second)
{
    if (first == second)
    {
        return true;
    }
    if (first == null || second == null)
    {
        return false;
    }
    if (first.Length != second.Length)
    {
        return false;
    }
    for (int i=0; i < first.Length; i++)
    {
        if (first[i] != second[i])                
        {
            return false;
        }
    }
    return true;
}

EDIT: And here's the VB:

Public Shared Function ArraysEqual(ByVal first As Byte(), _
                                   ByVal second As Byte()) As Boolean
    If (first Is second) Then
        Return True
    End If

    If (first Is Nothing OrElse second Is Nothing) Then
        Return False
    End If
    If  (first.Length <> second.Length) Then
         Return False
    End If

    For i as Integer = 0 To first.Length - 1
        If (first(i) <> second(i)) Then
            Return False
        End If
    Next i
    Return True
End Function
Jon Skeet
the _Bytes(I) is an array of bytes that is already in memory.
Middletone
i is just an index to find the byte
Middletone
Hmm... and item.Bytes?
Jon Skeet
Item.bytes is the other in-memory array it's being comapred with..
Middletone
Amazing Jon, I'm thrilled to have the Stack Celeb helping out on this!
Middletone
Try the code I've just provided - but I'm really surprised it was taking two minutes. The above code takes about 140ms on my laptop (built with optimisation, and not running under a debugger, admittedly).
Jon Skeet
Hi Jon, your first condition in the VB code has a `Not` too many. The parens aren't needed either but they don't do any harm (`Is` == `object.ReferenceEquals` == roughly `==` for references when no `operator ==` is defined).
Konrad Rudolph
Thanks Konrad - Reflector had refactored the code somewhat, and I missed that "Not" when trying to undo its work :)
Jon Skeet
A: 

I see two things that might help:

First, rather than always accessing the 2nd array as item.Bytes, use a local var to point directly at the array. That is, before starting the loop, do something like this:

 array2 = item.Bytes

That will save the overhead of dereferencing from the object each time you want a byte. That could be expensive in VB, esp. if there's a Getter method on that property.

Also, use a "definite loop" instead of "for each". You already know the length of the arrays, so just code the loop using that value. This will avoid the overhead of treating the array as a collection. The loop would look something like this:

for i = 1 by 1 to max
 if (array1(i) <> array2(i)) exit for
next
Clayton
A: 

Not strictly related to the comparison algorithm:

Are you sure your bottleneck is not related to the memory available and the time used to load the byte arrays? Loading two 2GB byte arrays just to compare them could bring most machines to their knees. If the program design allows, why don't you try using streams to read smaller chunks instead?

Sergio Acosta