I hope this question isn’t too “right field,” and I'll be upfront in saying I'm a newb compared to many people on Stack Overflow...

I want to compare object representations of images, audio and text for an AI project I am working on. I'd like to convert all three inputs into a single data type and use a central comparison algorithm to determine statistically probable matches.

What are the “fastest” native .NET and SQL data types for making comparisons like this? In .NET, which data type requires the fewest conversions in the CLR? For SQL, which type can be “CRUD-ed” the fastest?

I was thinking bytes for .NET and integers for SQL, but integers pose the problem of being a one-dimensional concept. Do you think the images and audio should be handled within the file system rather than SQL? I’m guessing so…

FWIW I'm building a robot from parts I bought at TrossenRobotics.com

A: 

Personally, I'd say you're best off using a byte array. You can easily read the file into a buffer, and from the buffer into a byte array where you can do the comparison.
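
A minimal sketch of that approach (the file names are placeholders, and LINQ's SequenceEqual is just one of several ways to do the element-wise compare):

    using System;
    using System.IO;
    using System.Linq;

    class ByteCompare
    {
        static void Main()
        {
            // Read both files straight into byte arrays (paths are hypothetical).
            byte[] a = File.ReadAllBytes("first.bin");
            byte[] b = File.ReadAllBytes("second.bin");

            // SequenceEqual stops at the first differing byte.
            bool same = a.Length == b.Length && a.SequenceEqual(b);
            Console.WriteLine(same ? "match" : "no match");
        }
    }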

Justin Niessner
On the contrary, I would rather use an int array - the x86 uses 32-bit words, so comparing two bytes takes at least as much time as comparing two 32-bit integers. I say "at least" because the CPU still has to do the padding, which also takes some time. So basically, by using an int array, the operation would become at least four times faster.
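
A rough sketch of what that four-bytes-at-a-time comparison might look like (it assumes both arrays are the same length and a multiple of four bytes; handling a leftover tail is omitted):

    using System;

    static class IntCompare
    {
        // Compare two byte arrays four bytes at a time by reading them as 32-bit ints.
        public static bool EqualsAsInt32(byte[] a, byte[] b)
        {
            if (a.Length != b.Length) return false;
            for (int i = 0; i < a.Length; i += 4)
            {
                if (BitConverter.ToInt32(a, i) != BitConverter.ToInt32(b, i))
                    return false;
            }
            return true;
        }
    }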
DrJokepu
+2  A: 

Personally, if you need to do frequent comparisons between large binary objects, I would hash the objects and compare the hashes.

If the hashes don't match, then you can be sure the objects don't match (which should be the majority of the cases).

If the hashes do match, you can then start a more lengthy routine to compare the actual objects.

This method alone should boost your performance quite a bit if you're comparing these objects frequently.
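
A sketch of the hash-first idea (in practice you would compute and store the hash once per object when it is created, rather than recomputing it on every comparison as this toy version does):

    using System.Linq;
    using System.Security.Cryptography;

    static class HashFirstCompare
    {
        public static bool Same(byte[] a, byte[] b)
        {
            using (var sha = SHA256.Create())
            {
                byte[] ha = sha.ComputeHash(a);
                byte[] hb = sha.ComputeHash(b);
                if (!ha.SequenceEqual(hb))
                    return false;          // hashes differ => objects definitely differ
            }
            return a.SequenceEqual(b);     // hashes match => confirm with a full compare
        }
    }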

rein
Good point. I'd expect the text to come up with identical matches, but rarely (if ever) the audio and video. I think hashing is costly, but I'd be processing less text, so I'll look into adding that.
nbdeveloper
A: 

As far as I recall, in terms of sheer performance the Int32 type is among the fastest data types in .NET. I can't say whether it is the most suitable in your application, though.

Fredrik Mörk
+1  A: 

The speed of data types is a bit hard to measure. It makes a big difference whether you're using a 32-bit operating system or a 64-bit one. Why? Because it determines the speed at which this data can be processed. In general, on a 32-bit system, all data types that fit inside 32 bits (int16, int32, char, byte, pointers) will be processed at the same speed. If you need lots of data to be processed, it's best to divide it into blocks of four bytes each for your CPU to process them.

However, when you're writing data to disk, speed tends to depend on a lot more factors. If your disk device is on a USB port, all data gets serialized, so it goes byte after byte. In that case, size doesn't matter much, although the smallest data blocks would leave the smallest gaps. (In languages like Pascal you'd use a packed record for this kind of data to optimize streaming performance, while keeping the fields in your records aligned at multiples of 4 bytes for CPU performance.) Regular disks store data in bigger blocks. To increase reading/writing speed, you'd prefer to make your data structures as compact as possible. But for processing performance, having them aligned on 4-byte boundaries is more effective.
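
The packed-record idea has a rough C# analogue in explicit struct layout; this sketch (with made-up field names) just shows how packing changes the in-memory size:

    using System;
    using System.Runtime.InteropServices;

    [StructLayout(LayoutKind.Sequential, Pack = 1)]
    struct PackedSample            // compact: smaller when streamed to disk
    {
        public byte Flag;
        public int Value;
    }

    [StructLayout(LayoutKind.Sequential, Pack = 4)]
    struct AlignedSample           // padded to 4-byte boundaries: friendlier to the CPU
    {
        public byte Flag;
        public int Value;
    }

    class LayoutDemo
    {
        static void Main()
        {
            Console.WriteLine(Marshal.SizeOf(typeof(PackedSample)));   // prints 5
            Console.WriteLine(Marshal.SizeOf(typeof(AlignedSample)));  // prints 8
        }
    }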

Which reminds me that I once had a discussion with someone about using compression on an NTFS disk. I managed to show that compressing an NTFS partition could actually improve the performance of a computer, since it had to read far fewer data blocks, even though it had to do more processing to decompress those same blocks.

To improve performance, you just have to find the weakest (slowest) link and start there. Once it's optimized, there will be another weak link...

Workshop Alex
A: 

Before pulling anything into .NET, you should check the length of the data in SQL Server using the LEN function. If the length is different, you know already that the two objects are different. This should save bringing down lots of unnecessary data from SQL Server to your client application.

I would also recommend storing a hash code (in a separate column from the binary data) using the CHECKSUM function (http://msdn.microsoft.com/en-us/library/aa258245(SQL.80).aspx). This will only work if you are using SQL Server 2005 and above and you are storing your data as varbinary(MAX). Once again, if the hash codes are different, the binary data is definitely different.
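
A sketch of that pre-check from the client side (the Blobs table, Data column and DataChecksum column are made up; DATALENGTH is used because it counts bytes for varbinary columns, and DataChecksum is assumed to hold CHECKSUM(Data) written when the row was saved):

    using System;
    using System.Data.SqlClient;

    class BlobPreCheck
    {
        // Hypothetical schema: Blobs(Id int, Data varbinary(max), DataChecksum int)
        static bool MightMatch(SqlConnection conn, int idA, int idB)
        {
            const string sql =
                "SELECT DATALENGTH(Data), DataChecksum FROM Blobs WHERE Id = @id";

            long lenA, lenB;
            int sumA, sumB;
            using (var cmd = new SqlCommand(sql, conn))
            {
                cmd.Parameters.AddWithValue("@id", idA);
                using (var r = cmd.ExecuteReader())
                {
                    r.Read();
                    lenA = r.GetInt64(0);
                    sumA = r.GetInt32(1);
                }
                cmd.Parameters["@id"].Value = idB;
                using (var r = cmd.ExecuteReader())
                {
                    r.Read();
                    lenB = r.GetInt64(0);
                    sumB = r.GetInt32(1);
                }
            }
            // Different lengths or checksums mean the blobs are definitely different;
            // only when both match is it worth pulling the data down for a full compare.
            return lenA == lenB && sumA == sumB;
        }
    }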

If you are using SQL Server 2000, you are stuck with the 'image' data type.

Both image and varbinary(MAX) will map nicely to byte[] objects on the client; however, if you are using SQL Server 2008, you have the option of storing your data as a FILESTREAM data type (http://blogs.msdn.com/manisblog/archive/2007/10/21/filestream-data-type-sql-server-2008.aspx).

John JJ Curtis