views: 70
answers: 3
Hi, I have a table with one column (AbsoluteUrl NVARCHAR(2048)) that I want to query on, but comparing each record against my own string takes a long time. The table has at least 1,000,000 records.

Now I think a better solution is to generate a checksum for each AbsoluteUrl and compare checksums instead of comparing the AbsoluteUrl column directly. I'm using the method below to generate the checksum, but I would like another class that produces checksums shorter than 128 bytes.

public static byte[] GenerateChecksumAsByte(string content)
{
    // Hash the UTF-8 bytes of the string with SHA-1; the digest is 20 bytes (160 bits).
    var buffer = Encoding.UTF8.GetBytes(content);
    return new SHA1Managed().ComputeHash(buffer);
}
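
For reference, a minimal usage sketch (the URL literal is just an illustration): SHA-1 already produces only a 20-byte digest, well under the 128-byte limit mentioned above.

// Illustrative call; a SHA-1 digest is always 20 bytes long.
byte[] checksum = GenerateChecksumAsByte("http://example.com/some/page");
Console.WriteLine(checksum.Length); // 20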

Is this approach good for my purpose?

UPDATE

According to the answers, I want to explain in more depth. I'm actually working on a very simple web search engine. Briefly: once all the URLs of a web page have been extracted (a collection of found URLs), I index them into the Urls table.

UrlId        uniqueidentifier  NOT NULL  PRIMARY KEY (clustered index)
AbsoluteUrl  nvarchar(2048)    NOT NULL
Checksum     varbinary(128)    NOT NULL

So I first search the table to see whether the same URL has already been indexed; if not, I create a new record.

public Url Get(byte[] checksum)
{
    // Look up an existing record by its checksum
    return _dataContext.Urls.SingleOrDefault(url => url.Checksum == checksum);
    // Or query by the AbsoluteUrl field instead
}

And the Save method:

public void Save(Url url)
{
    if (url == null)
        throw new ArgumentNullException("url");

    var origin = _dataContext.Urls.GetOriginalEntityState(url);
    if (origin == null)
    {
        // Not tracked by the data context: attach it and keep its current values
        _dataContext.Urls.Attach(url);
        _dataContext.Refresh(RefreshMode.KeepCurrentValues, url);
    }
    else
    {
        _dataContext.Urls.InsertOnSubmit(url);
    }
    _dataContext.SubmitChanges();
}

For example, if I find 2000 URLs on one page, I have to search the table 2000 times.

+1  A: 

No, this is not a good approach.

A million records is no big deal for an indexed field. On the other hand, any checksum/hash/whatever you generate is capable of false positives due to the pigeonhole principle (see also the birthday paradox). Making it bigger reduces but does not eliminate this chance, and it slows things down to the point where there will be no speed increase.

Just slap an index on the field and see what happens.

Steven Sudit
OK, I'm using an index on this table, but the (clustered) index is on the UrlId (primary key) field.
Sadegh
Put an index on AbsoluteUrl.
Steven Sudit
Not that you should use a hash here, but just for the record: If you want to preserve the distribution properties of a hash but expect to have a dataset small enough that the bit width is overkill in terms of protecting you from the birthday paradox, nothing stops you from XORing the top and bottom parts, and even repeating this until you get the size you want. The catch is that you pay the full cost of creating the hash, and if you overdo it, then you'll suffer from the birthday paradox.
Steven Sudit
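
As a rough sketch of the XOR-folding idea described in the comment above (the folding helper and the 8-byte target are assumptions for illustration; only GenerateChecksumAsByte comes from the question):

// Fold a hash down to a smaller size by XORing it into a fixed-length buffer.
// This keeps the distribution reasonable, but a 64-bit value collides far
// sooner than the full 160-bit SHA-1 digest (birthday paradox).
public static byte[] FoldChecksum(byte[] hash, int targetLength)
{
    var folded = new byte[targetLength];
    for (int i = 0; i < hash.Length; i++)
        folded[i % targetLength] ^= hash[i];
    return folded;
}

// Example: byte[] small = FoldChecksum(GenerateChecksumAsByte(url), 8);
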
+2  A: 

You want to use a hash with p possible values as a key, expecting at most 1M records (u). To answer this question you first have to do the math...

Solve the following for each hash size to consider: 1 - e ^ (-u^2 / (2 * p))

  • 32-bit: 100% chance of collision
  • 64-bit: 0.00000271% chance of collision
  • 128-bit: 0% (too small to calculate with double precision)

Now you should have enough information to make an informed decision. Here is the code to produce the above calculation on the 64-bit key:

// Collision probability for a 64-bit key and 1,000,000 records:
// 1 - e^(-u^2 / (2 * p)), where p = 2^keySize and u = universeSize.
double keySize = 64;
double possibleKeys = Math.Pow(2, keySize);
double universeSize = 1000000;
double v1, v2;
v1 = -Math.Pow(universeSize, 2);   // -u^2
v2 = 2.0 * possibleKeys;           // 2 * p
v1 = v1 / v2;                      // -u^2 / (2 * p)
v1 = Math.Pow(2.718281828, v1);    // e^(-u^2 / (2 * p))
v1 = 1.0 - v1;
Console.WriteLine("The resulting percentage is {0:n40}%", v1 * 100.0);

Personally, I'd stick with at least a 128-bit hash. Moreover, if collisions could cause any form of security hole, you need to use at least an SHA-2 hash (SHA256/SHA512).

Now, if this is just an optimization for the database, consider the following:

  1. Add a 32-bit hash code to the table.
  2. Create a composite key containing both the 32-bit hash AND the original string.
  3. ALWAYS seek on both the hash and the original string (see the sketch after this list).
  4. Assume the hash is only an optimization and never unique.
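
A minimal LINQ to SQL sketch of that seek pattern, in the spirit of the question's repository code (the UrlHash column and the method signature are assumptions):

// Seek on the small hash AND the original string: the hash narrows the search,
// and the string comparison guarantees correctness when hashes collide.
public Url Get(int urlHash, string absoluteUrl)
{
    return _dataContext.Urls.SingleOrDefault(
        u => u.UrlHash == urlHash && u.AbsoluteUrl == absoluteUrl);
}
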
csharptest.net
Thanks, your answer is very useful, but I don't want to get involved with unmanaged code; I would rather use .NET Framework classes to solve this issue!
Sadegh
Collisions are *always* possible with hashes, so even if it were a good idea to use them for speed (and it's not), you'd still need to filter the results to ensure that they're correct.
Steven Sudit
@Steven, while I'll agree that in principle it's not a good idea, I'll shy away from saying it's 'never' acceptable to use a hash as a key. I will say you must understand the probability and the ramifications, and be willing to accept the consequences of a collision. I think the real issue is that it often does not produce a performance gain and can degrade performance (and I think it will for the OP's updated example).
csharptest.net
Using hashes to speed searches is old hat, but relies on quick, small hashes that produce a probabilistic match. You could use a large, well-distributed hash, such as SHA-2, but the size subverts most of the potential benefit. And, due to the birthday paradox, a collision is counter-intuitively easy. It doesn't help that these hashes are also susceptible to extension attacks. So, in short, while I wouldn't say "never", either, I'd put a lot of conditions on when it's even plausible.
Steven Sudit
+2  A: 

I agree with Steven that you should first try an index on the field to see if it really is "comparing each record" that is the bottleneck.

However, depending on your database, indexing an NVARCHAR(2048) may not be possible, and the unindexed comparisons really could be the bottleneck. In that case, generating checksums actually could improve your search performance if:

  1. You do many more comparisons than inserts.
  2. Comparing the checksum is faster than comparing NVARCHARs.
  3. Most of your checksums are different.

You have not shown us any queries or sample data, so I have no way of knowing if these are true. If they are true, you can indeed improve performance by generating a checksum for each AbsoluteUrl and assuming values are different where these checksums are different. If the checksums are the same, you will have to do a string comparison to see if values match, but if checksums are different you can be sure the strings are different.

In this case a cryptographic checksum is not necessary; you can use a smaller, faster checksum algorithm like CRC64.
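
The .NET Framework does not ship a CRC64 class, so as a rough illustration only, here is a minimal table-driven CRC-64 sketch using the ECMA-182 polynomial (the class name and layout are assumptions, not a framework API):

// Minimal, non-reflected CRC-64 (ECMA-182 polynomial) - sketch only.
public static class Crc64
{
    private const ulong Polynomial = 0x42F0E1EBA9EA3693UL;
    private static readonly ulong[] Table = BuildTable();

    private static ulong[] BuildTable()
    {
        var table = new ulong[256];
        for (int i = 0; i < 256; i++)
        {
            ulong value = (ulong)i << 56;
            for (int bit = 0; bit < 8; bit++)
                value = (value & 0x8000000000000000UL) != 0
                    ? (value << 1) ^ Polynomial
                    : value << 1;
            table[i] = value;
        }
        return table;
    }

    public static ulong Compute(byte[] data)
    {
        ulong crc = 0;
        foreach (byte b in data)
            crc = Table[((crc >> 56) ^ b) & 0xFF] ^ (crc << 8);
        return crc;
    }
}

// Example: ulong checksum = Crc64.Compute(Encoding.UTF8.GetBytes(absoluteUrl));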

As Steven points out, if your checksums are the same you cannot assume your values are the same. However, if most of your values are different and you have a good checksum, most of your checksums will be different and will not require string comparisons.

Dour High Arch
Thanks, so is there an implementation of CRC64 in the .NET Framework?
Sadegh
Using a search engine shows many implementations including http://hashlib.codeplex.com/, http://www.codeguru.com/forum/showthread.php?p=1963294#post1963294
Dour High Arch
@Dour: A string comparison is always necessary because it may be that you're searching for `A` but only `B` is in the table and `H(A)==H(B)`. In fact, one issue is to ensure that the execution plan checks the hash *before* checking the actual value.
Steven Sudit
@Steven, a string search is only necessary in your specific case of H(A) == H(B), it is not "always" necessary. With a well-chosen hash, the hash comparisons can be reduced to a tiny fraction of string comparisons.
Dour High Arch
Hash comparisons can't have false negatives, but they certainly can have false positives, which is why we need to do a direct comparison with the candidate to ensure that it's a genuine match.
Steven Sudit
@Steven, but only for false positives, which for any half-decent hash will be rare. Hardly "always".
Dour High Arch
That depends. Due to the birthday paradox, even a perfectly distributed hash will tend to collide as the number of items approaches 2^(n/2), where n is the bit width of the hash. This means that a 32-bit hash will collide repeatedly when fed tens of thousands of items. However, this is just the best case, since real hashes tend to have weaknesses. There is a tension between a hash being fast enough to save you any time and a hash being collision-resistant. (continued)
Steven Sudit
Worse, any hash that has a Merkle–Damgård construction (and that's all of the cryptographic ones!) is vulnerable to inadvertent extension attacks, among other issues. So, really, it's a catch-22. Perhaps the risk of false positives from SHA-512 is negligible, but the cost of calculating it makes it a poor speed-up. On the other hand, anything that provides a speed-up is likely to carry a non-trivial risk of collisions. In practice, hash tables favor fast hashes and handle collisions by direct comparison.
Steven Sudit
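
For reference, a quick sketch of the 2^(n/2) rule of thumb mentioned in these comments (the thresholds are rough orders of magnitude, not exact birthday-bound calculations):

// Approximate item count at which collisions become likely for an n-bit hash.
foreach (int bits in new[] { 32, 64, 128 })
{
    double threshold = Math.Pow(2, bits / 2.0);
    Console.WriteLine("{0}-bit hash: collisions become likely around {1:n0} items", bits, threshold);
}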