views: 70
answers: 3
Hi, I have a table with one column (AbsoluteUrl NVARCHAR(2048)) that I want to query on, but comparing each record against my own string takes a long time. The table has at least 1,000,000 records.

Now I think a better solution is to generate a checksum for each AbsoluteUrl and compare checksums instead of comparing the AbsoluteUrl column directly. I'm using the method below to generate the checksum, but I would like another class that produces checksums shorter than 128 bytes.

public static byte[] GenerateChecksumAsByte(string content)
{
    // Hash the UTF-8 bytes of the string with SHA-1; the digest is 20 bytes (160 bits).
    var buffer = Encoding.UTF8.GetBytes(content);
    return new SHA1Managed().ComputeHash(buffer);
}
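
For reference, a minimal usage sketch (the URL literal is just an illustration): SHA-1 already produces only a 20-byte digest, well under the 128-byte limit mentioned above.

// Illustrative call; a SHA-1 digest is always 20 bytes long.
byte[] checksum = GenerateChecksumAsByte("http://example.com/some/page");
Console.WriteLine(checksum.Length); // 20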

Is this approach good for my purpose?

UPDATE

According to the answers, I want to explain in more depth. I'm actually working on a very simple web search engine. Briefly: once all the URLs of a web page have been extracted (a collection of found URLs), I index them into the Urls table.

UrlId        uniqueidentifier  NOT NULL  PRIMARY KEY (clustered index)
AbsoluteUrl  nvarchar(2048)    NOT NULL
Checksum     varbinary(128)    NOT NULL

So I first search the table to see whether the same URL has already been indexed; if not, I create a new record.

public Url Get(byte[] checksum)
{
    // Look up an existing record by its checksum
    return _dataContext.Urls.SingleOrDefault(url => url.Checksum == checksum);
    // Or query by the AbsoluteUrl field instead
}

And the Save method:

public void Save(Url url)
{
    if (url == null)
        throw new ArgumentNullException("url");

    var origin = _dataContext.Urls.GetOriginalEntityState(url);
    if (origin == null)
    {
        // Not tracked by the data context: attach it and keep its current values
        _dataContext.Urls.Attach(url);
        _dataContext.Refresh(RefreshMode.KeepCurrentValues, url);
    }
    else
    {
        _dataContext.Urls.InsertOnSubmit(url);
    }
    _dataContext.SubmitChanges();
}

For example, if I find 2000 URLs on one page, I have to search the table 2000 times.

+1  A: 

No, this is not a good approach.

A million records is no big deal for an indexed field. On the other hand, any checksum/hash/whatever you generate is capable of false positives due to the pigeonhole principle (see also the birthday paradox). Making it bigger reduces but does not eliminate this chance, and it slows things down to the point where there will be no speed increase.

Just slap an index on the field and see what happens.

Steven Sudit
OK, I'm using an index on this table, but the (clustered) index is on the UrlId (primary key) field.
Sadegh
Put an index on AbsoluteUrl.
Steven Sudit
Not that you should use a hash here, but just for the record: If you want to preserve the distribution properties of a hash but expect to have a dataset small enough that the bit width is overkill in terms of protecting you from the birthday paradox, nothing stops you from XORing the top and bottom parts, and even repeating this until you get the size you want. The catch is that you pay the full cost of creating the hash, and if you overdo it, then you'll suffer from the birthday paradox.
Steven Sudit
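
As a rough sketch of the XOR-folding idea described in the comment above (the folding helper and the 8-byte target are assumptions for illustration; only GenerateChecksumAsByte comes from the question):

// Fold a hash down to a smaller size by XORing it into a fixed-length buffer.
// This keeps the distribution reasonable, but a 64-bit value collides far
// sooner than the full 160-bit SHA-1 digest (birthday paradox).
public static byte[] FoldChecksum(byte[] hash, int targetLength)
{
    var folded = new byte[targetLength];
    for (int i = 0; i < hash.Length; i++)
        folded[i % targetLength] ^= hash[i];
    return folded;
}

// Example: byte[] small = FoldChecksum(GenerateChecksumAsByte(url), 8);
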
+2  A: 

You want to use a hash with p possible values as a key, expecting at most 1M records (u). To answer this question you first have to do the math...

Solve the following for each hash size to consider: 1 - e ^ (-u^2 / (2 * p))

  • 32-bit: 100% chance of collision
  • 64-bit: 0.00000271% chance of collision
  • 128-bit: 0% (too small to calculate with double precision)

Now you should have enough information to make an informed decision. Here is the code to produce the above calculation on the 64-bit key:

// Collision probability for a 64-bit key and 1,000,000 records:
// 1 - e^(-u^2 / (2 * p)), where p = 2^keySize and u = universeSize.
double keySize = 64;
double possibleKeys = Math.Pow(2, keySize);
double universeSize = 1000000;
double v1, v2;
v1 = -Math.Pow(universeSize, 2);   // -u^2
v2 = 2.0 * possibleKeys;           // 2 * p
v1 = v1 / v2;                      // -u^2 / (2 * p)
v1 = Math.Pow(2.718281828, v1);    // e^(-u^2 / (2 * p))
v1 = 1.0 - v1;
Console.WriteLine("The resulting percentage is {0:n40}%", v1 * 100.0);

Personally, I'd stick with at least a 128-bit hash. Moreover, if collisions could cause any form of security hole, you need to use at least an SHA-2 hash (SHA256/SHA512).

Now, if this is just an optimization for the database, consider the following:

  1. Add a 32-bit hash code to the table.
  2. Create a composite key containing both the 32-bit hash AND the original string.
  3. ALWAYS seek on both the hash and the original string (see the sketch after this list).
  4. Assume the hash is only an optimization and never unique.
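
A minimal LINQ to SQL sketch of that seek pattern, in the spirit of the question's repository code (the UrlHash column and the method signature are assumptions):

// Seek on the small hash AND the original string: the hash narrows the search,
// and the string comparison guarantees correctness when hashes collide.
public Url Get(int urlHash, string absoluteUrl)
{
    return _dataContext.Urls.SingleOrDefault(
        u => u.UrlHash == urlHash && u.AbsoluteUrl == absoluteUrl);
}
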
csharptest.net
Thanks, your answer is very useful, but I don't want to get involved with unmanaged code; I would rather use .NET Framework classes to solve this issue!
Sadegh
Collisions are *always* possible with hashes, so even if it were a good idea to use them for speed (and it's not), you'd still need to filter the results to ensure that they're correct.
Steven Sudit
@Steven, while I'll agree that in principle it's not a good idea, I'll shy away from saying it's 'never' acceptable to use a hash as a key. I will say you must understand the probability and the ramifications, and be willing to accept the consequences of a collision. I think the real issue is that it often does not produce a performance gain and can degrade performance (and I think it will for the OP's updated example).
csharptest.net
Using hashes to speed searches is old hat, but relies on quick, small hashes that produce a probabilistic match. You could use a large, well-distributed hash, such as SHA-2, but the size subverts most of the potential benefit. And, due to the birthday paradox, a collision is counter-intuitively easy. It doesn't help that these hashes are also susceptible to extension attacks. So, in short, while I wouldn't say "never", either, I'd put a lot of conditions on when it's even plausible.
Steven Sudit
+2  A: 

I agree with Steven that you should first try an index on the field to see if it really is "comparing each record" that is the bottleneck.

However, depending on your database, indexing an NVARCHAR(2048) may not be possible, and the unindexed comparisons really could be the bottleneck. In that case, generating checksums actually could improve your search performance if:

  1. You do many more comparisons than inserts.
  2. Comparing the checksum is faster than comparing NVARCHARs.
  3. Most of your checksums are different.

You have not shown us any queries or sample data, so I have no way of knowing if these are true. If they are true, you can indeed improve performance by generating a checksum for each AbsoluteUrl and assuming values are different where these checksums are different. If the checksums are the same, you will have to do a string comparison to see if values match, but if checksums are different you can be sure the strings are different.

In this case a cryptographic checksum is not necessary; you can use a smaller, faster checksum algorithm like CRC64.
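
The .NET Framework does not ship a CRC64 class, so as a rough illustration only, here is a minimal table-driven CRC-64 sketch using the ECMA-182 polynomial (the class name and layout are assumptions, not a framework API):

// Minimal, non-reflected CRC-64 (ECMA-182 polynomial) - sketch only.
public static class Crc64
{
    private const ulong Polynomial = 0x42F0E1EBA9EA3693UL;
    private static readonly ulong[] Table = BuildTable();

    private static ulong[] BuildTable()
    {
        var table = new ulong[256];
        for (int i = 0; i < 256; i++)
        {
            ulong value = (ulong)i << 56;
            for (int bit = 0; bit < 8; bit++)
                value = (value & 0x8000000000000000UL) != 0
                    ? (value << 1) ^ Polynomial
                    : value << 1;
            table[i] = value;
        }
        return table;
    }

    public static ulong Compute(byte[] data)
    {
        ulong crc = 0;
        foreach (byte b in data)
            crc = Table[((crc >> 56) ^ b) & 0xFF] ^ (crc << 8);
        return crc;
    }
}

// Example: ulong checksum = Crc64.Compute(Encoding.UTF8.GetBytes(absoluteUrl));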

As Steven points out, if your checksums are the same you cannot assume your values are the same. However, if most of your values are different and you have a good checksum, most of your checksums will be different and will not require string comparisons.

Dour High Arch
Thanks, so is there an implementation of CRC64 in the .NET Framework?
Sadegh
Using a search engine shows many implementations including http://hashlib.codeplex.com/, http://www.codeguru.com/forum/showthread.php?p=1963294#post1963294
Dour High Arch
@Dour: A string comparison is always necessary because it may be that you're searching for `A` but only `B` is in the table and `H(A)==H(B)`. In fact, one issue is to ensure that the execution plan checks the hash *before* checking the actual value.
Steven Sudit
@Steven, a string search is only necessary in your specific case of H(A) == H(B), it is not "always" necessary. With a well-chosen hash, the hash comparisons can be reduced to a tiny fraction of string comparisons.
Dour High Arch
Hash comparisons can't have false negatives, but they certainly can have false positives, which is why we need to do a direct comparison with the candidate to ensure that it's a genuine match.
Steven Sudit
@Steven, but only for false positives, which for any half-decent hash will be rare. Hardly "always".
Dour High Arch
That depends. Due to the birthday paradox, even a perfectly distributed hash will tend to collide as the number of items approaches 2^(n/2), where n is the bit width of the hash. This means that a 32-bit hash will collide repeatedly when fed tens of thousands of items. However, this is just the best case, since real hashes tend to have weaknesses. There is a tension between a hash being fast enough to save you any time and a hash being collision-resistant. (continued)
Steven Sudit
Worse, any hash that has a Merkle–Damgård construction (and that's all of the cryptographic ones!) is vulnerable to inadvertent extension attacks, among other issues. So, really, it's a catch-22. Perhaps the risk of false positives from SHA-512 is negligible, but the cost of calculating it makes it a poor speed-up. On the other hand, anything that provides a speed-up is likely to carry a non-trivial risk of collisions. In practice, hash tables favor fast hashes and handle collisions by direct comparison.
Steven Sudit
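
For reference, a quick sketch of the 2^(n/2) rule of thumb mentioned in these comments (the thresholds are rough orders of magnitude, not exact birthday-bound calculations):

// Approximate item count at which collisions become likely for an n-bit hash.
foreach (int bits in new[] { 32, 64, 128 })
{
    double threshold = Math.Pow(2, bits / 2.0);
    Console.WriteLine("{0}-bit hash: collisions become likely around {1:n0} items", bits, threshold);
}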