Hi,

I have a program that needs to check whether a URL already exists in the database. If it does, I should select its ID; if not, I should insert it into the database.

My question: is it a good approach to save the GetHashCode value in the database and just compare the hash codes? Can I be sure that two or more URLs never have the same hash code, and does the result depend on which .NET Framework version is installed?

Thanks

A: 

No, it is not a good idea, because GetHashCode() might return different results in the next .NET Framework version. See the Remarks section on MSDN.

tanascius
Thanks so much. How about MD5 for URLs?
Hossein Margani
I think MD5 is OK, but remember to normalize your URLs before using MD5 or any other hash function.
tanascius
+1  A: 
  1. Don't use the out-of-the-box GetHashCode(); it is weak and might change in the next version.
  2. Use your own hash function based on SHA1/SHA2.
  3. You need to deal with escaping, e.g. 'A B' == 'A%20B'.
  4. You also need to consider what to do about case sensitivity.
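As a rough illustration of points 2 to 4, here is a minimal sketch in Python (chosen only for brevity; the same steps apply in .NET). The names `normalize_url` and `url_key` are hypothetical, and the normalization rules are assumptions: lowercase the scheme and host, keep the case of the path, and re-escape the path so that 'A B' and 'A%20B' end up identical. Real URL canonicalization has more corner cases than this.

```python
import hashlib
from urllib.parse import quote, unquote, urlsplit, urlunsplit


def normalize_url(url: str) -> str:
    """Return a canonical form of the URL (simplified, assumption-laden)."""
    parts = urlsplit(url.strip())
    scheme = parts.scheme.lower()
    # Simplification: lowercases the whole authority, including any userinfo.
    netloc = parts.netloc.lower()
    # Decode and re-encode the path so 'A B' and 'A%20B' become the same string.
    # Caveat: this also turns an escaped '%2F' into a literal '/'.
    path = quote(unquote(parts.path), safe="/") or "/"
    return urlunsplit((scheme, netloc, path, parts.query, parts.fragment))


def url_key(url: str) -> str:
    """SHA-256 hex digest of the normalized URL, usable as a lookup key."""
    return hashlib.sha256(normalize_url(url).encode("utf-8")).hexdigest()
```

With this sketch, `url_key("http://Example.com/A B")` and `url_key("http://example.com/A%20B")` give the same 64-character key, while paths that differ only in case still give different keys.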
Shay Erlichmen
Thanks so much. How about MD5 for URLs?
Hossein Margani
MD5 is broken: http://www.microsoft.com/technet/security/advisory/961509.mspx.
Shay Erlichmen
Broken only for cryptographic use - it should be absolutely no problem in this case.
tanascius
If the URL comes from the user, it is an issue: they can generate a collision in the DB so that, instead of going to, let's say, google.com, it goes to mybadsite.com/this_url_can_have_the_same_md5_hash_as_goole_com.htm
Shay Erlichmen
A: 

Don't use it as an identity - GetHashCode may return the same value for different strings.

The GetHashCode result is an Int32, so it can take only about 4e9 (2^32) different values. Since the number of web pages is already around this value (http://everything2.com/index.pl?node_id=1268366), you can be almost sure that some different URLs generate the same hash.
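Collisions actually become likely long before you reach 4e9 URLs. The sketch below uses the standard birthday approximation, p ≈ 1 - exp(-n(n-1) / (2 * 2^bits)), and assumes the hash values are spread uniformly, which GetHashCode does not guarantee, so treat the numbers as rough. The function name `collision_probability` is hypothetical.

```python
import math


def collision_probability(n_items: int, hash_bits: int = 32) -> float:
    """Approximate chance that at least two of n_items share a hash value.

    Birthday approximation, assuming uniformly distributed hash values.
    """
    buckets = 2 ** hash_bits
    exponent = n_items * (n_items - 1) / (2 * buckets)
    # -expm1(-x) equals 1 - exp(-x) but stays accurate for very small x.
    return -math.expm1(-exponent)
```

Under that assumption, a 32-bit hash reaches roughly even odds of at least one collision at about 77,000 URLs, whereas a 256-bit hash keeps the probability negligible even for a billion URLs.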

A: 

If you really want to make sure no duplicates exist, you should just store the URL. The only thing you could do with a hash is use it as a first indicator of whether the URL might exist, but basically you're doing the indexing manually, while a good DB could do this for you.

Apart from how to store it, the same URL can be represented by different strings, so it might be a good idea to specify how unique you want the URLs to be.
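A minimal sketch of letting the database do the work, using Python's built-in sqlite3 module only because it is self-contained. The table name `urls` and the helper names `open_db` and `get_or_create_id` are hypothetical, and `INSERT OR IGNORE` is SQLite-specific syntax; other databases have their own equivalent for a unique-constrained insert.

```python
import sqlite3


def open_db(path: str = ":memory:") -> sqlite3.Connection:
    """Open the database and create the (hypothetical) urls table if needed."""
    conn = sqlite3.connect(path)
    # The UNIQUE constraint makes the database build and maintain the index.
    conn.execute(
        "CREATE TABLE IF NOT EXISTS urls ("
        "id INTEGER PRIMARY KEY, url TEXT NOT NULL UNIQUE)"
    )
    return conn


def get_or_create_id(conn: sqlite3.Connection, url: str) -> int:
    """Return the ID for url, inserting a new row only if it is not stored yet."""
    conn.execute("INSERT OR IGNORE INTO urls (url) VALUES (?)", (url,))
    conn.commit()
    row = conn.execute("SELECT id FROM urls WHERE url = ?", (url,)).fetchone()
    return row[0]
```

Calling `get_or_create_id` twice with the same string returns the same ID, so each stored URL maps to exactly one ID without any hand-rolled hash comparison.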

Davy Landman
Hi, thanks a lot. I want two URLs to count as equal only when the whole URL is the same.
Hossein Margani
but http://www.google.com, http://www.GOOGLE.com, http://74.125.79.147, http://%77%77%77%2e%67%6f%6f%67%6c%65%2e%63%6f%6d, http://1249723236 and http://74.0175.0x4364 all lead to the same page, so how unique do you want it?
Davy Landman
It is not important. I just want an ID from my DB, so that if I have a link it has one ID, not two or more.
Hossein Margani
I supplied six different URLs; should each of those have a unique ID, or should they all have the same ID?
Davy Landman
They can each have their own ID; it's not important.
Hossein Margani