Hi all, I'm using PHP to create a sort of RSS aggregator that stores data from multiple sites' RSS feeds in a MySQL database. Since the same article can appear on many websites, I want to avoid storing duplicates. I've been told you can use hashing to make unique hashes based on the content of each RSS item [description + title]. Which hashing algorithm is fastest and produces the fewest characters, so that I can use it for comparison to avoid duplicates?

Thanks in advance

+1  A: 

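To avoid false duplicates you should use a hashing algorithm with a wide output, like SHA-1 (160 bits) or MD5 (128 bits). Neither is still considered secure against deliberately crafted collisions, but for spotting accidental duplicates between feed items that does not matter.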

Albin Sunnanbo
+1  A: 

MD5 is fast and produces a hash that is 32 hexadecimal characters long.

<?php
$hash = md5($description . $title);
?>

I used it in my RSS parser for exactly the same purpose, and it works like a charm.
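
As a rough illustration of how the hash could be used for the duplicate check, here is a minimal sketch. The function names `item_hash` and `is_duplicate` are hypothetical, the `$seen` array merely stands in for a hash column with a UNIQUE index in MySQL, and the newline separator between the two fields is an added assumption, not part of the one-liner above.

```php
<?php
// Hypothetical helper: builds the dedup key for one feed item.
function item_hash(string $title, string $description): string
{
    // Assumption: a separator keeps ("ab", "c") and ("a", "bc")
    // from producing the same concatenated input.
    return md5($description . "\n" . $title);
}

// Hypothetical helper: returns true if the item was seen before,
// otherwise records its hash and returns false.
function is_duplicate(array &$seen, string $title, string $description): bool
{
    $hash = item_hash($title, $description);
    if (isset($seen[$hash])) {
        return true;
    }
    $seen[$hash] = true;
    return false;
}

// In-memory stand-in for a UNIQUE-indexed hash column in the database.
$seen = [];
```

In a real aggregator the lookup would more likely be left to the database, for example by inserting the hash into a uniquely indexed column and treating a duplicate-key error as "already stored".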

shamittomar
Thanks for all your answers, but I think I'll take shamittomar's answer, as it's 32 chars long, uses MD5, and he understood my question and has gone through a similar problem.
Sir Lojik
+2  A: 

sprintf('%u',crc32()) produces one of 4,294,967,296 possible values, and it's shorter than MD5 or SHA-1. It's only 32 bits wide.

stillstanding
You should pass the string as the argument of `crc32`, of course.
Daniel
It's the OP's option. He can use dechex(sprintf('%u',crc32())) if he wants a hex string, or just a plain left-zero-padded string for pure decimal digits.
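
A small sketch of the output formats discussed here, with the string actually passed to `crc32`. The input text is a placeholder, and the fixed widths (10 decimal digits, 8 hex digits) are an assumption about what is convenient for a fixed-length database column.

```php
<?php
// Placeholder input; in the aggregator this would be description + title.
$text = 'Example description' . 'Example title';

// crc32() returns an integer checksum of the string.
$checksum = crc32($text);

// Unsigned decimal string, up to 10 digits long.
$decimal = sprintf('%u', $checksum);

// Left-zero-padded to a fixed width of 10 decimal digits.
$padded = sprintf('%010u', $checksum);

// Fixed-width, 8-character lowercase hex string.
$hex = sprintf('%08x', $checksum);

// The hash extension yields the same 8-character hex value directly.
$hexViaHash = hash('crc32b', $text);
```

Note that `dechex` on its own does not pad with leading zeros, so its result can be shorter than 8 characters.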
stillstanding
Hmm... 32 bits wide. Thanks for this solution.
Sir Lojik
@Daniel: does that output an integer or a string?
Sir Lojik
Remember, as @stillstanding wrote, "the less number of characters generated by the hash function, the more likely you'll have collisions in your identifiers. Be certain about that." MD5 is 128 bits wide, so it gives far more identifiers and a much smaller chance of collisions.
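
To put rough numbers on that, here is a sketch using the standard birthday approximation, p ≈ 1 − exp(−n(n−1) / (2·2^bits)). The function name `collision_probability` is hypothetical, and the estimate assumes hash values are spread uniformly, which is only approximately true for real feed text.

```php
<?php
// Hypothetical helper: approximate chance that at least two of $items
// distinct inputs share a hash value of width $bits (birthday bound).
function collision_probability(int $items, int $bits): float
{
    $space = pow(2.0, $bits);                       // number of possible hash values
    $exponent = -($items * ($items - 1)) / (2.0 * $space);
    // -expm1(x) equals 1 - exp(x) but stays accurate for tiny probabilities.
    return -expm1($exponent);
}

$p32  = collision_probability(100000, 32);   // CRC32 width
$p128 = collision_probability(100000, 128);  // MD5 width
```

For 100,000 stored articles this gives a probability of about 0.69 that some pair collides at 32 bits, while at 128 bits the estimate is far below anything that matters in practice.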
shamittomar
How about similar_text? Is this worth doing?
Sir Lojik
@Sir Lojik: Returns an integer.
Daniel