tags:

views:

15

answers:

2

Hi all. Basically, I want my site to aggregate a lot of RSS feeds and store them in a database during a cron job. I use Magpie to parse the RSS into arrays. Everything seems straightforward, although I'm worried about duplication issues when running the cron job.

What is the best solution to avoid duplicate entries? Here is my theory, although I don't think it's efficient.

cron job theory

1) Parse the RSS feed with Magpie
2) Create an MD5 hash of the item's link
3) Test for the existence of that MD5 in the database table: if it's not there, insert; if it exists, ignore or update
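The steps above can be sketched as follows. This is a minimal illustration in Python with SQLite (Magpie itself is PHP; the table and column names here are made up). Putting a PRIMARY KEY or UNIQUE constraint on the hash lets the database reject duplicates in one statement, instead of a separate "test for existence" query followed by an insert:

```python
import hashlib
import sqlite3

conn = sqlite3.connect(":memory:")  # stand-in for the real database
conn.execute("""
    CREATE TABLE IF NOT EXISTS feed_items (
        link_hash TEXT PRIMARY KEY,  -- MD5 of the item link
        link      TEXT,
        title     TEXT
    )
""")

def store_item(link, title):
    """Insert the item unless its link hash is already present."""
    link_hash = hashlib.md5(link.encode("utf-8")).hexdigest()
    cur = conn.execute(
        "INSERT OR IGNORE INTO feed_items (link_hash, link, title)"
        " VALUES (?, ?, ?)",
        (link_hash, link, title),
    )
    # rowcount is 1 if the row was inserted, 0 if it was a duplicate
    return cur.rowcount == 1

# Running the cron job twice over the same item stores it only once.
print(store_item("http://example.com/article-1", "First article"))  # True
print(store_item("http://example.com/article-1", "First article"))  # False
```

In MySQL the equivalent would be `INSERT IGNORE` or `INSERT ... ON DUPLICATE KEY UPDATE` against a UNIQUE index. This also closes the race window between the existence check and the insert if two jobs ever overlap.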

Let me know if there is a more efficient way.

+1  A: 

Since you are worried about duplication, how would an item end up duplicated in the first place? If the same article is found on several different sites, it may be a better idea to take the MD5 of the first sentence of the article, or something similar.

Tech163
Great answer, so I don't even need to do MD5, since links will always be unique. I suppose it's better than an MD5 of the first sentence, since multiple sites could have the same first line :-)
Sir Lojik
+1  A: 

Links alone may not be enough, because articles are duplicated across several sites. I once built a system to collect articles from many newspapers, where the same article could appear in multiple sources. A single site may also publish the same article at multiple URLs, for example when an article is listed in multiple categories.

If you really want to be sure an article is not a duplicate, compare the content itself, or a hash computed from it.
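Hashing on content can be sketched like this (Python for illustration). Normalizing whitespace and case before hashing is an assumption on my part; real article text scraped from different sites often needs more cleanup (stripping HTML, ads, bylines) before two copies of the same story will hash identically:

```python
import hashlib
import re

def content_fingerprint(text):
    """Hash the article body after normalizing whitespace and case,
    so trivial formatting differences between sites don't defeat
    the duplicate check."""
    normalized = re.sub(r"\s+", " ", text).strip().lower()
    return hashlib.md5(normalized.encode("utf-8")).hexdigest()

a = content_fingerprint("Breaking news:  The quick brown fox.")
b = content_fingerprint("breaking news: the quick brown fox.")
print(a == b)  # True: same body, different formatting
```

The fingerprint can then be stored in a UNIQUE-indexed column alongside the link hash, so either an identical URL or identical content blocks the insert.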

Kwebble