I am working on a crawler in PHP that starts from m URLs, at each of which it finds a set of n links to n internal pages that are crawled for data. Links may be added to or removed from the n set over time. I need to keep track of the links/pages so that I know which have been crawled, which have been removed, and which are new.

How should I go about keeping track of which m and n pages have been crawled, so that the next crawl fetches new URLs, re-checks still-existing URLs, and ignores obsolete URLs?

+1  A: 

If you want to store this data for the long term, use a database. You can store the crawled m URLs and their n URLs in the database along with their statuses. When you are going to crawl again, first check the database for already-crawled URLs.

For example:

Store your mURLs in an mtable, something like this:

 id |        mURL           | status       |    crawlingDate
------------------------------------------------------------------
 1  | example.com/one.php   | crawled      |   01-01-2010 12:30:00
 2  | example.com/two.php   | crawled      |   01-01-2010 12:35:10
 3  | example.com/three.php | not-crawled  |   01-01-2010 12:40:33

Now fetch each mURL from mtable, get all of its nURLs, and store them in an ntable, something like this:

 id |        nURL             | mURL_id |  status      | crawlingDate
----------------------------------------------------------------------------
 1  | www.one.com/page1.php   |    1    |  crawled     | 01-01-2010 12:31:00
 2  | www.one.com/page2.php   |    1    |  crawled     | 01-01-2010 12:32:00
 3  | www.two.com/page1.php   |    2    |  crawled     | 01-01-2010 12:36:00
 4  | www.two.com/page2.php   |    2    |  crawled     | 01-01-2010 12:37:00
 5  | www.three.com/page1.php |    3    |  not-crawled | 01-01-2010 12:41:00
 6  | www.three.com/page2.php |    3    |  not-crawled | 01-01-2010 12:42:00
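
A minimal sketch of how these two tables could be created from PHP with PDO (the MySQL DSN, credentials, column sizes, and the UNIQUE constraints are my assumptions, not part of the answer above):

<?php
// Sketch only: assumes a MySQL database reachable through PDO.
$db = new PDO('mysql:host=localhost;dbname=crawler', 'user', 'password');
$db->setAttribute(PDO::ATTR_ERRMODE, PDO::ERRMODE_EXCEPTION);

// mtable holds the m entry-point URLs.
$db->exec("CREATE TABLE IF NOT EXISTS mtable (
    id           INT AUTO_INCREMENT PRIMARY KEY,
    mURL         VARCHAR(255) NOT NULL UNIQUE,
    status       ENUM('crawled', 'not-crawled') DEFAULT 'not-crawled',
    crawlingDate DATETIME NULL
)");

// ntable holds the n links found on each mURL, keyed back via mURL_id.
$db->exec("CREATE TABLE IF NOT EXISTS ntable (
    id           INT AUTO_INCREMENT PRIMARY KEY,
    nURL         VARCHAR(255) NOT NULL UNIQUE,
    mURL_id      INT NOT NULL,
    status       ENUM('crawled', 'not-crawled') DEFAULT 'not-crawled',
    crawlingDate DATETIME NULL,
    FOREIGN KEY (mURL_id) REFERENCES mtable(id)
)");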

When you crawl the next time, first fetch the records from mtable one by one and get all of the nURLs for each mURL. Store each nURL in ntable if it does not already exist there. Then crawl each nURL whose status is not-crawled to get its data, and set its status to crawled when done. When all nURLs for one mURL are done, you can set the status to crawled for that mURL in mtable.
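
A rough sketch of that pass in PHP, reusing the $db connection from the schema sketch above; extractLinks() and crawlPage() are hypothetical helpers standing in for your own fetching and parsing code:

<?php
// One full crawl pass (sketch). INSERT IGNORE relies on the UNIQUE
// constraint on nURL to skip links that are already in ntable.
$insertN = $db->prepare("INSERT IGNORE INTO ntable (nURL, mURL_id, status)
                         VALUES (?, ?, 'not-crawled')");
$pending = $db->prepare("SELECT id, nURL FROM ntable
                         WHERE mURL_id = ? AND status = 'not-crawled'");
$markN   = $db->prepare("UPDATE ntable SET status = 'crawled',
                         crawlingDate = NOW() WHERE id = ?");
$markM   = $db->prepare("UPDATE mtable SET status = 'crawled',
                         crawlingDate = NOW() WHERE id = ?");

foreach ($db->query("SELECT id, mURL FROM mtable") as $m) {
    // Store every link found on the mURL; known nURLs are skipped.
    foreach (extractLinks($m['mURL']) as $link) {
        $insertN->execute([$link, $m['id']]);
    }

    // Crawl only the nURLs that have not been crawled yet.
    $pending->execute([$m['id']]);
    foreach ($pending->fetchAll(PDO::FETCH_ASSOC) as $n) {
        crawlPage($n['nURL']);        // fetch and process the page data
        $markN->execute([$n['id']]);  // mark this nURL as done
    }

    // All nURLs for this mURL are done, so mark the mURL itself.
    $markM->execute([$m['id']]);
}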

If you don't want to use a database and only need to run the crawler once, you can implement the same logic with arrays, as in the sketch below.
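
For example, a single-run version with plain arrays might look like this (again a sketch; extractLinks() and crawlPage() are the same hypothetical helpers as above):

<?php
$mURLs   = ['example.com/one.php', 'example.com/two.php'];
$crawled = [];  // nURL => true; this array plays the role of the status column

foreach ($mURLs as $mURL) {
    foreach (extractLinks($mURL) as $nURL) {
        if (isset($crawled[$nURL])) {
            continue;              // already crawled during this run
        }
        crawlPage($nURL);
        $crawled[$nURL] = true;    // remember it for the rest of the run
    }
}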

Hopefully this helps give you a direction.

NAVEED
Thank you very much, Naveed! This has already given me lots of ideas and a path forward. One thing: don't I need to keep track of the crawling date/time so that I know when to re-crawl each site/page?
iCrawly
You can add a datetime column to both tables. When an nURL's crawling is done, set the current time in the date column of ntable, and when all nURLs of an mURL are done, set the current time in the date column of mtable.
NAVEED
When you crawl the next time, you can check each mURL's crawlingDate and compare it with the current date. If the difference is 24 hours (for example), crawl its nURLs; otherwise move on to the next mURL.
NAVEED
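
A sketch of that check as a single query, assuming the schema above (the 24-hour interval is just the example figure from this comment):

<?php
// Select only mURLs that were never crawled, or whose last crawl
// was more than 24 hours ago.
$due = $db->query("SELECT id, mURL FROM mtable
                   WHERE crawlingDate IS NULL
                      OR crawlingDate < NOW() - INTERVAL 24 HOUR");
foreach ($due as $m) {
    // ...re-crawl this mURL's nURLs exactly as in the main loop above
}
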
Thank you very much for these clarifications :) If it's not too much trouble, would you mind explaining how one could tell whether a page "doesn't exist" (has been removed from a site), given a URL? I'm thinking about checking the HTTP response header for 404 and other codes, but then I realized that some sites just change the contents of the page instead ("... Sorry, this page has been removed ...").
iCrawly
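
A minimal sketch combining both checks mentioned in this comment: first the HTTP status line via get_headers(), then a fallback scan of the body for a "removed" notice (the $removedPhrases list is an assumption and would need tuning per target site):

<?php
function pageExists($url) {
    $removedPhrases = ['this page has been removed', 'page not found'];

    // Check the status line first; an unreachable URL or explicit 404
    // means the page is gone.
    $headers = @get_headers($url);
    if ($headers === false || strpos($headers[0], '404') !== false) {
        return false;
    }

    // Some sites return 200 but swap the content for a removal notice.
    $body = @file_get_contents($url);
    if ($body !== false) {
        foreach ($removedPhrases as $phrase) {
            if (stripos($body, $phrase) !== false) {
                return false;
            }
        }
    }
    return true;
}
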
See if this can help you: http://hungred.com/how-to/php-check-remote-email-url-image-link-exist/ If it does not help, you can ask another question about this and you will get useful answers for this issue from other users as well.
NAVEED
Yes, that was quite interesting! Many thanks for all your help!
iCrawly