I am working on a crawler in PHP that starts from m URLs, at each of which it finds a set of n links to n internal pages that are crawled for data. Links may be added to or removed from the n set over time. I need to keep track of the links/pages so that I know which have been crawled, which have been removed, and which are new.

How should I go about keeping track of which m and n pages have been crawled, so that the next crawl fetches new URLs, re-checks still-existing URLs, and ignores obsolete URLs?

+1  A: 

If you want to store this data for the long term, use a database. You can store the crawled m URLs and their n URLs in the database along with their statuses. When you are going to crawl again, first check the database for already-crawled URLs.

For example:

Store your mURLs in an mtable, something like this:

 id |        mURL           | status       |    crawlingDate
------------------------------------------------------------------
 1  | example.com/one.php   | crawled      |   01-01-2010 12:30:00
 2  | example.com/two.php   | crawled      |   01-01-2010 12:35:10
 3  | example.com/three.php | not-crawled  |   01-01-2010 12:40:33

Now fetch each mURL from mtable, get all of its nURLs, and store them in an ntable, something like this:

 id |        nURL             | mURL_id |  status      | crawlingDate
----------------------------------------------------------------------------
 1  | www.one.com/page1.php   |    1    |  crawled     | 01-01-2010 12:31:00
 2  | www.one.com/page2.php   |    1    |  crawled     | 01-01-2010 12:32:00
 3  | www.two.com/page1.php   |    2    |  crawled     | 01-01-2010 12:36:00
 4  | www.two.com/page2.php   |    2    |  crawled     | 01-01-2010 12:37:00
 5  | www.three.com/page1.php |    3    |  not-crawled | 01-01-2010 12:41:00
 6  | www.three.com/page2.php |    3    |  not-crawled | 01-01-2010 12:42:00
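
A minimal sketch of how these two tables could be created from PHP with PDO (the MySQL DSN, credentials, column sizes, and the UNIQUE constraints are my assumptions, not part of the answer above):

<?php
// Sketch only: assumes a MySQL database reachable through PDO.
$db = new PDO('mysql:host=localhost;dbname=crawler', 'user', 'password');
$db->setAttribute(PDO::ATTR_ERRMODE, PDO::ERRMODE_EXCEPTION);

// mtable holds the m entry-point URLs.
$db->exec("CREATE TABLE IF NOT EXISTS mtable (
    id           INT AUTO_INCREMENT PRIMARY KEY,
    mURL         VARCHAR(255) NOT NULL UNIQUE,
    status       ENUM('crawled', 'not-crawled') DEFAULT 'not-crawled',
    crawlingDate DATETIME NULL
)");

// ntable holds the n links found on each mURL, keyed back via mURL_id.
$db->exec("CREATE TABLE IF NOT EXISTS ntable (
    id           INT AUTO_INCREMENT PRIMARY KEY,
    nURL         VARCHAR(255) NOT NULL UNIQUE,
    mURL_id      INT NOT NULL,
    status       ENUM('crawled', 'not-crawled') DEFAULT 'not-crawled',
    crawlingDate DATETIME NULL,
    FOREIGN KEY (mURL_id) REFERENCES mtable(id)
)");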

When you crawl the next time, first fetch the records from mtable one by one and get all of the nURLs for each mURL. Store each nURL in ntable if it does not already exist there. Then crawl each nURL whose status is not-crawled to get its data, and set its status to crawled when done. When all nURLs for one mURL are done, you can set the status to crawled for that mURL in mtable.
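
A rough sketch of that pass in PHP, reusing the $db connection from the schema sketch above; extractLinks() and crawlPage() are hypothetical helpers standing in for your own fetching and parsing code:

<?php
// One full crawl pass (sketch). INSERT IGNORE relies on the UNIQUE
// constraint on nURL to skip links that are already in ntable.
$insertN = $db->prepare("INSERT IGNORE INTO ntable (nURL, mURL_id, status)
                         VALUES (?, ?, 'not-crawled')");
$pending = $db->prepare("SELECT id, nURL FROM ntable
                         WHERE mURL_id = ? AND status = 'not-crawled'");
$markN   = $db->prepare("UPDATE ntable SET status = 'crawled',
                         crawlingDate = NOW() WHERE id = ?");
$markM   = $db->prepare("UPDATE mtable SET status = 'crawled',
                         crawlingDate = NOW() WHERE id = ?");

foreach ($db->query("SELECT id, mURL FROM mtable") as $m) {
    // Store every link found on the mURL; known nURLs are skipped.
    foreach (extractLinks($m['mURL']) as $link) {
        $insertN->execute([$link, $m['id']]);
    }

    // Crawl only the nURLs that have not been crawled yet.
    $pending->execute([$m['id']]);
    foreach ($pending->fetchAll(PDO::FETCH_ASSOC) as $n) {
        crawlPage($n['nURL']);        // fetch and process the page data
        $markN->execute([$n['id']]);  // mark this nURL as done
    }

    // All nURLs for this mURL are done, so mark the mURL itself.
    $markM->execute([$m['id']]);
}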

If you don't want to use a database and only need to run the crawler once, you can implement the same logic with arrays, as in the sketch below.
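
For example, a single-run version with plain arrays might look like this (again a sketch; extractLinks() and crawlPage() are the same hypothetical helpers as above):

<?php
$mURLs   = ['example.com/one.php', 'example.com/two.php'];
$crawled = [];  // nURL => true; this array plays the role of the status column

foreach ($mURLs as $mURL) {
    foreach (extractLinks($mURL) as $nURL) {
        if (isset($crawled[$nURL])) {
            continue;              // already crawled during this run
        }
        crawlPage($nURL);
        $crawled[$nURL] = true;    // remember it for the rest of the run
    }
}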

Hopefully this helps give you a direction.

NAVEED
Thank you very much, Naveed! This has already given me lots of ideas and a path forward. One thing: don't I need to keep track of the crawling date/time so that I know when to re-crawl each site/page?
iCrawly
You can add a datetime column to both tables. When an nURL's crawling is done, set the current time in the date column of ntable, and when all nURLs of an mURL are done, set the current time in the date column of mtable.
NAVEED
When you crawl the next time, you can check each mURL's crawlingDate and compare it with the current date. If the difference is 24 hours (for example), crawl its nURLs; otherwise move on to the next mURL.
NAVEED
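
A sketch of that check as a single query, assuming the schema above (the 24-hour interval is just the example figure from this comment):

<?php
// Select only mURLs that were never crawled, or whose last crawl
// was more than 24 hours ago.
$due = $db->query("SELECT id, mURL FROM mtable
                   WHERE crawlingDate IS NULL
                      OR crawlingDate < NOW() - INTERVAL 24 HOUR");
foreach ($due as $m) {
    // ...re-crawl this mURL's nURLs exactly as in the main loop above
}
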
Thank you very much for these clarifications :) If it's not too much trouble, would you mind explaining how one could tell whether a page "doesn't exist" (has been removed from a site), given a URL? I'm thinking about checking the HTTP response header for 404 and other codes, but then I realized that some sites just change the contents of the page instead ("... Sorry, this page has been removed ...").
iCrawly
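
A minimal sketch combining both checks mentioned in this comment: first the HTTP status line via get_headers(), then a fallback scan of the body for a "removed" notice (the $removedPhrases list is an assumption and would need tuning per target site):

<?php
function pageExists($url) {
    $removedPhrases = ['this page has been removed', 'page not found'];

    // Check the status line first; an unreachable URL or explicit 404
    // means the page is gone.
    $headers = @get_headers($url);
    if ($headers === false || strpos($headers[0], '404') !== false) {
        return false;
    }

    // Some sites return 200 but swap the content for a removal notice.
    $body = @file_get_contents($url);
    if ($body !== false) {
        foreach ($removedPhrases as $phrase) {
            if (stripos($body, $phrase) !== false) {
                return false;
            }
        }
    }
    return true;
}
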
See if this can help you: http://hungred.com/how-to/php-check-remote-email-url-image-link-exist/ If it does not help, you can ask another question about this and you will get useful answers for this issue from other users as well.
NAVEED
Yes, that was quite interesting! Many thanks for all your help!
iCrawly