How can I set a maximum number of pages to index per host? I don't want to index all million pages of a site; I only want to index the first 100,000 pages found.

A: 

With depth=10 and topN=1000, you will not have more than 10,000 documents in your index (if you don't re-crawl). The 'depth' parameter indicates how many crawl iterations Nutch will run. The 'topN' parameter controls the maximum number of URLs that will be fetched during one iteration. So multiplying 'depth' by 'topN' gives an approximation of how many URLs will be indexed. It is only an approximation because some URLs may time out or return a 404.
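For reference, here is a sketch of how these parameters are passed to the legacy all-in-one crawl command (the seed and output directory names are placeholders):

```shell
# Legacy Nutch one-shot crawl (directory names are placeholders):
#   -depth: number of generate/fetch/update iterations
#   -topN:  maximum URLs fetched per iteration
# Upper bound on fetched pages is roughly depth * topN = 10 * 1000 = 10000
bin/nutch crawl urls -dir crawl -depth 10 -topN 1000
```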

If you don't want to re-crawl, make sure 'db.fetch.interval.default' is set to a value high enough for the crawl job to complete. If that interval expires before the crawl job finishes, you will start re-crawling some URLs, and the number of URLs indexed will end up less than depth*topN.
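That interval is set in nutch-site.xml (the value is in seconds; 30 days shown below is, to my knowledge, the shipped default — raise it so it exceeds your total crawl time):

```xml
<!-- nutch-site.xml: time before a successfully fetched page is re-fetched -->
<property>
  <name>db.fetch.interval.default</name>
  <!-- 2592000 seconds = 30 days; increase this if your crawl takes longer -->
  <value>2592000</value>
</property>
```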

— Pascal Dimassimo

Comments:

I just want an option to limit the number of inlinks per domain, so that even with depth 100 and topN 10000 it will crawl only the first 10000 links and will not add more inlinks.

Did it complete 100 iterations? — Pascal Dimassimo