Hi everyone!

I am stuck! I can't get Nutch to crawl in small batches for me. I start it with the bin/nutch crawl command and the parameters -depth 7 and -topN 10000, and it never ends. It only ends when my HDD runs out of space. What I need to do:

  1. Start crawling my seeds, with the possibility of following outlinks further.
  2. Crawl 20000 pages, then index them.
  3. Crawl another 20000 pages, index them and merge them with the first index.
  4. Loop step 3 n times.

I also tried the scripts found on the wiki, but none of them go any further. If I run them again, they do everything from the beginning, and at the end of the script I have the same index I had when I started to crawl. But I need to continue my crawl.

Some help would be very useful!

+1  A: 

You have to understand the Nutch generate/fetch/update cycles.

The generate step of the cycle will take urls (you can set a max number with the topN parameter) from the crawl db and generate a new segment. Initially, the crawl db will only contain the seed urls.

The fetch step does the actual crawling. The content of the pages is stored in the segment.

Finally, the update step updates the crawl db with the results of the fetch (adding new urls, setting the last fetch time for a url, setting the http status code of the fetch for a url, etc.).

The crawl tool will run this cycle n times, configurable with the depth parameter.
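
Expressed with the individual commands (more on those below), one cycle looks roughly like this. This is only a sketch: the crawl/ layout, the urls seed directory and the topN value are examples, not something Nutch imposes.

    # Seed the crawl db with the seed urls (only needed once, before the first cycle)
    bin/nutch inject crawl/crawldb urls

    # Generate a new segment with at most 20000 urls that are ready to be fetched
    bin/nutch generate crawl/crawldb crawl/segments -topN 20000

    # Segments are named with a timestamp, so pick up the one generate just created
    SEGMENT=`ls -d crawl/segments/2* | tail -1`

    # Fetch the pages; their content is stored in the segment
    bin/nutch fetch $SEGMENT
    # Depending on your fetcher.parse setting you may also need: bin/nutch parse $SEGMENT

    # Update the crawl db with the results of the fetch
    bin/nutch updatedb crawl/crawldb $SEGMENT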

After all cycles are complete, the crawl tool will delete all indexes in the folder from which it is launched and create a new one from all the segments and the crawl db.

So in order to do what you are asking, you should probably not use the crawl tool but instead call the individual Nutch commands, which is what the crawl tool is doing behind the scenes. That way, you will be able to control how many times you crawl and also make sure that the indexes are always merged and not deleted at each iteration.

I suggest you start with the script defined here and adapt it to your needs.
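
Something along these lines, for example. It is only a sketch to adapt, not a finished script: it assumes a Nutch 1.x install with the Lucene-based indexer and a local filesystem, the crawl/ layout is made up for the example, and the exact behaviour of the index/dedup/merge commands differs a bit between versions, so compare it against the wiki script and the usage messages of your version before relying on it.

    #!/bin/sh
    # Incremental crawl: fetch about 20000 pages per iteration, index only the
    # new segment and merge it with the indexes of the previous iterations.
    # (Assumes a local filesystem; on HDFS use hadoop fs commands for ls/rm/mv.)

    bin/nutch inject crawl/crawldb urls

    for i in 1 2 3 4 5; do
      # one generate/fetch/update cycle
      bin/nutch generate crawl/crawldb crawl/segments -topN 20000
      SEGMENT=`ls -d crawl/segments/2* | tail -1`
      bin/nutch fetch $SEGMENT
      bin/nutch updatedb crawl/crawldb $SEGMENT

      # add the new segment to the link db and index it into its own directory
      bin/nutch invertlinks crawl/linkdb $SEGMENT
      bin/nutch index crawl/indexes/run-$i crawl/crawldb crawl/linkdb $SEGMENT

      # remove duplicates across the per-iteration indexes, then merge them
      # into a single index instead of deleting and rebuilding everything
      bin/nutch dedup crawl/indexes/run-*
      bin/nutch merge crawl/index-new crawl/indexes/run-*
      rm -rf crawl/index
      mv crawl/index-new crawl/index
    done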

Pascal Dimassimo
Thank you! Now I understand how Nutch works. One question: the crawl db in the initial step contains only the seed urls. I crawled and got 100000 urls in my crawldb. When I start crawling again and do not use the -topN parameter, how many urls will Nutch take to crawl from its crawldb?
Yurish
If you do not specify a topN parameter, the generate command will take all the urls that are ready to be fetched and add them to the new segment. All the new urls that were discovered in the previous crawl will get fetched. The urls that were already fetched will be fetched again only if they are due, according to the db.fetch.interval.default and db.fetch.interval.max parameters.
Pascal Dimassimo
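In command terms the difference is just whether -topN is passed (same example crawl/ layout as in the sketches above):

    # Cap the new segment at 20000 urls that are ready to be fetched
    bin/nutch generate crawl/crawldb crawl/segments -topN 20000

    # No -topN: every url in the crawldb that is due for fetching goes into the
    # new segment (new urls plus old ones whose fetch interval has expired)
    bin/nutch generate crawl/crawldb crawl/segments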
So I don't even need to specify depth, right? Nutch will take all 100000 urls in one depth. Is that right?
Yurish
Yes, depth only applies to the crawl command, where it specifies the number of generate/fetch/update cycles to do. And yes, the generate command should take all the urls that are ready to be fetched.
Pascal Dimassimo