bounding Heritrix depth

tags:

web-crawler

Hi,

I am new to Heritrix and am using Heritrix 1.14. I don't know how to do the following: 1) bound the BFS depth of downloaded links to a specific number, for example to 3. 2) restrict the downloaded types to html and text.

I highly appreciate your attention.

First of all, I may be confusing concepts from Heritrix 2 (which I use more) with Heritrix 1 (which I haven't used for quite a while). Sorry if I do.

Depth is part of the scope settings. BroadScope should have a depth-limiting setting; alternatively, you can define the scope with a DecidingScope.

As for which file types to download, I believe that should be set on the MirrorWriterProcessor you use to archive the crawled files (in 2.x it is a sequence of DecideRules).

By the way, wget and HTTrack are easier to configure for this type of task, at least if you just need the most current copy of the web page(s) in question.

Radtoo 2010-06-20 12:44:09

Thanks a lot, Radtoo, but I could not find the depth option in the Settings tab.

Mohsen Ghafoorian 2010-06-20 13:23:13

1) bound the BFS depth of downloaded links to a specific number, for example to 3.

Set max-link-hops to 3. See section 6.3.2, "Scope settings", in the manual.
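To make the effect of a hop limit concrete, here is a minimal sketch of a breadth-first crawl bounded by link hops. It is not Heritrix code: the function `crawl_within_hops`, the `LINKS` graph, and the page names are all hypothetical, and it assumes that a page exactly at the limit is still fetched but its links are not followed.

```python
from collections import deque

# Hypothetical in-memory link graph standing in for real pages.
LINKS = {
    "seed": ["a", "b"],
    "a": ["c"],
    "c": ["d"],
    "d": ["e"],
}

def crawl_within_hops(seed, links, max_link_hops):
    """Return {url: hops} for every URL reachable within max_link_hops of seed."""
    hops = {seed: 0}
    queue = deque([seed])
    while queue:
        url = queue.popleft()
        if hops[url] == max_link_hops:
            # Assumption: a page at the limit is kept, but its links are ignored.
            continue
        for target in links.get(url, []):
            if target not in hops:  # first visit in BFS is the shortest path
                hops[target] = hops[url] + 1
                queue.append(target)
    return hops

result = crawl_within_hops("seed", LINKS, 3)
```

With a limit of 3, `result` contains `seed`, `a`, `b`, `c`, and `d`, while `e` (four hops from the seed) is left out.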

2) restrict the downloaded types to html and text.

Configure a ContentTypeRegExpFilter to match only text/plain and text/html. See section 6.2.2.2, "Provided filters", in the manual.
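As a rough illustration of what such a regular expression could look like, the sketch below tests a candidate pattern against a few Content-Type values. The pattern and the helper `is_allowed` are assumptions, not taken from the manual. It also assumes the whole header value must match, which is why an optional `; charset=...` suffix is allowed for. Python's `re` is used only to try the pattern out; Heritrix itself uses Java regular expressions, which treat this simple pattern the same way.

```python
import re

# Assumed pattern: text/html or text/plain, optionally followed by
# parameters such as "; charset=UTF-8".
ALLOWED = re.compile(r"text/(html|plain)(\s*;.*)?", re.IGNORECASE)

def is_allowed(content_type):
    """Return True if the whole Content-Type value matches the pattern."""
    return ALLOWED.fullmatch(content_type.strip()) is not None

# Sample header values to try against the pattern.
samples = ["text/html", "text/html; charset=UTF-8", "text/plain",
           "text/css", "image/png"]
accepted = [s for s in samples if is_allowed(s)]
```

Here `accepted` keeps the three `text/html` and `text/plain` values and drops `text/css` and `image/png`.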

Bart Kiers 2010-06-21 20:04:57
