ansaurus

Question

How do I make Nutch crawl the file system?

Answer 1

+1 A:

Nutch has intranet crawling available. You can read the details here

Sumit Ghosh 2009-06-12 18:25:53

Answer 2

+2 A:

From the Nutch Wiki:

How do I index my local file system?

http://wiki.apache.org/nutch/FAQ#head-c721b23b43b15885f5ea7d8da62c1c40a37878e6

1) crawl-urlfilter.txt needs a change to allow file: URLs while not following http: ones. Otherwise it either won't index anything, or it will jump off your disk onto web sites. Change this line:

  -^(file|ftp|mailto|https):

  to this:

  -^(http|ftp|mailto|https):

2) crawl-urlfilter.txt may have rules at the bottom to reject some URLs. If it has this fragment it's probably ok:

  # accept anything else
  +.*

3) I changed my nutch.xml to include the following:

<Parameter override="false" name="plugin.includes" value="protocol-file|protocol-http|urlfilter-regex|parse-(msword|pdf|text|html|js)|index-(basic|anchor)|query-(basic|site|url)|response-(json|xml)|summary-basic|scoring-opic|urlnormalizer-(pass|regex|basic)"/>
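If you want to sanity-check these settings before starting a crawl, here is a minimal Python sketch that imitates the two checks outside of Nutch. It is not Nutch code. It assumes that the URL filter rules are tried in file order with the first matching rule deciding (and no match meaning rejection), and that a plugin is enabled when its whole id matches the plugin.includes regex. The function names and the sample URLs and paths are made up for illustration.

```python
import re

# Rules from crawl-urlfilter.txt after the edit in step 1, in file order.
# A leading "-" rejects, a leading "+" accepts, "#" starts a comment line.
URLFILTER_RULES = """\
-^(http|ftp|mailto|https):
# accept anything else
+.*
"""

# Value of plugin.includes copied from step 3.
PLUGIN_INCLUDES = (
    "protocol-file|protocol-http|urlfilter-regex|"
    "parse-(msword|pdf|text|html|js)|index-(basic|anchor)|"
    "query-(basic|site|url)|response-(json|xml)|summary-basic|"
    "scoring-opic|urlnormalizer-(pass|regex|basic)"
)


def parse_rules(text):
    """Turn rule lines into (accept, compiled_regex) pairs."""
    rules = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        rules.append((line[0] == "+", re.compile(line[1:])))
    return rules


def is_accepted(url, rules):
    """Assumption: the first rule found in the URL decides; no match rejects."""
    for accept, pattern in rules:
        if pattern.search(url):
            return accept
    return False


def is_plugin_enabled(plugin_id, includes=PLUGIN_INCLUDES):
    """Assumption: the whole plugin id must match the plugin.includes regex."""
    return re.fullmatch(includes, plugin_id) is not None


if __name__ == "__main__":
    rules = parse_rules(URLFILTER_RULES)
    # Hypothetical sample URLs, one local and one remote.
    for url in ("file:///data/docs/report.pdf", "http://example.com/"):
        verdict = "accepted" if is_accepted(url, rules) else "rejected"
        print(url, "->", verdict)
    for plugin_id in ("protocol-file", "parse-pdf", "protocol-ftp"):
        print(plugin_id, "->", is_plugin_enabled(plugin_id))
```

Swapping the first rule back to the original `-^(file|ftp|mailto|https):` line makes the file: URL come out rejected, which is the "won't index anything" case described in step 1.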

Robert Nickens 2009-07-12 03:39:23
