tags:

views:

61

answers:

1

I can successfully run the crawl command via Cygwin on Windows XP, and I can also search the crawled pages via Tomcat.

However, I also want to save the parsed pages during the crawl.

So when I start crawling like this:

bin/nutch crawl urls -dir crawled -depth 3

I also want to save the parsed HTML pages as text files.

That is, during the crawl started with the command above, whenever Nutch fetches a page it should automatically save the parsed content of that page (text only) to a text file.

The file names could be the fetched URLs.

I really need help with this.

This will be used in my university language-detection project.

Thanks.

A: 

The crawled pages are stored in the segments. You can access them by dumping the segment content:

nutch readseg -dump crawl/segments/20100104113507/ dump

You will have to do this for each segment.
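Since a crawl with `-depth 3` produces one segment per generate/fetch round, you can loop over them instead of dumping each by hand. Here is a minimal sketch, assuming the default layout `crawled/segments/<timestamp>` created by the question's `bin/nutch crawl urls -dir crawled` command; the `dump_segments` helper and the `dump-<timestamp>` output naming are my own, and the function only prints the commands so you can review them before piping the output to `sh`:

```shell
# dump_segments: print one "nutch readseg -dump" command per segment
# under the given crawl directory (hypothetical helper; layout assumed
# to be <crawl_dir>/segments/<timestamp> as in a default Nutch crawl)
dump_segments() {
  crawl_dir=$1
  for segment in "$crawl_dir"/segments/*; do
    [ -d "$segment" ] || continue   # skip if no segments exist yet
    # dump each segment into its own directory, e.g. dump-20100104113507
    echo bin/nutch readseg -dump "$segment" "dump-$(basename "$segment")"
  done
}

# Example: review the commands, then run them:
#   dump_segments crawled
#   dump_segments crawled | sh
```

Each resulting dump directory contains a plain-text file with the segment's records, which you can then post-process for the language-detection work.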

Pascal Dimassimo