Hi everyone, I want to know how I can crawl PDF files that are served on the internet using Nutch-1.0 over the http protocol.

I am able to do it on a local file system using the file:// protocol, but not over http.

A: 

Add this property to the nutch-site.xml file; then you will be able to crawl the PDF files:

plugin.includes = protocol-httpclient|urlfilter-regex|parse-(html|text|pdf)|index-(basic|anchor)|query-(basic|site|url)|response-(json|xml)|summary-basic|scoring-opic|urlnormalizer-(pass|regex|basic)
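Spelled out as a full nutch-site.xml entry, the property looks like the following sketch (the surrounding configuration element is the file's standard wrapper; the description text is illustrative):

```xml
<configuration>
  <property>
    <name>plugin.includes</name>
    <value>protocol-httpclient|urlfilter-regex|parse-(html|text|pdf)|index-(basic|anchor)|query-(basic|site|url)|response-(json|xml)|summary-basic|scoring-opic|urlnormalizer-(pass|regex|basic)</value>
    <description>Enabled plugins: protocol-httpclient fetches over http,
    and parse-pdf parses the fetched PDF documents.</description>
  </property>
</configuration>
```

Values set in nutch-site.xml override the defaults shipped in nutch-default.xml, so this entry replaces the default plugin.includes list.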

Sunil
A: 

Thanks, Sunil, for your reply.

Unfortunately, this does not help me. In the log I keep getting:

2009-10-30 10:38:46,574 WARN crawl.Generator (Generator.java:generate(493)) - Generator: 0 records selected for fetching, exiting ...
2009-10-30 10:38:46,574 INFO crawl.Crawl (Crawl.java:main(119)) - Stopping at depth=1 - no more URLs to fetch.

After the fetch, I see that all the files are 1 KB in size, and in the indexes I find only 6 files: 5 of 1 KB and 1 of 0 KB.

It is not complete.

I must be missing something. If anyone can try the URL I am using, http://www.ontla.on.ca/library/repository/ser/140213/, and crawl it successfully, please let me know.

Thanks, everyone.

Pramila