Hi,

I have a question about crawling files that are accessible via HTTP, specifically PDF files.

I am not able to do this using Nutch 1.0. The URLs I am crawling look like this: http://www.ontla.on.ca/library/repository/ser/140213/2006/

However, I do not see any data fetched. The generated files are only 1 KB.

On the local file system, using the file protocol, it works.

Can someone give me some pointers, please?

Thanks

A: 

Add this property to the nutch-site.xml file:

name: plugin.includes
value: protocol-httpclient|urlfilter-regex|parse-(html|text|pdf)|index-(basic|anchor)|query-(basic|site|url)|response-(json|xml)|summary-basic|scoring-opic|urlnormalizer-(pass|regex|basic)
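Written out as XML, the property might look roughly like the sketch below. It assumes the usual layout of nutch-site.xml, with each property in a `<property>` element inside the `<configuration>` root. The comment is only illustrative, and the value is the one given above, unchanged.

```xml
<?xml version="1.0"?>
<configuration>
  <!-- Overrides the default plugin list so that parse-pdf is loaded
       and HTTP fetching goes through protocol-httpclient. -->
  <property>
    <name>plugin.includes</name>
    <value>protocol-httpclient|urlfilter-regex|parse-(html|text|pdf)|index-(basic|anchor)|query-(basic|site|url)|response-(json|xml)|summary-basic|scoring-opic|urlnormalizer-(pass|regex|basic)</value>
  </property>
</configuration>
```

If nutch-site.xml already has a `<configuration>` element, only the `<property>` block needs to go inside it.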

Sunil