Crawling Files using http protocol | ansaurus

tags:

nutch

Q:

Crawling Files using http protocol

Hi,

I have a question about crawling files that are accessible via HTTP, specifically PDF files.

I am not able to do this with Nutch 1.0. The URL I am crawling is similar to this one: http://www.ontla.on.ca/library/repository/ser/140213/2006/

However, I do not see any data fetched; the generated files are only 1 KB.

On the local file system, using the file protocol, I am able to do it.

Can someone give me some pointers, please?

Thanks

A:

Add this property to the nutch-site.xml file:

plugin.includes

protocol-httpclient|urlfilter-regex|parse-(html|text|pdf)|index-(basic|anchor)|query-(basic|site|url)|response-(json|xml)|summary-basic|scoring-opic|urlnormalizer-(pass|regex|basic)
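
For reference, here is a sketch of how that property would probably be written in nutch-site.xml. It assumes the usual Hadoop-style `<property>` / `<name>` / `<value>` layout that Nutch configuration files use, and that the entry sits inside the file's top-level `<configuration>` element. The value is copied from the answer unchanged; the `parse-(html|text|pdf)` part is what turns on the PDF parser.

```xml
<!-- Sketch only: place inside the <configuration> element of conf/nutch-site.xml. -->
<!-- Assumption: standard property layout; the value is taken verbatim from the answer. -->
<property>
  <name>plugin.includes</name>
  <value>protocol-httpclient|urlfilter-regex|parse-(html|text|pdf)|index-(basic|anchor)|query-(basic|site|url)|response-(json|xml)|summary-basic|scoring-opic|urlnormalizer-(pass|regex|basic)</value>
</property>
```

A value set in nutch-site.xml overrides the default from nutch-default.xml, so the full plugin list has to be given here, not only the added parser.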

Sunil 2009-10-29 08:45:25
