Hi everyone, I want to know how I can crawl PDF files that are served on the internet using Nutch-1.0 over the http protocol.

I am able to do it on a local file system using the file:// protocol, but not over http.

A: 

Add this property to the nutch-site.xml file; then you will be able to crawl the PDF files:

plugin.includes = protocol-httpclient|urlfilter-regex|parse-(html|text|pdf)|index-(basic|anchor)|query-(basic|site|url)|response-(json|xml)|summary-basic|scoring-opic|urlnormalizer-(pass|regex|basic)
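Spelled out as a full nutch-site.xml entry, the property looks like the following sketch (the surrounding configuration element is the file's standard wrapper; the description text is illustrative):

```xml
<configuration>
  <property>
    <name>plugin.includes</name>
    <value>protocol-httpclient|urlfilter-regex|parse-(html|text|pdf)|index-(basic|anchor)|query-(basic|site|url)|response-(json|xml)|summary-basic|scoring-opic|urlnormalizer-(pass|regex|basic)</value>
    <description>Enabled plugins: protocol-httpclient fetches over http,
    and parse-pdf parses the fetched PDF documents.</description>
  </property>
</configuration>
```

Values set in nutch-site.xml override the defaults shipped in nutch-default.xml, so this entry replaces the default plugin.includes list.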

Sunil
A: 

Thanks, Sunil, for your reply.

Unfortunately, this does not help me. In the log I keep getting:

2009-10-30 10:38:46,574 WARN crawl.Generator (Generator.java:generate(493)) - Generator: 0 records selected for fetching, exiting ...
2009-10-30 10:38:46,574 INFO crawl.Crawl (Crawl.java:main(119)) - Stopping at depth=1 - no more URLs to fetch.

After the fetch, I see that all the files are 1 KB in size, and in the indexes I find only 6 files: 5 of 1 KB and 1 of 0 KB.

It is not complete.

I must be missing something. If anyone can try the URL I am using, http://www.ontla.on.ca/library/repository/ser/140213/, and crawl it successfully, please let me know.

Thanks, everyone.

Pramila