I followed the installation example on the Nutch wiki and was able to crawl multiple pages pulled from DMOZ easily. But is there a configuration option that makes Nutch crawl the external links it finds on a page, or write those external links to a file to be crawled next?

What is the best way to follow the links on a page so that Nutch indexes those pages as well? If I were executing bin/nutch from Python, could I get back all the external links it found and build a new crawl list to run again? What would you do?

Answer (+1):

First, make sure that the property 'db.ignore.external.links' is set to false. Also, in the file 'regex-urlfilter.txt', add rules for the external links you wish to crawl, OR add +. as the last rule. The +. rule makes the crawler follow ALL links; if you use that option, beware that you risk crawling the entire Web!
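For reference, here is roughly what those two changes look like (a sketch against Nutch 1.x; example.com is a placeholder for whatever hosts you actually want). The property override goes in conf/nutch-site.xml:

    <property>
      <name>db.ignore.external.links</name>
      <value>false</value>
      <description>If false, outlinks pointing to other hosts are kept
      rather than discarded during the crawldb update.</description>
    </property>

And in conf/regex-urlfilter.txt, rules are tried in order and the first match wins, so either whitelist the external hosts you care about, or let everything fall through to +. at the end:

    # accept pages on specific external hosts you want followed
    +^http://([a-z0-9-]+\.)*example\.com/
    # OR, as the very last rule, accept anything no earlier rule matched
    # (this is what risks crawling the whole Web)
    +.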

Pascal Dimassimo
Thank you very much. I will play with the regex-urlfilter.txt file to get optimal results.
Hallik
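As for driving this from Python: one rough sketch, assuming Nutch 1.x, is to dump the crawldb to plain text with bin/nutch readdb <crawldb> -dump <dir> and scrape the URLs out of it as the next seed list. The paths here (crawl/crawldb, crawldb_dump, next_seeds) are illustrative, not anything Nutch mandates:

    import os
    import re
    import subprocess

    # Dump the crawldb to plain text. "crawl/crawldb" is whatever -dir your
    # crawl used; "crawldb_dump" is an arbitrary output directory.
    subprocess.check_call(
        ["bin/nutch", "readdb", "crawl/crawldb", "-dump", "crawldb_dump"])

    # Each record in the text dump begins with its URL, so collect those.
    urls = set()
    with open("crawldb_dump/part-00000") as dump:  # Hadoop-style part file
        for line in dump:
            match = re.match(r"(https?://\S+)", line)
            if match:
                urls.add(match.group(1))

    # Write the discovered URLs out as the seed list for the next crawl.
    os.makedirs("next_seeds", exist_ok=True)
    with open("next_seeds/urls.txt", "w") as seeds:
        seeds.write("\n".join(sorted(urls)))

That said, an external loop like this is only needed if you want to filter or post-process the link list yourself: the one-shot bin/nutch crawl command (with -depth N) already feeds newly discovered links back into each successive fetch round on its own.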