nutch

Best web graph crawler for speed?

For the past month I've been using Scrapy for a web crawling project. The project involves pulling down the full document content of every web page in a single domain that is reachable from the home page. Writing this in Scrapy was quite easy, but it simply runs too slowly. In 2-3 days I can only pull down 100,000 pa...

Architecture with 3 servers for solr search engine

I'm going to build a search engine on Solr, with Nutch as the crawler. I have to index about 13 million documents. I have 3 servers for this job: one 4-core Xeon 3 GHz with 20 GB RAM and 1.5 TB SATA, and two machines with 2x4-core Xeon 3 GHz, 16 GB RAM and 500 GB IDE each. One of the servers I can use as a master for crawling and indexing, the other two...
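
A sketch of one way to lay this out (an assumption, since the excerpt cuts off before the intended setup is described): run the Hadoop master daemons and the query-serving Solr instance on the larger box, use the other two machines as Hadoop slaves for fetching and parsing, and fan queries out over several Solr instances with the pre-SolrCloud shards parameter. Host names below are hypothetical.

    Hadoop conf/masters (on the 20 GB machine):
        server1
    Hadoop conf/slaves:
        server2
        server3

    Distributed Solr query over three shards:
    http://server1:8983/solr/select?q=nutch&shards=server1:8983/solr,server2:8983/solr,server3:8983/solr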

Identifying strings in documents, with nutch+solr?

Hi, I'm looking into a search solution that will identify strings (company names) and use these strings for search and facets in Solr. I'm new to Nutch and Solr, so I wonder whether this is best done in Nutch or in Solr. One solution would be to write a parser in Nutch that identifies the strings in question and then indexes the name of the...
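
A note on the Solr side, whichever way the extraction ends up being done in Nutch: faceting on company names mostly just needs a multi-valued, untokenized field to hold the extracted strings. A minimal sketch, with a hypothetical field name "company":

    <!-- schema.xml: untokenized, multi-valued field for extracted company names -->
    <field name="company" type="string" indexed="true" stored="true" multiValued="true"/>

    Facet query against that field:
    http://localhost:8983/solr/select?q=*:*&facet=true&facet.field=company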

Nutch: get current crawl depth in the plugin

Hi, I want to write my own HTML parser plugin for Nutch. I am doing focused crawling by generating only outlinks that fall under a specific XPath. In my use case, I want to fetch different data from the HTML pages depending on the current depth of the crawl, so I need to know the current depth in the HtmlParser plugin for each piece of content that I am par...
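
Nutch 1.x does not hand the crawl depth to parse plugins directly; one workaround (an assumption here, not something the question confirms works for its version) is to carry a depth counter in page metadata, so the parse filter can read the parent's depth and stamp an incremented value onto each outlink. A small helper sketch around Nutch's Metadata class, with a made-up key name:

    import org.apache.nutch.metadata.Metadata;

    // Hypothetical helper for carrying a depth counter in page/outlink metadata,
    // since Nutch itself does not tell a parse filter the current crawl depth.
    public class DepthHint {
      public static final String DEPTH_KEY = "x-crawl-depth";   // made-up key name

      // Read the depth stored on the current page's metadata; 0 if absent (seed page).
      public static int readDepth(Metadata meta) {
        String value = meta.get(DEPTH_KEY);
        return value == null ? 0 : Integer.parseInt(value);
      }

      // Stamp an incremented depth onto metadata destined for an outlink.
      public static void markOutlink(Metadata outlinkMeta, int parentDepth) {
        outlinkMeta.set(DEPTH_KEY, Integer.toString(parentDepth + 1));
      }
    }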

MapReduce Nutch tutorials

Hi, could someone give me pointers to tutorials that explain how to write a MapReduce program for Nutch? Thank you. ...
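
In case it helps while hunting for tutorials: Nutch 1.x jobs are ordinary Hadoop MapReduce jobs over SequenceFiles, so the pattern can be seen in a very small job over the crawldb. The class below is a made-up example using the old org.apache.hadoop.mapred API of that era; run it with the crawldb directory and an output directory as arguments.

    import java.io.IOException;

    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapred.*;
    import org.apache.hadoop.mapred.lib.LongSumReducer;
    import org.apache.nutch.crawl.CrawlDatum;

    // Illustrative job: count crawldb entries by fetch status.
    public class StatusCount extends MapReduceBase
        implements Mapper<Text, CrawlDatum, Text, LongWritable> {

      private static final LongWritable ONE = new LongWritable(1);

      public void map(Text url, CrawlDatum datum,
                      OutputCollector<Text, LongWritable> out, Reporter reporter)
          throws IOException {
        // key: numeric CrawlDatum status, value: 1
        out.collect(new Text("status-" + datum.getStatus()), ONE);
      }

      public static void main(String[] args) throws IOException {
        JobConf job = new JobConf(StatusCount.class);
        job.setJobName("crawldb-status-count");
        // the crawldb is stored as SequenceFiles of <Text url, CrawlDatum> under <crawldb>/current
        FileInputFormat.addInputPath(job, new Path(args[0], "current"));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        job.setInputFormat(SequenceFileInputFormat.class);
        job.setOutputFormat(TextOutputFormat.class);
        job.setMapperClass(StatusCount.class);
        job.setReducerClass(LongSumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(LongWritable.class);
        JobClient.runJob(job);
      }
    }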

Nutch Custom Url Partitioner

Hi, I am writing a custom search task using Nutch for an intranet crawl, and I am using Hadoop for it. I want to spread the task across multiple Hadoop slaves by dividing the seed URLs evenly. I guess this job is taken care of by the partitioner. I see that the default Nutch URLPartitioner implementation partitions URLs by host, domain or IP. I want to o...
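
For reference, a custom partitioner under the old mapred API is quite small. The sketch below (hypothetical class, not part of Nutch) spreads URLs by hashing the full URL instead of the host/domain/IP; the trade-off is that the per-host grouping which the fetcher's politeness handling relies on, and which is why the default partitions by host, no longer holds.

    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.io.Writable;
    import org.apache.hadoop.mapred.JobConf;
    import org.apache.hadoop.mapred.Partitioner;

    // Hypothetical partitioner: spreads URLs evenly across reducers by hashing the
    // full URL, instead of grouping by host/domain/IP like Nutch's URLPartitioner.
    // Caveat: a host's URLs no longer land on one node, so per-host politeness
    // has to be enforced some other way.
    public class FullUrlPartitioner implements Partitioner<Text, Writable> {

      public void configure(JobConf job) {
        // no configuration needed for this sketch
      }

      public int getPartition(Text url, Writable value, int numReduceTasks) {
        return (url.hashCode() & Integer.MAX_VALUE) % numReduceTasks;
      }
    }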

solrindex way of mapping nutch schema to solr

Hi, we have several custom Nutch fields that the crawler picks up and indexes. Transferring these to Solr via solrindex (using the mapping file) runs fine and the log shows no problems; however, the index in the Solr environment does not reflect the custom fields. Any help will be much appreciated. Thanks, Ashok ...
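
Not necessarily the cause here, but for comparison this is roughly the shape the two sides take in Nutch 1.x; the custom field name below is invented. The destination field has to exist in Solr's schema.xml, and the mapping file routes the Nutch field onto it.

    <!-- solrindex-mapping.xml: route Nutch fields onto Solr fields -->
    <mapping>
      <fields>
        <field dest="content" source="content"/>
        <field dest="title" source="title"/>
        <field dest="myfield" source="myfield"/>   <!-- hypothetical custom field -->
      </fields>
      <uniqueKey>id</uniqueKey>
    </mapping>

    <!-- schema.xml: the destination field declared on the Solr side -->
    <field name="myfield" type="string" indexed="true" stored="true"/>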

problem integrating apache nutch (release 1.2) with apache solr (trunk) - got solr exception

Hi, I have configured solrindex-mapping.xml (Nutch) and my Solr schema.xml and solrconfig.xml as well. Both work well on a single run, but if I use bin/nutch solrindex ... I get an exception: org.apache.solr.common.SolrException: Document [null] missing required field: id. I have configured the id in all the config files. ...
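
Without the actual files this is only a guess, but that exception generally means no value reached the field that schema.xml marks as the required uniqueKey; in stock Nutch 1.x / Solr setups the page URL is what normally ends up in id. The declarations that have to line up look roughly like this (a sketch, not the poster's configs):

    <!-- solr schema.xml: the required unique key field -->
    <field name="id" type="string" stored="true" indexed="true" required="true"/>
    <uniqueKey>id</uniqueKey>

    <!-- nutch solrindex-mapping.xml: the same key name declared for the mapping -->
    <uniqueKey>id</uniqueKey>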

no segments* file found

Hi, I need to access a Lucene index (created by crawling several web pages using Nutch), but it is giving the error shown in the title: java.io.FileNotFoundException: no segments* file found in org.apache.lucene.store.FSDirectory@/home/<path>: files: at org.apache.lucene.index.SegmentInfos$FindSegmentsFile.run(SegmentInfos.java:516) ...
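
That FileNotFoundException is Lucene failing to find its segments_N file, i.e. the directory being opened is not itself a Lucene index. With a Nutch 1.x bin/nutch crawl, the Lucene index normally lives under the crawl directory's index/ (or indexes/) subdirectory, not under segments/, which holds Nutch segment data. A small sketch of opening it directly with the Lucene 3.x-era API (class name and path are hypothetical):

    import java.io.File;
    import org.apache.lucene.index.IndexReader;
    import org.apache.lucene.store.FSDirectory;

    // Open the Lucene index produced by a Nutch crawl. The directory passed in must be
    // the one containing segments_N and segments.gen, e.g. <crawl-dir>/index, not the
    // Nutch segments/ directory.
    public class OpenCrawlIndex {
      public static void main(String[] args) throws Exception {
        File indexDir = new File(args[0]);   // e.g. /home/user/crawl/index (hypothetical path)
        IndexReader reader = IndexReader.open(FSDirectory.open(indexDir));
        System.out.println("documents in index: " + reader.numDocs());
        reader.close();
      }
    }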

nutch crawler relative urls problem

Has anyone experienced a problem with the way the standard HTML parser plugin handles relative URLs? There is a site - http://xxxx/asp/list_books.asp?id_f=11327 - and when browsing a link with its href set to '?id_r=442&id=41&order=' a browser will naturally take you to http://xxxx/asp/list_books.asp?id_r=442&id=41&order= However,...
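
The excerpt cuts off before saying what Nutch actually produces, but this class of problem usually comes down to how the resolver treats a query-only relative reference: RFC 3986 (what browsers follow) keeps list_books.asp in the path, while an RFC 2396-style resolver merges an empty relative path and drops the last path segment. A quick, hypothetical test class for checking what the JDK resolvers on a given machine do (the output is deliberately not asserted here):

    import java.net.URI;
    import java.net.URL;

    // Prints how the JDK resolves a query-only relative reference against the page URL.
    // Browsers keep list_books.asp in the resolved path; compare that with what each
    // resolver below produces on your JDK.
    public class RelativeUrlCheck {
      public static void main(String[] args) throws Exception {
        String base = "http://xxxx/asp/list_books.asp?id_f=11327";
        String href = "?id_r=442&id=41&order=";
        System.out.println("java.net.URI: " + new URI(base).resolve(href));
        System.out.println("java.net.URL: " + new URL(new URL(base), href));
      }
    }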

nutch crawler - how to set maximum number of inlinks per host

How can I set the maximum number of pages to index per host? I don't want to index all million pages of a site; I want to index only the first 100,000 pages found. ...
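
As far as I know Nutch 1.x has no exact lifetime cap per host, but the generator can limit how many URLs per host go into each fetch list; the property name depends on the version (older 1.x releases use generate.max.per.host, newer ones use generate.max.count with generate.count.mode). Note this is a per-generate-cycle cap, so the number of crawl rounds still matters. A nutch-site.xml sketch, values illustrative:

    <!-- older 1.x property name -->
    <property>
      <name>generate.max.per.host</name>
      <value>100000</value>
    </property>

    <!-- newer 1.x equivalent -->
    <property>
      <name>generate.max.count</name>
      <value>100000</value>
    </property>
    <property>
      <name>generate.count.mode</name>
      <value>host</value>
    </property>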

nutch and sitemap.xml

Does Apache Nutch support sitemaps, or how can I implement this myself? How can I use the priority field - should it be multiplied into the boost field? ...
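
As far as I know, Nutch 1.x of this vintage has no built-in sitemap support (a dedicated sitemap job arrived in later releases), so one do-it-yourself route is to flatten sitemap.xml into a seed list and map <priority> onto the injected score via the nutch.score metadata that newer 1.x injectors accept in seed files, rather than multiplying it into the boost directly. A rough sketch (class name made up):

    import java.io.File;
    import javax.xml.parsers.DocumentBuilderFactory;
    import org.w3c.dom.Document;
    import org.w3c.dom.Element;
    import org.w3c.dom.NodeList;

    // Hypothetical helper: turn a sitemap.xml into a Nutch seed file, mapping
    // <priority> (0.0-1.0, optional) onto the per-URL nutch.score seed metadata.
    public class SitemapToSeeds {
      public static void main(String[] args) throws Exception {
        Document doc = DocumentBuilderFactory.newInstance()
            .newDocumentBuilder().parse(new File(args[0]));
        NodeList urls = doc.getElementsByTagName("url");
        for (int i = 0; i < urls.getLength(); i++) {
          Element url = (Element) urls.item(i);
          String loc = text(url, "loc");
          if (loc == null) continue;                 // skip malformed entries
          String priority = text(url, "priority");
          String score = priority == null ? "1.0" : priority;
          System.out.println(loc + "\tnutch.score=" + score);
        }
      }

      private static String text(Element parent, String tag) {
        NodeList nodes = parent.getElementsByTagName(tag);
        return nodes.getLength() == 0 ? null : nodes.item(0).getTextContent().trim();
      }
    }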

How do you crawl external links on a found page?

I used the installation example for Nutch from their wiki. I was able to crawl multiple pages pulled from DMOZ easily. But is there a configuration that can be set to crawl the external links it finds on a page, or to write those external links to a file to be crawled next? What is the best way to follow links on a page to index that page as wel...
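
Two knobs control this in Nutch 1.x: the db.ignore.external.links property (false by default, so external outlinks are kept in the crawldb) and the URL filters, which in the wiki tutorial are what usually restrict the crawl to the seed domain. A sketch of opening things up; the regex shown is the standard catch-all from the default filter file:

    <!-- nutch-site.xml: keep following outlinks that point to other hosts -->
    <property>
      <name>db.ignore.external.links</name>
      <value>false</value>
    </property>

    # conf/regex-urlfilter.txt (or crawl-urlfilter.txt for the one-shot crawl command):
    # relax or remove the tutorial's +^http://([a-z0-9]*\.)*MY.DOMAIN.NAME/ line and
    # keep the catch-all at the end so everything else is accepted:
    +.

For dumping discovered URLs to a file rather than crawling them straight away, bin/nutch readdb <crawldb> -dump <outdir> writes out the crawldb contents, which can then be filtered into a new seed list.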