nutch

How to develop Nutch for better Arabic searching technology?

I am a Computer Science student working on a project based on the Nutch search engine. I want to develop Java algorithms to better index and search Arabic websites. How can I optimize for this purpose? Any ideas? ...

Nutch versus Solr

I am currently gathering information on whether I should use Nutch, Solr, or Nutch with Solr (domain: vertical web search). What would you suggest? ...

Crawl websites from a Java web application without using bin/nutch

Hi :) I am trying to use Nutch (1.1) without bin/nutch from my (Java) Mojarra 2.0.2 webapp... I have searched Google for examples, but found none that show how to do this :/ ... I get an exception and the job fails :/ (I think it has something to do with Hadoop)... Here is my code: public void run() throws Exception { ...

Can't search in a certain field using Solr

Hi, I'm setting up an environment using Nutch 1.0 + Solr 1.4. In Nutch I configured the subcollection plugin, which seems to work nicely. If I search as normal, adding fl=*, I can see the subcollection field is filled as intended (something like <str name="subcollection">mysite.com</str>). My problem is, I would like to be able to sear...

Which Linux distribution is best suited for Nutch-Hadoop?

Hi experts, we are trying to figure out which Linux distribution is best suited for Nutch-Hadoop integration. We are planning to use clusters for crawling large amounts of content through Nutch. Let me know if you need more clarification on this question. Thank you. ...

Nutch crawling with seed URLs in a range

Some sites have a URL pattern such as www.___.com/id=1 to www.___.com/id=1000. How can I crawl such a site using Nutch? Is there any way to provide seeds for fetching in a range? ...
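
One possible approach, sketched here under assumptions rather than taken from the original thread, is to generate the seed list ahead of time, since a Nutch seed list is a plain-text file with one URL per line. The host `www.example.com`, the class name `SeedRangeGenerator`, and the output path `urls/seed.txt` are hypothetical placeholders.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.List;

class SeedRangeGenerator {

    // Builds prefix + id for every id in [first, last], inclusive.
    static List<String> seedUrls(String prefix, int first, int last) {
        List<String> urls = new ArrayList<>();
        for (int id = first; id <= last; id++) {
            urls.add(prefix + id);
        }
        return urls;
    }

    public static void main(String[] args) throws IOException {
        // Hypothetical host and output path; replace with the real site and seed directory.
        List<String> urls = seedUrls("http://www.example.com/id=", 1, 1000);
        Path seedFile = Paths.get("urls", "seed.txt");
        Files.createDirectories(seedFile.getParent());
        Files.write(seedFile, urls); // writes one URL per line
    }
}
```

The resulting directory could then be passed to the crawl as the seed directory, with the URL filter still restricting the crawl to the intended site.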

Configure HTTP Post data input to Nutch before crawling a site

I have to crawl a site that lists items based on user input submitted through HTTP POST. How do I configure the POST submission details in Nutch? I found help on HttpPostAuthentication, but none on how to submit POST data other than a username and password. ...

Spell Checker in Nutch 1.0

Can anyone tell me how to implement a spell checker in Nutch 1.0? ...

Getting nutch to prioritize frequently updated pages?

Is there a way to get Nutch to crawl frequently updated pages more often? E.g. index pages and feeds. It would also be valuable to refresh new pages that contain comments more frequently during the first day after the page was created. Any tips are appreciated. ...

Nutch + MySQL integration

Hi, when Nutch finishes its cycle (crawl, fetch, parse, index), I do not want it to build a Lucene index during the index phase. Instead, I want Nutch to place all the crawled data (I believe it keeps them as NutchDocument objects) into MySQL using my code. Is there any way to do this? Thanks ...

How to get images in Nutch results?

Hi, how can I get images in Nutch results? Can you please explain whether this is possible with images? Or is there any other open-source search engine that produces results with images? Thanks, Murali ...

Nutch 1.1 schema.xml

Hi, I recently downloaded the latest version of Nutch (nutch-1.1). While going through its code, I noticed that there is a conf/schema.xml file which defines the schema for the Solr part bundled with Nutch. This schema.xml has fields for every plugin. My question is: how do I find out what values a particular plugin is returning? In other words, ...

How to Index Only Pages with Certain Urls with Nutch?

Hi, I want Nutch to crawl abc.com, but I want to index only car.abc.com. car.abc.com links can appear at any level in abc.com. So, basically, I want Nutch to keep crawling abc.com normally, but index only pages that start with car.abc.com, e.g. car.abc.com/toyota...car.abc.com/honda... I set the regex-urlfilter.txt to include only car.abc....
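
The distinction the question draws, a wide crawl scope and a narrow index scope, can be sketched as two separate host checks. This is only an illustration of the logic in plain Java, not Nutch's filter API; the class `IndexScope` and its method names are hypothetical, and where such checks would be wired into Nutch is left open.

```java
import java.net.URI;

class IndexScope {

    // Crawl scope: abc.com itself or any of its subdomains.
    static boolean shouldCrawl(String url) {
        String host = hostOf(url);
        return host.equals("abc.com") || host.endsWith(".abc.com");
    }

    // Index scope: only pages served from car.abc.com.
    static boolean shouldIndex(String url) {
        return hostOf(url).equals("car.abc.com");
    }

    // Returns the lower-cased host, or "" when the URL has no host part.
    static String hostOf(String url) {
        String host = URI.create(url).getHost();
        return host == null ? "" : host.toLowerCase();
    }

    public static void main(String[] args) {
        String[] samples = {
            "http://www.abc.com/news",
            "http://car.abc.com/toyota",
            "http://other.org/page"
        };
        for (String url : samples) {
            System.out.println(url + " crawl=" + shouldCrawl(url) + " index=" + shouldIndex(url));
        }
    }
}
```

Keeping the two checks separate means the URL filter used for fetching does not have to be narrowed to car.abc.com, which would otherwise stop the crawler from following links through the rest of abc.com.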

Directed crawl using Nutch or Heritrix

Hi, I have looked at how Nutch and Heritrix crawl. Both have the concept of generate/fetch/update cycles, which start with some seed URLs and iterate over the resulting URLs after the fetch step. The scoping/filtering logic works on regular expressions applied to the extracted URLs. I want to do something very specific. I don't want to...

Comparison of Nutch vs. Heritrix

Hi, I want to select one of the above for building a crawling framework for specific websites. This is not an internet-wide crawl. I am not building a search index; rather, I am interested in scraping specific pages from the website. Could somebody please detail the pros and cons of the above? Thanks, Nayn ...

Building vertical crawler using Bixo

Hi, I came across an open-source crawler, Bixo. Has anyone tried it? Could you please share what you learned? Could we build a directed crawler with reasonable ease (compared to Nutch/Heritrix)? Thanks, Nayn ...

What jars from Nutch do I need to write my own Crawl.java?

Hi, I am trying to write my own version of Crawl.java from Nutch, where I'd do things a little differently. I don't want to work with the Nutch source code. I just want to cleanly import a few jars and get going with my application. How should I provide conf/crawl-urlfilter.txt and other required conf files? Could someone help me here? Thanks ...

How to add "did you mean" to a Nutch-Lucene search engine

I am having a problem implementing this suggestion feature in my Bangla search engine. Could anyone kindly help me out? ...

Bypassing authentication for localhost in order to implement search in Etherpad

I'm trying to integrate a Nutch + Solr based search engine into my Etherpad installation. The main issue I'm having is that Nutch doesn't support POST authentication. Etherpad and Nutch are installed on the same machine, so an obvious solution would be to find a way to bypass authentication for localhost. This is where I'm stuck. I don't ...

Solr AND+OR query: how to do it?

In Nutch I'm using Solr as a search server. I would like to perform a query like (hillary AND clinton) OR (barack AND obama) OR (..). How do I do it? For me a single OR query works, like india OR pakistan OR china: query.AddNotRequiredTerm(term); a single AND query also works, like india AND pakistan AND china: query.AddRequiredTerm(term); But...
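
Solr's standard (Lucene) query syntax accepts parenthesized groups combined with AND and OR, so one way to get the nested form is to build the query string directly. The following is a minimal sketch assuming the query is sent as plain text in the q parameter; the class `BooleanQueryBuilder` and its methods are hypothetical and unrelated to the query object used in the question.

```java
import java.util.Arrays;
import java.util.List;
import java.util.stream.Collectors;

class BooleanQueryBuilder {

    // Joins the terms of one group with AND and wraps the group in parentheses.
    static String andGroup(List<String> terms) {
        return "(" + String.join(" AND ", terms) + ")";
    }

    // Joins several AND groups with OR.
    static String orOfAnds(List<List<String>> groups) {
        return groups.stream()
                .map(BooleanQueryBuilder::andGroup)
                .collect(Collectors.joining(" OR "));
    }

    public static void main(String[] args) {
        String q = orOfAnds(Arrays.asList(
                Arrays.asList("hillary", "clinton"),
                Arrays.asList("barack", "obama")));
        System.out.println(q); // prints: (hillary AND clinton) OR (barack AND obama)
    }
}
```

The resulting string would still need to be URL-encoded before being placed in a request URL, and terms containing special query characters would need escaping, which this sketch does not handle.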