I am a Computer Science student working on a project based on the Nutch search engine. I want to develop Java algorithms to better index and search Arabic websites. How can I optimize for this purpose? Any ideas?
...
I am currently gathering information to decide whether I should use Nutch, Solr, or Nutch with Solr (domain: vertical web search). What would you suggest?
...
hi :)
I am trying to use Nutch (1.1) without bin/nutch from my (Java) Mojarra 2.0.2 webapp... I have searched Google for examples, but found none that show how to do this :/ ... I get an exception and the job fails :/ (I think it has something to do with Hadoop)... Here is my code:
public void run() throws Exception {
...
Hi,
I'm setting up an environment using Nutch 1.0 + Solr 1.4.
In Nutch I configured the subcollection plugin, which seems to work nicely. If I search as normal, adding fl=*, I can see the subcollection field is filled as intended (something like <str name="subcollection">mysite.com</str>).
My problem is, I would like to be able to sear...
Hi experts,
We are trying to figure out which Linux distribution would be best suited for Nutch-Hadoop integration.
We are planning to use clusters to crawl large amounts of content with Nutch.
Let me know if you need more clarification on this question.
Thank you.
...
Some sites have a URL pattern such as www.___.com/id=1 to www.___.com/id=1000. How can I crawl such a site using Nutch? Is there any way to provide seeds for fetching in a range?
...
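One way to approach the range question above is to generate the seed list programmatically, since Nutch reads its seeds as plain-text files with one URL per line. The following is a minimal sketch under stated assumptions: the class name `SeedRange`, the placeholder prefix `http://www.example.com/id=`, and the output path `urls/seed.txt` are all hypothetical and not taken from the question.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper: expands a numeric id range into a list of seed URLs.
final class SeedRange {

    /** Builds one URL per id in [first, last] by appending the id to the prefix. */
    static List<String> build(String prefix, int first, int last) {
        if (first > last) {
            throw new IllegalArgumentException("first must not exceed last");
        }
        List<String> urls = new ArrayList<>(last - first + 1);
        for (int id = first; id <= last; id++) {
            urls.add(prefix + id);
        }
        return urls;
    }

    /** Writes the URLs to a seed file, one per line, creating the parent directory if needed. */
    static void write(Path seedFile, List<String> urls) throws IOException {
        Path parent = seedFile.getParent();
        if (parent != null) {
            Files.createDirectories(parent);
        }
        Files.write(seedFile, urls, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) throws IOException {
        // The prefix and the output path are placeholders for illustration only.
        List<String> urls = build("http://www.example.com/id=", 1, 1000);
        write(Paths.get("urls", "seed.txt"), urls);
    }
}
```

Note that the default URL filter rules shipped with Nutch 1.x skip URLs containing characters such as `?` and `=`, so that rule would probably need to be relaxed for URLs of this shape to be fetched; it is worth checking the filter file in your own installation.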
I have to crawl a site that lists items based on user input submitted through HTTP POST. How do I configure HTTP POST submission details in Nutch?
I got help on how to do HttpPostAuthentication, but no help on how to submit POST data other than a username and password.
...
Can anyone tell me how to implement a spell checker in Nutch 1.0?
...
Is there a way to get Nutch to crawl pages that are updated frequently more often?
E.g. index pages and feeds.
It would also be valuable to refresh new pages that contain comments more frequently in the first days after the page was created. Any tips are appreciated.
...
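The idea behind the question above, refetching pages that change often more frequently, can be illustrated with a small adaptive-interval calculation: shorten the refetch interval when a page was found changed, lengthen it when it was not, and clamp the result. This is a sketch of the general technique, not Nutch's own code; the class `AdaptiveInterval` and all parameter names are hypothetical. As far as I know, Nutch 1.x ships an `AdaptiveFetchSchedule` that works along similar lines and is selected through the `db.fetch.schedule.class` property, but that should be verified against your version.

```java
// Hypothetical sketch of an adaptive refetch interval; all values are in seconds.
final class AdaptiveInterval {

    /**
     * Returns the next refetch interval.
     *
     * @param current current interval in seconds
     * @param changed whether the page content changed since the last fetch
     * @param decRate fraction by which to shrink the interval on change (e.g. 0.2)
     * @param incRate fraction by which to grow the interval when unchanged (e.g. 0.4)
     * @param min     lower bound for the interval
     * @param max     upper bound for the interval
     */
    static long next(long current, boolean changed,
                     double decRate, double incRate, long min, long max) {
        double factor = changed ? (1.0 - decRate) : (1.0 + incRate);
        long candidate = Math.round(current * factor);
        // Clamp so that hot pages are not hammered and stale pages are not forgotten.
        return Math.max(min, Math.min(max, candidate));
    }

    public static void main(String[] args) {
        long oneDay = 86_400L;
        long oneHour = 3_600L;
        long thirtyDays = 2_592_000L;
        System.out.println(next(oneDay, true, 0.2, 0.4, oneHour, thirtyDays));
        System.out.println(next(oneDay, false, 0.2, 0.4, oneHour, thirtyDays));
    }
}
```

With this scheme a feed or index page that changes on every visit converges towards the minimum interval, while a static page drifts towards the maximum.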
Hi,
When Nutch finishes its cycle (that is, crawl, fetch, parse, index), I do not want Nutch to build the Lucene index during the index phase. Instead, I want Nutch to store all the crawled data (I believe it keeps them as NutchDocument objects) in MySQL using my own code.
Is there any way to do this?
Thanks
...
Hi,
How can I get images in Nutch results?
Could you please explain whether this is possible with images? Or is there another open-source search engine that produces results with images?
Thanks,
Murali
...
Hi,
I recently downloaded the latest version of Nutch (nutch-1.1). While going through its code, I noticed that there is a conf/schema.xml file which defines the schema for the Solr part bundled with Nutch.
This schema.xml has fields for every plugin.
My question is: how do I find out what values a particular plugin is returning? In other words, ...
Hi,
I want Nutch to crawl abc.com, but I want to index only car.abc.com. Links to car.abc.com can appear at any level in abc.com. So, basically, I want Nutch to keep crawling abc.com normally, but index only pages that start with car.abc.com, e.g. car.abc.com/toyota...car.abc.com/honda...
I set the regex-urlfilter.txt to include only car.abc....
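The distinction in this question, crawl everything under abc.com but index only car.abc.com, comes down to applying the host check at indexing time rather than in the URL filter, because a URL filter also removes pages from the crawl itself. A minimal sketch of such a check follows; the class `IndexScope` and the method `shouldIndex` are hypothetical names, and wiring the check into Nutch (for example inside a custom indexing filter plugin that skips non-matching documents) is an assumption that should be checked against the plugin API of your Nutch version.

```java
import java.net.URI;
import java.net.URISyntaxException;
import java.util.Locale;

// Hypothetical helper: decides whether a fetched page belongs in the index.
final class IndexScope {

    /** Returns true only when the URL's host equals the allowed host (case-insensitive). */
    static boolean shouldIndex(String url, String allowedHost) {
        try {
            String host = new URI(url).getHost();
            return host != null
                    && host.toLowerCase(Locale.ROOT).equals(allowedHost.toLowerCase(Locale.ROOT));
        } catch (URISyntaxException e) {
            // Unparseable URLs are left out of the index.
            return false;
        }
    }

    public static void main(String[] args) {
        System.out.println(shouldIndex("http://car.abc.com/toyota", "car.abc.com")); // true
        System.out.println(shouldIndex("http://abc.com/about", "car.abc.com"));      // false
    }
}
```

Comparing the parsed host, rather than matching a substring of the whole URL, avoids accepting pages such as `http://abc.com/redirect?to=car.abc.com`.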
Hi,
I have looked at how Nutch and Heritrix crawl. They both have the concept of generate/fetch/update cycles, which start with some seed URLs and iterate over the resulting URLs after the fetching step.
The scoping/filtering logic works on regular expressions applied to the extracted URLs.
I want to do something very specific.
I don't want to...
Hi,
I want to select one of the above for building a crawling framework for specific websites. This is not an internet-wide crawl. I am not building a search index; rather, I am interested in scraping specific pages from the websites.
Could somebody please detail the pros and cons of the above?
Thanks
Nayn
...
Hi,
I came across an open-source crawler, Bixo.
Has anyone tried it? Could you please share what you learned? Could we build a directed crawler with reasonable ease (compared to Nutch/Heritrix)?
Thanks
Nayn
...
Hi,
I am trying to write my own version of Crawl.java from Nutch, where I'd do things a little differently. I don't want to work with the Nutch source code. I just want to cleanly import a few jars and get going with my application. How should I provide conf/crawl-urlfilter.txt and the other required conf files?
Could someone help me here?
Thanks
...
I am having a problem implementing this suggestion in my Bangla search engine.
Could anyone kindly help me out?
...
I'm trying to integrate a Nutch + Solr based search engine into my Etherpad installation. The main issue I'm having is that Nutch doesn't support POST authentication. Etherpad and Nutch are installed on the same machine, so an obvious solution would be to find a way to bypass authentication for localhost.
This is where I'm stuck. I don't ...
In Nutch I'm using Solr as a search server.
I would like to perform a query like
(hillary AND clinton) OR (barack AND obama) OR (..)
How can I do it?
For me, a single OR query works, like
india OR pakistan OR china
query.AddNotRequiredTerm(term);
A single AND query also works:
india AND pakistan AND china
query.AddRequiredTerm(term);
But...
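Since Solr is the search server here, one option is to compose the nested query as a plain string in the standard Lucene/Solr query syntax, which supports parentheses with `AND` and `OR`, and send it as the `q` parameter. The sketch below only builds that string; the class `BooleanQueryBuilder` and the method `anyOfAll` are hypothetical names, and the terms are assumed to be simple words with no special characters that would need escaping.

```java
import java.util.Arrays;
import java.util.List;
import java.util.stream.Collectors;

// Hypothetical helper: builds "(a AND b) OR (c AND d)" style query strings.
final class BooleanQueryBuilder {

    /** Joins the terms of each group with AND, then joins the groups with OR. */
    static String anyOfAll(List<List<String>> groups) {
        return groups.stream()
                .map(group -> "(" + String.join(" AND ", group) + ")")
                .collect(Collectors.joining(" OR "));
    }

    public static void main(String[] args) {
        String q = anyOfAll(Arrays.asList(
                Arrays.asList("hillary", "clinton"),
                Arrays.asList("barack", "obama")));
        System.out.println(q); // (hillary AND clinton) OR (barack AND obama)
    }
}
```

The operators must stay upper-case, because the standard query parser treats lower-case `and` and `or` as ordinary search terms.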