I have set up the Nutch search engine to crawl websites.
Now I need to write a PHP API to talk to the Nutch search engine.
I need to do two things:
1. Using a PHP script, I need to tell Nutch which URLs to crawl
(for this I have some pointers from http://www.cs.sjsu.edu/faculty/pollett/masters/Semesters/Fall07/sheetal/?Deliv...
Hi,
I am using Nutch to crawl websites and, strangely, for one of my websites the Nutch crawl returns only two URLs: the home page URL (http://mysite.com/) and one other.
The URLs on my website are basically of this format:
http://mysite.com/index.php?main%5Fpage=index&params=12
http://mysite.com/index.php?main%5Fpage=index&cat...
Hello
I am using Nutch 1.0 and I am getting this log entry:
2009-11-12 22:13:11,093 INFO httpclient.HttpMethodDirector - Redirect requested but followRedirects is disabled.
How do I enable following redirects?
Thanks in advance.
...
I am using the Java-based Nutch web-search software. To prevent duplicate URLs from being returned in my search results, I am trying to remove (that is, normalize away) the 'jsessionid' expressions from the URLs being indexed when running the Nutch crawler to index my intranet. However, my modifications to $NUTCH_HOME/...
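To make the normalization described above concrete, here is a minimal sketch in plain Python, not Nutch configuration. It assumes the session ID appears in the path-parameter form ";jsessionid=<token>" and ends at the next "?" or "#"; the function name strip_jsessionid and the sample URL are hypothetical.

```python
import re

# Assumption: the session ID is a path parameter of the form
# ";jsessionid=<token>", terminated by "?", "#", or the end of the URL.
JSESSIONID = re.compile(r";jsessionid=[^?#]*", re.IGNORECASE)


def strip_jsessionid(url: str) -> str:
    """Return the URL with any ';jsessionid=...' path parameter removed."""
    return JSESSIONID.sub("", url)


if __name__ == "__main__":
    # Hypothetical intranet URL used only for illustration.
    print(strip_jsessionid("http://intranet.example/page.jsp;jsessionid=ABC123?x=1"))
```

Two URLs that differ only in their session ID collapse to the same string after this step, which is what removes the duplicates.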
My Nutch directory is /home/myserv/nutch/nutch-1.0/
My PHP application is in the directory /home/myserv/www/
There is a PHP file in my /home/myserv/www/ directory that uses exec to run a Nutch command. The PHP code is like this:
$output = exec("bin/nutch all");
When I run the command from the command line I need to be in th...
Does Nutch crawl automatically when I add new pages to the website?
...
What is the best practice, and which library can I use, to enter text into a search box on an external website and collect the search results?
How do I tackle websites with different search boxes and checkboxes and collect the results?
Can Selenium be used to automate this?
Should I use Heritrix or Nutch? Which one is better? I heard Nutch comes with plugins. Whi...
Hi all,
I am looking to develop a management and administration solution around our web-crawling Perl scripts. Basically, right now our scripts are saved in SVN and are manually kicked off by sysadmins, devs, etc. Every time we need to retrieve data from new sources we have to create a ticket with business instructions and goals. As you can...
Hi,
I want to create a custom field in the Nutch search engine.
What steps should I follow?
...
I am searching for a web crawler solution which is mature enough and can be easily extended. I am interested in the following features, or the possibility of extending the crawler to meet them:
partly just to read the feeds of several sites
to scrape the content of these sites
if the site has an archive I would like to crawl and index it as we...
Which is the best open source search engine to process more than 5 billion data items?
1) Nutch
2) Solr
...
Hello,
I want to inject a single URL into the crawldb as a string, not a urlDir.
I'm thinking of adding a modified version of the Injector.inject method that takes the URL as a string parameter, but I can't inject the string URL into the crawldb; I guess the current injector uses the fileInput.. from Hadoop.
How can I do this?
and I test to crawl url...
Hi. I've secured a Solr instance using BASIC auth, roughly as shown here:
http://blog.comtaste.com/2009/02/securing_your_solr_server_on_t.html
Now I'm trying to update my batch processes to push data into the authenticated instance. The ones using "curl" are easy, but I also have a Nutch crawl that uses the "solrindex" command to...
On some of the sites I want to index with Nutch there are only specific types of pages I would like to be searchable. I need a way to be able to crawl these sites, but only index pages that match a certain regular expression.
ex:
www.example.com/browse/ finds links in the form of www.example.com/items/1234.html and www.example.com/item...
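The "crawl everything, index only matching pages" idea can be sketched as a simple predicate. This is plain Python rather than a Nutch plugin or configuration, and the pattern is an assumption based on the www.example.com/items/1234.html form quoted above; the name should_index is hypothetical.

```python
import re

# Assumption: only item pages such as www.example.com/items/1234.html
# should become searchable; the scheme prefix is optional.
INDEXABLE = re.compile(r"^(https?://)?www\.example\.com/items/\d+\.html$")


def should_index(url: str) -> bool:
    """Return True if the URL matches the pattern for pages to index."""
    return INDEXABLE.match(url) is not None


if __name__ == "__main__":
    for url in ("www.example.com/browse/", "www.example.com/items/1234.html"):
        print(url, should_index(url))
```

Browse pages would still be fetched so that their outlinks are discovered, but a check like this would keep them out of the index.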
I've followed the instructions from http://www.lucidimagination.com/blog/2009/03/09/nutch-solr/
I had Solr up and running before that and could handle test cases, access the admin pages, etc.
I copied the Nutch schema.xml over to Solr as per the instructions. That worked, and I could still access the admin pages.
When I added in the requesthandler snippet (see 5d on the websi...
How can I handle this?
...
Hello,
I am trying to configure the Nutch 1.0 search engine. First, I need to integrate a spell checker or something like it; is there a plugin available?
My next question is how to exclude an HTML tag like "", so that navigation is not part of the index.
Thanks for all answers.
...
I am looking for an open source, full-featured web search engine like Nutch, because Nutch is complex, it takes a lot of time to understand its code, and I didn't find a book about it.
...
Hi everyone!
I am stuck! I can't get Nutch to crawl in small batches. I start it with the bin/nutch crawl command with the parameters -depth 7 and -topN 10000, and it never ends. It ends only when my HDD has no free space left. What I need to do:
Start crawling my seeds, with the possibility of going further on outlinks.
Crawl 20000 pages, then index them.
C...
I can successfully run the crawl command via Cygwin on Windows XP, and I can also run web searches using Tomcat.
But I also want to save the parsed pages during the crawl.
So when I start crawling like this:
bin/nutch crawl urls -dir crawled -depth 3
I also want to save the parsed HTML files to text files.
I mean during this period which ...