nutch

Has anyone worked with a PHP API to read Nutch search engine crawl results?

I have set up the Nutch search engine to crawl websites. Now, I need to write a PHP API to talk to Nutch. I need to do two things: 1. Using a PHP script, I need to tell Nutch which URLs to crawl (for this I have some pointers from http://www.cs.sjsu.edu/faculty/pollett/masters/Semesters/Fall07/sheetal/?Deliv...
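For the first part, a minimal sketch, assuming a hypothetical install under /opt/nutch-1.0 and that the crawl is driven by writing a seed file and shelling out to bin/nutch (all paths and the -depth/-topN values are assumptions):

    <?php
    // Sketch: tell Nutch which URLs to crawl from PHP by writing a seed list
    // and invoking the one-shot crawl command. Paths are assumptions.
    $nutchHome = '/opt/nutch-1.0';            // hypothetical install directory
    $seedDir   = $nutchHome . '/urls';        // Nutch reads a directory of seed files

    if (!is_dir($seedDir)) {
        mkdir($seedDir, 0755, true);
    }
    file_put_contents($seedDir . '/seed.txt', "http://www.example.com/\n");

    // Run the crawl from the Nutch home directory and capture its output.
    $cmd = 'cd ' . escapeshellarg($nutchHome)
         . ' && bin/nutch crawl urls -dir crawl -depth 3 -topN 50 2>&1';
    exec($cmd, $output, $status);
    echo implode("\n", $output);

For reading results back, the search web app bundled with Nutch 1.0 also exposes hits over HTTP (OpenSearch/RSS), which a PHP script can fetch and parse.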

Nutch issues with crawling a website where the URLs differ only in terms of the parameters passed

Hi, I am using Nutch to crawl websites, and strangely, for one of my websites the Nutch crawl returns only two URLs: the home page URL (http://mysite.com/) and one other. The URLs on my website are basically of this format: http://mysite.com/index.php?main%5Fpage=index&params=12 http://mysite.com/index.php?main%5Fpage=index&cat...
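One thing worth checking (an assumed cause, not a confirmed diagnosis): the default URL filter that ships with Nutch skips URLs containing characters that look like query strings, which would drop links such as index.php?main_page=... The stock line in conf/crawl-urlfilter.txt (and conf/regex-urlfilter.txt) is:

    # skip URLs containing certain characters as probable queries, etc.
    -[?*!@=]

Commenting that line out, or narrowing it to something like -[*!@], lets URLs that differ only in their query parameters through the filter.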

How to enable redirect following in Nutch-1.0

Hello, I am using Nutch-1.0 and I am getting this log entry: 2009-11-12 22:13:11,093 INFO httpclient.HttpMethodDirector - Redirect requested but followRedirects is disabled. How do I enable redirect following? Thanks in advance. ...
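In Nutch the usual knob is the http.redirect.max property rather than HttpClient's followRedirects flag; with the default of 0, redirects are not followed immediately but recorded for a later fetch round. A sketch of an override in conf/nutch-site.xml (the value 3 is only an example):

    <property>
      <name>http.redirect.max</name>
      <!-- 0 (default) = record redirects for a later round; >0 = follow up to N redirects immediately -->
      <value>3</value>
    </property>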

configuring nutch regex-normalize.xml

I am using the Java-based Nutch web-search software. In order to prevent duplicate (URL) results from being returned in my search query results, I am trying to remove (a.k.a. normalize) the 'jsessionid' expressions from the URLs being indexed when running the Nutch crawler to index my intranet. However, my modifications to $NUTCH_HOME/...
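For reference, rules in conf/regex-normalize.xml are pattern/substitution pairs applied by the urlnormalizer-regex plugin (which must be listed in plugin.includes). A minimal sketch of a jsessionid rule; the pattern below is illustrative, not the stock one:

    <!-- inside <regex-normalize> in conf/regex-normalize.xml -->
    <regex>
      <!-- strip ";jsessionid=..." from the path (illustrative pattern) -->
      <pattern>;jsessionid=[A-Za-z0-9]+</pattern>
      <substitution></substitution>
    </regex>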

Problem with running the Nutch command from PHP exec()

My Nutch directory lies in /home/myserv/nutch/nutch-1.0/. My PHP application is in the directory /home/myserv/www/. There's a PHP file in my /home/myserv/www/ directory that runs an exec command to run a Nutch command. The PHP code is like: $output = exec("bin/nutch all"); When I run the command from the command line I need to be in th...
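A common fix (a sketch using the paths from the question; "bin/nutch all" is kept as posted) is to change into the Nutch directory inside the exec call, or to call the script by its absolute path, and to capture stderr so failures are visible:

    <?php
    $nutchHome = '/home/myserv/nutch/nutch-1.0';

    // Option 1: cd into the Nutch directory for the duration of the command.
    $cmd = 'cd ' . escapeshellarg($nutchHome) . ' && bin/nutch all 2>&1';
    $output = array();
    exec($cmd, $output, $exitCode);

    // Option 2: call the script by its absolute path instead.
    // exec(escapeshellarg($nutchHome . '/bin/nutch') . ' all 2>&1', $output, $exitCode);

    echo "exit code: $exitCode\n" . implode("\n", $output);

Depending on the account the web server runs under, JAVA_HOME (or NUTCH_JAVA_HOME) may also need to be exported as part of the same command.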

Does Nutch automatically crawl my site when new pages are added?

Does Nutch crawl automatically when I add new pages to the website? ...
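Nutch does not watch a site for new pages on its own; re-crawls have to be scheduled externally. A minimal sketch, assuming a nightly cron entry (paths hypothetical):

    # crontab entry (sketch): re-run the Nutch crawl every night at 02:00
    0 2 * * * cd /opt/nutch-1.0 && bin/nutch crawl urls -dir crawl -depth 3 -topN 1000 >> /var/log/nutch-crawl.log 2>&1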

crawler to get external website search results

What is the best practice and which library can I use to type into the search textbox on an external website and collect the search results? How do I tackle websites with different search boxes and checkboxes and collect the results? Can Selenium be used to automate this? Should I use Heritrix or Nutch? Which one is better? I heard Nutch comes with plugins. Whi...

Crawling engine architecture - Java/ Perl integration

Hi all, I am looking to develop a management and administration solution around our web-crawling Perl scripts. Basically, right now our scripts are saved in SVN and are manually kicked off by SysAdmins/devs etc. Every time we need to retrieve data from new sources we have to create a ticket with business instructions and goals. As you can...
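If the Java side mainly needs to kick off the existing Perl scripts and capture their output, a minimal sketch (script path and arguments are hypothetical) could use ProcessBuilder:

    import java.io.BufferedReader;
    import java.io.InputStreamReader;

    // Sketch: launch an existing Perl crawl script from a Java management layer
    // and stream its output. Script path and arguments are hypothetical.
    public class PerlCrawlLauncher {
        public static void main(String[] args) throws Exception {
            ProcessBuilder pb =
                new ProcessBuilder("perl", "/opt/crawlers/fetch_source.pl", "--source", "example");
            pb.redirectErrorStream(true);           // merge stderr into stdout
            Process p = pb.start();

            BufferedReader out = new BufferedReader(new InputStreamReader(p.getInputStream()));
            String line;
            while ((line = out.readLine()) != null) {
                System.out.println(line);           // in practice: log/store per run
            }
            System.out.println("crawler exited with code " + p.waitFor());
        }
    }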

how to create a custom field in the nutch search engine?

Hi, I want to create a custom field in the Nutch search engine. What are the steps I should follow? ...
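In Nutch 1.0 a custom field is normally added by writing an indexing-filter plugin (plus the plugin.xml/plugin.includes wiring). A rough sketch, assuming the Nutch 1.0 IndexingFilter interface and a hypothetical field name "mysite"; the exact interface methods can differ slightly between versions, so check IndexingFilter in your source tree:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.io.Text;
    import org.apache.nutch.crawl.CrawlDatum;
    import org.apache.nutch.crawl.Inlinks;
    import org.apache.nutch.indexer.IndexingException;
    import org.apache.nutch.indexer.IndexingFilter;
    import org.apache.nutch.indexer.NutchDocument;
    import org.apache.nutch.parse.Parse;

    // Sketch of an indexing filter that adds a custom "mysite" field.
    public class MyFieldIndexingFilter implements IndexingFilter {
        private Configuration conf;

        public NutchDocument filter(NutchDocument doc, Parse parse, Text url,
                                    CrawlDatum datum, Inlinks inlinks) throws IndexingException {
            // hypothetical logic: store the page URL in the custom field
            doc.add("mysite", url.toString());
            return doc;
        }

        // present in some 1.x versions of the interface; a no-op here
        public void addIndexBackendOptions(Configuration conf) { }

        public void setConf(Configuration conf) { this.conf = conf; }
        public Configuration getConf() { return conf; }
    }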

Does any open, simply extensible web crawler exist?

I am searching for a web crawler solution which is mature enough and can be simply extended. I am interested in the following features... or the possibility to extend the crawler to meet them: partly just to read the feeds of several sites; to scrape the content of these sites; if a site has an archive I would like to crawl and index it as we...

Which is the best open source search engine

Which is the best open source search engine to process more than 5 billion data records? 1) Nutch 2) Solr ...

Inject and index a single URL with Nutch

Hello; I want to inject a single URL into the crawldb as a string, not a urlDir. I'm thinking of adding a modified version of Injector.inject that takes the URL as a string parameter, but I can't inject the string URL into the crawldb; I guess the current injector uses the fileInput.. from Hadoop. How can I do this? And I tried to crawl url...
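Since the stock Injector.inject runs a MapReduce job over a directory of seed files, one workaround that avoids modifying Nutch is to write the single URL string into a temporary seed directory and hand that to the unmodified injector. A sketch, assuming local paths and the Nutch 1.0 API:

    import java.io.File;
    import java.io.FileWriter;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.nutch.crawl.Injector;
    import org.apache.nutch.util.NutchConfiguration;

    // Sketch: inject a single URL (given as a string) by writing it to a temp
    // seed directory and calling the stock Injector. Paths are assumptions.
    public class SingleUrlInjector {
        public static void main(String[] args) throws Exception {
            String url = "http://www.example.com/";

            File seedDir = new File("/tmp/nutch-single-seed");
            seedDir.mkdirs();
            FileWriter w = new FileWriter(new File(seedDir, "seed.txt"));
            w.write(url + "\n");
            w.close();

            Configuration conf = NutchConfiguration.create();
            new Injector(conf).inject(new Path("crawl/crawldb"),
                                      new Path(seedDir.getAbsolutePath()));
        }
    }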

posting nutch data into a BASIC auth secured Solr instance

Hi. I've secured a Solr instance using BASIC auth, kind of how it is shown here: http://blog.comtaste.com/2009/02/securing_your_solr_server_on_t.html Now I'm trying to update my batch processes to push data into the authenticated instance. The ones using "curl" are easy, but I also have a Nutch crawl that uses the "solrindex" command to...
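The solrindex job builds its Solr client internally, so there is no configuration switch for credentials. One approach (a sketch, assuming the SolrJ CommonsHttpSolrServer of that era and hypothetical credentials) is to construct the server with an HttpClient that carries preemptive BASIC auth, in the place where the indexer creates its client:

    import org.apache.commons.httpclient.HttpClient;
    import org.apache.commons.httpclient.UsernamePasswordCredentials;
    import org.apache.commons.httpclient.auth.AuthScope;
    import org.apache.solr.client.solrj.impl.CommonsHttpSolrServer;

    // Sketch: a SolrJ server whose HttpClient sends BASIC auth credentials.
    // URL, username and password are hypothetical.
    public class AuthSolrServerFactory {
        public static CommonsHttpSolrServer create(String solrUrl) throws Exception {
            HttpClient client = new HttpClient();
            client.getState().setCredentials(
                    AuthScope.ANY,
                    new UsernamePasswordCredentials("solradmin", "secret"));
            client.getParams().setAuthenticationPreemptive(true);
            return new CommonsHttpSolrServer(solrUrl, client);
        }
    }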

Crawling a site, but only indexing certain pages with Nutch

On some of the sites I want to index with Nutch there are only specific types of pages I would like to be searchable. I need a way to be able to crawl these sites, but only index pages that match a certain regular expression. ex: www.example.com/browse/ finds links in the form of www.example.com/items/1234.html and www.example.com/item...
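The crawl side can be handled with URL-filter entries that let both the browse pages (so their links keep getting discovered) and the item pages through; a sketch for conf/crawl-urlfilter.txt or conf/regex-urlfilter.txt, with patterns assumed from the URLs in the question:

    # allow the browse pages so their links are discovered
    +^http://www\.example\.com/browse/
    # allow the item pages that should end up in the index
    +^http://www\.example\.com/items/.*\.html$
    # skip everything else on this site
    -.

Keeping the browse pages out of the index is then a separate step; one option is a small indexing-filter plugin that returns null for URLs not matching the item pattern, since a filter that returns null drops the page from the index.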

solr admin gives 404 errors after integrating nutch

I've followed the instructions from http://www.lucidimagination.com/blog/2009/03/09/nutch-solr/ Had solr up and running before that, could handle test cases, access admin pages, etc. Copied the nutch schema.xml over to solr as per instructions. Worked, could access admin. When I added in the requesthandler snippet (see 5d on the websi...

Number of connections to the host at the same time

How can I handle this? ...
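Assuming this is about the Nutch fetcher, the number of simultaneous connections to one host is controlled by fetcher.threads.per.host (with the overall thread count in fetcher.threads.fetch). A sketch of an override in conf/nutch-site.xml:

    <property>
      <name>fetcher.threads.per.host</name>
      <!-- maximum number of simultaneous fetches against any one host (default 1) -->
      <value>1</value>
    </property>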

nutch spell checker | nutch navigation filter

Hello, I am trying to configure the Nutch 1.0 search engine. First I need to integrate a spell checker or something like this; is there a plugin available? My next question is: how do I rule out HTML tags like "", so that navigation is not part of the index? Thanks for all answers ...

looking for nutch alternative

I am looking for an open source, full-featured web search engine like Nutch, because Nutch is complex, it takes much time to get into its code, and I didn't find a book about it. ...

Nutch - how to crawl in small batches?

Hi everyone! I am stuck! I can't get Nutch to crawl in small batches for me. I start it with the bin/nutch crawl command with parameters -depth 7 and -topN 10000, and it never ends; it ends only when my HDD runs out of space. What I need to do: start crawling my seeds with the possibility to go further on outlinks; crawl 20000 pages, then index them. C...
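Instead of the one-shot crawl command, the usual route is to run the individual steps yourself so that each cycle fetches a fixed batch. A sketch with the standard bin/nutch commands (directory names and the -topN 20000 value are assumed from the question):

    # one-time: inject the seed URLs
    bin/nutch inject crawl/crawldb urls

    # repeat per batch: generate ~20000 URLs, fetch them, update the crawldb
    bin/nutch generate crawl/crawldb crawl/segments -topN 20000
    segment=crawl/segments/$(ls crawl/segments | sort | tail -1)
    bin/nutch fetch $segment
    bin/nutch updatedb crawl/crawldb $segment

    # when enough batches are done: build the linkdb and (re)index
    bin/nutch invertlinks crawl/linkdb -dir crawl/segments
    bin/nutch index crawl/indexes crawl/crawldb crawl/linkdb crawl/segments/*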

how to parse (only text) web sites while crawling

I can successfully run the crawl command via Cygwin on Windows XP, and I can also do web search using Tomcat. But I also want to save the parsed pages during the crawl, so when I start crawling like this: bin/nutch crawl urls -dir crawled -depth 3 I also want to save the parsed HTML files as text files, I mean during this period which ...
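The parsed text is already stored in each segment's parse_text directory, so one option (a sketch; the segment name is hypothetical and the layout follows -dir crawled from the question) is to dump it to plain text after the crawl with the segment reader:

    # dump only the parsed text of one segment into a plain-text dump directory
    bin/nutch readseg -dump crawled/segments/20091112123456 dump_dir \
        -nocontent -nofetch -nogenerate -noparse -noparsedata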