nutch

Java Lucene integration with .Net

I've got Nutch and Lucene set up to crawl and index some sites, and I'd like to use a .NET website instead of the JSP site that comes with Nutch. Can anyone recommend some solutions? I've seen solutions where there was an app running on the index server which the .NET site used remoting to connect to. Speed is a consideration, obviously ...

Using Nutch crawler with Solr

Can I integrate the Apache Nutch crawler with the Solr index server? Edit: One of our devs came up with a solution from these posts: "Running Nutch and Solr" and "Update for Running Nutch and Solr". Answer: Yes ...

How do we create a simple search engine using Lucene, Solr or Nutch?

Our company has thousands of PDF documents. How do we create a simple search engine using Lucene, Solr or Nutch? We'll provide a basic Java/JSP web page where people can type in words and perform basic AND/OR queries, then show them the document links of all matching PDFs. ...

Performance Benchmarking for Apache Nutch

I want to know if there are any existing benchmarks and sizing information for an Apache Nutch-based search engine deployment. For, say, 10 million searches a month, what hardware sizing would need to be deployed? ...

Apache Nutch on Windows

Has anyone tried to install Nutch on Windows? I'm following this installation guide: http://zillionics.com/resources/articles/NutchGuideForDummies.htm After a few bumps I'm stuck trying to run the crawler. It gives me this error: bin/nutch: line 15: syntax error near unexpected token '$'in\r'' 'in/nutch: line 15: 'case "'uname'" in A...

Problem running Java .war on Tomcat

I am following the tutorial here: http://nutch.sourceforge.net/docs/en/tutorial.html Crawling works fine, as does the test search from the command line. When I try to fire up Tomcat after moving ROOT.war into place (and it unarchiving and creating a new ROOT folder during startup), I get a page with the 500 error and some errors in...

What is the best way to freshen a Nutch index?

I haven't looked at Nutch for a year or so and it looks like it has changed significantly. The documentation on re-crawling isn't clear. What is the best way to update an existing Nutch index? ...

Parsing html data with nutch 1.0 and a custom plugin

I am currently trying to write a custom plugin for Nutch 1.0. This plugin is supposed to parse HTML data and filter out relevant information from documents. I have a basic plugin working; it extends the HtmlParserResult object and is executed each time I do a parse. My problems are twofold at the moment: I do not understand the wor...

How is an aggregator built?

Let's say I want to aggregate information related to a specific niche from many sources (could be travel, technology, or whatever). How would I do that? Have a spider/crawler that will crawl the web to find the information I need (how would I tell the crawler what to crawl, since I don't want to get the whole web)? Then have an ind...

How to do an OR search in Nutch?

Say, search for results whose field is 'A' or 'B'? It seems the default is AND. ...

How to make Nutch crawl the file system?

Not over HTTP, like http://localhost:81 and so on, but directly crawling a certain directory on the local file system. Is there any way to do this? ...

Nutch search always returns 0 results

I have set up Nutch 1.0 on a cluster, and it has crawled successfully. I copied the crawl directory using dfs -copyToLocal and set the value of searcher.dir in the nutch-site.xml file located in the Tomcat directory to point to that directory. Still, when I try to search I receive 0 results. Any help would be greatly a...

Nutch Multithreading

Hi, I am trying to configure Nutch to run multi-threaded crawling. However, I am facing an issue: I am not able to run a crawl with multiple threads. I have modified nutch-site.xml to use 25 threads, but I can still see only 1 thread running. <property> <name>fetcher.threads.fetch</name> <value>25</value> <description>Th...

Nutch field problem

I was using something like: Field notdirectory = new Field("notdirectory","1", Field.Store.NO, Field.Index.UN_TOKENIZED); and queries like "notdirectory:1" can be processed quite well all the time. But recently I've changed the "Field.Store.NO, Field.Index.UN_TOKENIZED" to index a non-numeric string: Field stateField = new Field("st...

RSS feeds in Nutch

Hi, I am a newbie to Nutch. I want to know whether there is any way to crawl an RSS feed and then customize the parsed data so that the index can have different fields from the RSS. For example, suppose the RSS feed has a "source" field in each item; I want to index this field. Thanks, vibs ...

Nutch plugin development

The Nutch wiki has instructions on how to build Nutch plugins, but only if you download the entire Nutch source tree and put your plugin in there, below $NUTCH_HOME/src/plugin. I don't want my source code mixed in their Subversion tree; I want it in my src/com/xcski git repository. And I shouldn't have to download the source code for Nutch just...

Why doesn't Nutch seem to know about "Last-Modified"?

I set up Nutch with a db.fetch.interval.default of 60000 so that I can crawl every day. If I don't, it won't even look at my site when I crawl the next day. But when I do crawl the next day, every page that it fetched yesterday gets fetched with a 200 response code, indicating that it's not using the previous day's date in the "If-Modif...

How do Nutch plugins work?

I am new to Nutch, but I know Nutch uses Lucene for indexing, which only understands text format. Nutch has many plugins, each used for crawling the particular format that plugin is meant for. My question is how the Nutch plugins actually work. I have seen the team wiki page for Nutch; I want some information like how actually nutch w...

Crawling files using the HTTP protocol

Hi, I have a question about crawling files that are accessible via HTTP. I am talking about PDF files. I am not able to do it using Nutch 1.0. The protocol I am using is similar to this: http://www.ontla.on.ca/library/repository/ser/140213/2006/ but I do not see any data fetched. The files generated are 1 KB. But on Local file sys...

How can I crawl PDF files that are served on the internet using Nutch 1.0 over the HTTP protocol?

Hi everyone, I want to know how I can crawl PDF files that are served on the internet using Nutch 1.0 over the HTTP protocol. I am able to do it on local file systems using the file:// protocol, but not with the HTTP protocol ...