Hi, I am trying to write my own version of Crawl.java from Nutch that does things a little differently. I don't want to work with the Nutch source code; I just want to cleanly import a few jars and get going with my application. How should I provide conf/crawl-urlfilter.txt and the other required conf files?

Could someone help me here? Thanks

+1  A: 

One simple way is to package your code in a jar. Be sure to include a main method in one of the classes to start your crawl. Drop that jar file in the lib folder of your Nutch installation. You can then start your crawl with a command like this (assuming that your PATH is set so that the nutch command can be found):

nutch com.xyz.YourCrawlerMain

where "com.xyz.YourCrawlerMain" represents your main class to launch your crawling.

This will launch your crawler with the Nutch classpath correctly set.

For the configuration files, just update them directly in the conf folder of your Nutch installation.
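Since the configuration files are looked up on the classpath, a quick check that they are actually visible can save some debugging. The following is only a minimal sketch: the class name ConfCheck and the list of file names are assumptions for illustration (crawl-urlfilter.txt comes from the question; adjust the list to whatever your Nutch version expects).

```java
import java.net.URL;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical helper, not part of Nutch: reports which configuration
// files cannot be found on the classpath of the given class loader.
public class ConfCheck {

    // Returns the subset of names that the loader cannot resolve.
    public static List<String> missingResources(ClassLoader loader, List<String> names) {
        List<String> missing = new ArrayList<>();
        for (String name : names) {
            URL url = loader.getResource(name);
            if (url == null) {
                missing.add(name);
            }
        }
        return missing;
    }

    public static void main(String[] args) {
        // Assumed list of files; adjust to your Nutch version.
        List<String> required = Arrays.asList(
                "nutch-default.xml", "nutch-site.xml", "crawl-urlfilter.txt");
        List<String> missing = missingResources(ConfCheck.class.getClassLoader(), required);
        if (missing.isEmpty()) {
            System.out.println("All configuration files are visible on the classpath.");
        } else {
            System.out.println("Not found on the classpath: " + missing);
        }
    }
}
```

If a file is reported as missing, the conf folder is most likely not on the classpath of the process running your crawler.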

UPDATE

I'm working on something similar, and I am able to make Nutch work from my app with these settings: set your classpath to include the Nutch folder (so it can find the plugins) and the Nutch/conf folder, and include all jars from Nutch/lib plus nutch.jar from the Nutch folder.
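The classpath described above can be assembled programmatically. This is a rough sketch under stated assumptions, not an official Nutch utility: the class name NutchClasspath is made up, and it simply picks up every *.jar directly in the Nutch folder (which is where nutch.jar is expected) and in its lib folder.

```java
import java.io.File;
import java.io.IOException;
import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

// Hypothetical helper: builds a classpath string from a Nutch installation folder.
public class NutchClasspath {

    public static String build(Path nutchHome) throws IOException {
        List<String> entries = new ArrayList<>();
        entries.add(nutchHome.resolve("conf").toString()); // configuration files
        entries.add(nutchHome.toString());                 // so the plugins folder can be found
        entries.addAll(jarsIn(nutchHome));                 // assumed location of the nutch jar
        entries.addAll(jarsIn(nutchHome.resolve("lib")));  // third-party dependencies
        return String.join(File.pathSeparator, entries);
    }

    // Lists the *.jar files directly inside dir, sorted for a stable order.
    private static List<String> jarsIn(Path dir) throws IOException {
        List<String> jars = new ArrayList<>();
        if (!Files.isDirectory(dir)) {
            return jars;
        }
        try (DirectoryStream<Path> stream = Files.newDirectoryStream(dir, "*.jar")) {
            for (Path jar : stream) {
                jars.add(jar.toString());
            }
        }
        Collections.sort(jars);
        return jars;
    }

    public static void main(String[] args) throws IOException {
        // First argument: path to the Nutch folder (defaults to the current directory).
        Path home = Paths.get(args.length > 0 ? args[0] : ".");
        System.out.println(build(home));
    }
}
```

The printed string can be passed to java -cp together with your own jar, or used as a reference when configuring the build path in an IDE.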

But beware if your app is running in a web container. I had to mess with the classpath to make it work...

Pascal Dimassimo
Nutch is external to my application. I am not trying to run Nutch with my crawl command, and I do not wish to write a full-fledged crawler-indexer. I just want to use individual Nutch components to crawl a particular web site and scrape the content I'm interested in. This way I only have jar dependencies on the individual Nutch components and their plugins. This compiles, but fails to run in Eclipse with the error below: java.lang.RuntimeException: Error in configuring object at org.apache.hadoop.util.ReflectionUtils.setJobConf(ReflectionUtils.java:93)
Nayn
OK, see my update.
Pascal Dimassimo
Hi Pascal, sorry to ask you again about this, but I still could not get it to work. Would it be possible for you to share your Eclipse workspace (just a simple Nutch crawl demo) so that I can get some idea of what I'm missing? My mail id is nayanish[dot]hinge[at]gmail.com
Nayn
I got it to work. It was a minor issue: I had to create a plugins folder, add all the plugin jars to it, and update nutch-site.xml with its location.
Nayn
Glad it worked!
Pascal Dimassimo