Hi, I am interested in web crawling. I have been working with Solr. Does Solr do web crawling, and if not, what are the steps to do it?
Solr does not do web crawling; it is a search server that provides full-text search capabilities and is built on top of Lucene.
If you need to crawl web pages then you have a number of options including:
- Nutch - http://lucene.apache.org/nutch/
- Websphinx - http://www.cs.cmu.edu/~rcm/websphinx/
- JSpider - http://j-spider.sourceforge.net/
- Heritrix - http://crawler.archive.org/
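As for the steps themselves, a crawler basically fetches a page, extracts its links, and queues the ones it has not seen yet. Here is a rough sketch of that loop using only the Python standard library. It is not how any of the tools above work internally; the names `LinkParser`, `extract_links`, `fetch` and `crawl` are made up for illustration, and it ignores robots.txt, politeness delays and retries, which a real crawler needs.

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin, urldefrag
from urllib.request import urlopen


class LinkParser(HTMLParser):
    """Collects the href of every <a> tag in a page."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def extract_links(html, base_url):
    """Return absolute, fragment-free URLs linked from the page."""
    parser = LinkParser()
    parser.feed(html)
    return [urldefrag(urljoin(base_url, href)).url for href in parser.links]


def fetch(url):
    """Download a page as text (hypothetical helper, no robots.txt handling)."""
    with urlopen(url, timeout=10) as response:
        return response.read().decode("utf-8", errors="replace")


def crawl(start_url, fetch_page=fetch, max_pages=10):
    """Breadth-first crawl; returns {url: html} for the pages fetched."""
    queue = deque([start_url])
    seen = {start_url}
    pages = {}
    while queue and len(pages) < max_pages:
        url = queue.popleft()
        try:
            html = fetch_page(url)
        except Exception:
            continue  # skip pages that fail to download
        pages[url] = html
        for link in extract_links(html, url):
            # only follow http(s) links that have not been queued before
            if link.startswith("http") and link not in seen:
                seen.add(link)
                queue.append(link)
    return pages
```

Something like `crawl("http://example.com/", max_pages=20)` would return the raw HTML keyed by URL. The `fetch_page` parameter is there so the download step can be swapped out, for example to try the loop against a dictionary of canned pages.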
If you want to use the search facilities provided by Lucene or Solr, you'll need to build indexes from the web crawl results.
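To give an idea of what building an index from crawl results can look like, here is a small sketch that turns crawled HTML into documents and posts them to Solr. It assumes a reasonably recent Solr that accepts a JSON array of documents on its `/update` handler, and a schema with `id`, `title` and `content` fields; those field names, and the helpers `TextExtractor`, `to_solr_docs` and `post_to_solr`, are assumptions for illustration, so adjust them to your own schema.

```python
import json
import re
from html.parser import HTMLParser
from urllib.request import Request, urlopen


class TextExtractor(HTMLParser):
    """Collects the <title> and the visible text of an HTML page."""

    def __init__(self):
        super().__init__()
        self.title = ""
        self.chunks = []
        self._in_title = False
        self._skip = 0  # nesting depth inside <script>/<style>

    def handle_starttag(self, tag, attrs):
        if tag == "title":
            self._in_title = True
        elif tag in ("script", "style"):
            self._skip += 1

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False
        elif tag in ("script", "style") and self._skip:
            self._skip -= 1

    def handle_data(self, data):
        if self._in_title:
            self.title += data
        elif not self._skip:
            self.chunks.append(data)


def to_solr_docs(pages):
    """Convert {url: html} into a list of documents (assumed field names)."""
    docs = []
    for url, html in pages.items():
        parser = TextExtractor()
        parser.feed(html)
        # collapse runs of whitespace so the stored text is compact
        text = re.sub(r"\s+", " ", " ".join(parser.chunks)).strip()
        docs.append({"id": url, "title": parser.title.strip(), "content": text})
    return docs


def post_to_solr(docs, update_url):
    """POST the documents as JSON to a Solr update handler and commit."""
    request = Request(
        update_url + "?commit=true",
        data=json.dumps(docs).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urlopen(request, timeout=10) as response:
        return response.status
```

With a local Solr on its default port, the call would be along the lines of `post_to_solr(to_solr_docs(pages), "http://localhost:8983/solr/crawl/update")`, where `crawl` is a hypothetical core name. Using the URL as the `id` means re-crawling a page overwrites its old document instead of duplicating it.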
See this also:
http://stackoverflow.com/questions/1580882/lucene-crawler-it-needs-to-build-lucene-index
Definitely Nutch! Nutch also has a basic web front end which will let you query your search results. You might not even need to bother with Solr, depending on your requirements. If you do go with a Nutch/Solr combination, you should be able to take advantage of the recent work done to integrate Solr and Nutch: http://issues.apache.org/jira/browse/NUTCH-442