views:

1821

answers:

6

The crawler needs an extensible architecture that allows changing the internal process, e.g. by implementing new steps (pre-parser, parser, etc.).
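The "new steps" requirement is essentially a pluggable pipeline. A minimal sketch of that idea (all class names here are hypothetical, not from any of the projects mentioned):

```java
import java.util.ArrayList;
import java.util.List;

// Each crawl stage (pre-parse, parse, ...) implements one interface,
// so new steps can be registered without touching the crawler core.
interface CrawlStep {
    String process(String content);
}

class PreParser implements CrawlStep {
    // Illustrative pre-parse step: strip surrounding whitespace.
    public String process(String content) {
        return content.trim();
    }
}

class Parser implements CrawlStep {
    // Illustrative parse step: normalize the content to lower case.
    public String process(String content) {
        return content.toLowerCase();
    }
}

public class Pipeline {
    private final List<CrawlStep> steps = new ArrayList<>();

    public Pipeline add(CrawlStep step) {
        steps.add(step);
        return this;
    }

    // Run the content through every registered step in order.
    public String run(String content) {
        for (CrawlStep step : steps) {
            content = step.process(content);
        }
        return content;
    }

    public static void main(String[] args) {
        Pipeline p = new Pipeline().add(new PreParser()).add(new Parser());
        System.out.println(p.run("  <HTML>Example</HTML>  "));
        // prints: <html>example</html>
    }
}
```

Heritrix's processor chains and Nutch's plugin system are real (and much richer) versions of this pattern.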

I found the Heritrix Project (http://crawler.archive.org/).

But are there other nice projects like it?

A: 

I've recently discovered one called Nutch.

Artem Barger
A: 

If you're not tied down to a platform, I've had very good experiences with Nutch in the past.

It's written in Java and goes hand in hand with the Lucene indexer.

Justin Niessner
+3  A: 

Nutch is the best you can do when it comes to a free crawler. It is built on top of Lucene (in an enterprise-scaled manner) and is supported by a Hadoop back end using MapReduce (similar to Google) for large-scale data querying. Great products! I am currently reading all about Hadoop in the new (not yet released) Hadoop in Action from Manning. If you go this route, I suggest getting onto their technical review team to get an early copy of this title!

These are all Java based. If you are a .NET guy (like me!!) then you might be more interested in Lucene.NET, Nutch.NET, and Hadoop.NET, which are all class-by-class and API-by-API ports to C#.

Andrew Siemer
+1 for Nutch and Hadoop; you can also look at Solr if you are looking for a distributed and scalable solution.
Sumit Ghosh
+1  A: 

http://arachnode.net — 1.2 release, + Lucene.NET

What is arachnode.net? arachnode.net is an open source Web crawler for downloading, indexing and storing Internet content including e-mail addresses, files, hyperlinks, images, and Web pages. Arachnode.net is written in C# using SQL Server 2008.

(not that I'm biased or anything... :))

arachnode dot net
lol, you should be... it's a great product :D
baeltazor
A: 

A web spider, sometimes called a crawler or a robot, plays an important role as an essential piece of infrastructure for every search engine. It automatically discovers and collects resources, especially web pages, from the Internet. With the rapid growth of the Internet, a typical web spider design may not cope with the overwhelming number of web pages. Here is a nice article on this
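The "discovers and collects" behavior described above is, at its core, a graph traversal: follow links outward from a seed while remembering what has already been seen. A toy sketch using an in-memory "web" (the page names and link graph are made up for illustration; a real spider would fetch over HTTP and respect robots.txt):

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

public class Spider {
    // Hypothetical in-memory "web": page -> outgoing links.
    static final Map<String, List<String>> WEB = Map.of(
        "a", List.of("b", "c"),
        "b", List.of("c"),
        "c", List.of("a"));

    // Breadth-first discovery from a seed page; the seen-set
    // prevents revisiting pages reached by more than one link.
    static List<String> crawl(String seed) {
        List<String> visited = new ArrayList<>();
        Deque<String> frontier = new ArrayDeque<>(List.of(seed));
        Set<String> seen = new HashSet<>(List.of(seed));
        while (!frontier.isEmpty()) {
            String page = frontier.poll();
            visited.add(page);  // "collect" the page
            for (String link : WEB.getOrDefault(page, List.of())) {
                if (seen.add(link)) {  // "discover" a new page
                    frontier.add(link);
                }
            }
        }
        return visited;
    }

    public static void main(String[] args) {
        System.out.println(crawl("a"));
        // prints: [a, b, c]
    }
}
```

The scaling problem the answer alludes to is exactly why Nutch distributes this frontier/seen-set bookkeeping across a Hadoop cluster instead of keeping it in one process's memory.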

irshad
A: 

You can always try 80legs, which offers free web crawling as well as more powerful web crawling options. It's not open source, but it is very customizable, with plugin-style 80apps and a web crawling API.

shiondev