Does anybody know a good extensible open source web crawler?

+2 Q:

The crawler needs an extensible architecture that allows changing the internal process, for example by implementing new steps (pre-parser, parser, etc.).
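To make the requirement concrete, here is a minimal sketch of what a pluggable step pipeline could look like. It is not taken from Heritrix, Nutch, or any other crawler; `Page`, `Pipeline`, `strip_comments`, `extract_links`, and `absolutize_links` are hypothetical names invented for illustration, and only the Python standard library is used.

```python
import re
from dataclasses import dataclass, field
from html.parser import HTMLParser
from typing import Callable, List, Optional
from urllib.parse import urljoin


@dataclass
class Page:
    """State handed from one pipeline step to the next (hypothetical type)."""
    url: str
    html: str
    links: List[str] = field(default_factory=list)


# A step is any callable that takes a Page and returns a Page.
Step = Callable[[Page], Page]


def strip_comments(page: Page) -> Page:
    """Example pre-parser step: remove HTML comments from the raw markup."""
    page.html = re.sub(r"<!--.*?-->", "", page.html, flags=re.DOTALL)
    return page


class _LinkCollector(HTMLParser):
    """Collects the href value of every <a> tag it sees."""

    def __init__(self) -> None:
        super().__init__()
        self.links: List[str] = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def extract_links(page: Page) -> Page:
    """Example parser step: fill page.links from the markup."""
    collector = _LinkCollector()
    collector.feed(page.html)
    collector.close()
    page.links = collector.links
    return page


def absolutize_links(page: Page) -> Page:
    """Example post-parser step: resolve relative links against the page URL."""
    page.links = [urljoin(page.url, link) for link in page.links]
    return page


class Pipeline:
    """Ordered list of steps; new steps can be appended or inserted anywhere."""

    def __init__(self) -> None:
        self.steps: List[Step] = []

    def add_step(self, step: Step, index: Optional[int] = None) -> "Pipeline":
        if index is None:
            self.steps.append(step)
        else:
            self.steps.insert(index, step)
        return self

    def run(self, page: Page) -> Page:
        for step in self.steps:
            page = step(page)
        return page


# Start with a parser only, then extend the process without touching it.
pipeline = Pipeline().add_step(extract_links)
pipeline.add_step(strip_comments, index=0)   # new pre-parser step
pipeline.add_step(absolutize_links)          # new post-parser step

result = pipeline.run(
    Page("http://example.com/", '<!-- old --><a href="/new">y</a>')
)
print(result.links)
```

The point of the design is that each step only depends on the shared `Page` object, so a new step can be added or reordered without modifying the existing ones.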

I found the Heritrix Project (http://crawler.archive.org/).

Are there other good projects like it?

I recently discovered one called Nutch.

Artem Barger 2009-06-24 17:32:03

If you're not tied to a particular platform, I've had very good experiences with Nutch in the past.

It's written in Java and goes hand in hand with the Lucene indexer.

Justin Niessner 2009-06-24 17:32:56

+3 A:

Nutch is the best you can do when it comes to a free crawler. It is built on Lucene, scaled up for enterprise use, and runs on a Hadoop back end that uses MapReduce (similar to Google's approach) for large-scale data querying. Great products! I am currently reading all about Hadoop in the new (not yet released) Hadoop in Action from Manning. If you go this route, I suggest getting onto their technical review team to get an early copy of this title!
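For readers unfamiliar with the MapReduce model mentioned above, here is a rough single-process sketch of the idea. It is plain Python, not Hadoop's or Nutch's API; the function names (`map_phase`, `shuffle`, `reduce_phase`, `host_mapper`, `count_reducer`) and the sample URLs are made up for illustration. It counts crawled pages per host.

```python
from collections import defaultdict
from urllib.parse import urlparse


def map_phase(records, mapper):
    """Apply the mapper to every record and yield its (key, value) pairs."""
    for record in records:
        yield from mapper(record)


def shuffle(pairs):
    """Group all values by key, as the framework does between map and reduce."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(groups, reducer):
    """Collapse each key's list of values into a single result."""
    return {key: reducer(key, values) for key, values in groups.items()}


def host_mapper(url):
    """Emit (host, 1) for each crawled URL."""
    yield urlparse(url).netloc, 1


def count_reducer(host, ones):
    """Sum the 1s emitted for a host."""
    return sum(ones)


# Hypothetical sample input: URLs a crawler has already fetched.
crawled = [
    "http://a.example/1",
    "http://a.example/2",
    "http://b.example/",
]

counts = reduce_phase(shuffle(map_phase(crawled, host_mapper)), count_reducer)
print(counts)
```

In a real cluster the map and reduce calls run on many machines and the shuffle moves data between them; the sketch only shows the shape of the computation.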

These are all Java based. If you are a .NET guy (like me!!), you might be more interested in Lucene.NET, Nutch.NET, and Hadoop.NET, which are all class-by-class and API-by-API ports to C#.

Andrew Siemer 2009-06-24 18:00:01

+1 for Nutch and Hadoop. You can also look at Solr if you are looking for a distributed and scalable solution.

Sumit Ghosh 2010-08-12 12:06:55

+1 A:

arachnode.net (http://arachnode.net), release 1.2, plus Lucene.NET

What is arachnode.net? arachnode.net is an open source Web crawler for downloading, indexing and storing Internet content including e-mail addresses, files, hyperlinks, images, and Web pages. Arachnode.net is written in C# using SQL Server 2008.

(not that I'm biased or anything... :))

arachnode dot net 2009-08-29 09:18:42

lol you should be... it's a great product :D

baeltazor 2009-08-29 09:21:24

A web spider, sometimes called a crawler or a robot, is an essential part of the infrastructure of every search engine. It automatically discovers and collects resources, especially web pages, from the Internet. With the rapid growth of the Internet, a typical web spider design may not cope with the overwhelming number of web pages. Here is a nice article on this
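The discover-and-collect loop described above can be sketched as a breadth-first traversal with a frontier queue and a set of already seen URLs. This is a simplified illustration, not the design of any particular spider: `crawl`, `fetch_links`, and the in-memory `FAKE_WEB` graph are hypothetical stand-ins, and no network access is performed.

```python
from collections import deque


def crawl(seed, fetch_links, max_pages=100):
    """Breadth-first crawl starting at seed.

    fetch_links(url) must return the URLs linked from that page.
    Returns the URLs in the order they were visited.
    """
    frontier = deque([seed])   # URLs discovered but not yet visited
    seen = {seed}              # every URL ever queued, to avoid repeats
    order = []
    while frontier and len(order) < max_pages:
        url = frontier.popleft()
        order.append(url)
        for link in fetch_links(url):
            if link not in seen:
                seen.add(link)
                frontier.append(link)
    return order


# Hypothetical link graph standing in for the real web.
FAKE_WEB = {
    "a": ["b", "c"],
    "b": ["a", "d"],
    "c": ["d"],
    "d": [],
}

visited = crawl("a", lambda url: FAKE_WEB.get(url, []))
print(visited)
```

The `max_pages` cap hints at the scaling problem: on the real web the frontier grows much faster than a single machine can drain it, which is why large crawlers distribute the frontier across many workers.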

irshad 2010-08-27 21:20:38

You can always try 80legs, which offers free web crawling or more powerful web crawling options. It's not open source, but is very customizable, with plugin-style 80apps and a web crawling API.

shiondev 2010-09-09 03:21:56
