I need to index a large number of web pages. What good web crawler utilities are there? I'd prefer something that .NET can talk to, but that's not a showstopper.

What I really need is something that I can give a site URL to, and it will follow every link and store the content for indexing.

A: 

Sphider is pretty good. It's PHP, but it might be of some help.

Darryl Hein
+4  A: 

HTTrack -- http://www.httrack.com/ -- is a very good website copier. It works pretty well; I have been using it for a long time.

Nutch is a web crawler (a crawler is the type of program you're looking for) -- http://lucene.apache.org/nutch/ -- which uses Lucene, a top-notch search utility.

anjanb
A: 

I haven't used this yet, but it looks interesting. The author wrote it from scratch and posted how he did it. The code is available for download as well.

ranomore
+1  A: 

Searcharoo.NET contains a spider that crawls and indexes content, and a search engine to use it. You should be able to find your way around the Searcharoo.Indexer.EXE code to trap the content as it's downloaded, and add your own custom code from there...

It's very basic (all the source code is included and is explained in six CodeProject articles, the most recent of which is Searcharoo v6): the spider follows links, imagemaps, and images, obeys ROBOTS directives, and parses some non-HTML file types. It is intended for single websites (not the entire web).
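To give a feel for what a single-site spider like this does, here is a minimal sketch in Python. It is not Searcharoo's code (that is C#), and the names `crawl`, `LinkParser`, `http_fetch`, and `max_pages` are made up for illustration. It assumes pages are UTF-8 HTML, follows only `<a>` and `<area>` (imagemap) links, and skips images and non-HTML file types entirely.

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urldefrag, urljoin, urlparse
from urllib.request import urlopen
from urllib.robotparser import RobotFileParser


class LinkParser(HTMLParser):
    """Collects href targets from <a> and <area> (imagemap) tags."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        href = dict(attrs).get("href")
        if tag in ("a", "area") and href:
            self.links.append(href)


def http_fetch(url):
    """Hypothetical default fetcher: returns the body decoded as UTF-8."""
    with urlopen(url, timeout=10) as response:
        return response.read().decode("utf-8", errors="replace")


def crawl(start_url, fetch=http_fetch, robots_txt=None, max_pages=100):
    """Breadth-first crawl of one site; returns {url: page content}."""
    host = urlparse(start_url).netloc

    # Load robots.txt unless the caller supplied its text directly.
    if robots_txt is None:
        try:
            robots_txt = fetch(urljoin(start_url, "/robots.txt"))
        except Exception:
            robots_txt = ""  # assumption: no robots.txt means no restrictions
    robots = RobotFileParser()
    robots.parse(robots_txt.splitlines())

    pages = {}
    seen = {start_url}
    queue = deque([start_url])
    while queue and len(pages) < max_pages:
        url = queue.popleft()
        if not robots.can_fetch("*", url):
            continue  # obey Disallow rules
        try:
            content = fetch(url)
        except Exception:
            continue  # skip pages that fail to download
        pages[url] = content  # this is where you would hand off to an indexer

        parser = LinkParser()
        parser.feed(content)
        for link in parser.links:
            # Resolve relative links and drop "#fragment" parts.
            absolute = urldefrag(urljoin(url, link)).url
            # Stay on the starting host: single website, not the entire web.
            if urlparse(absolute).netloc == host and absolute not in seen:
                seen.add(absolute)
                queue.append(absolute)
    return pages
```

Calling `crawl("http://example.com/")` would return a dictionary of URL to page text. The `fetch` parameter is there so you can swap in your own downloader (or a fake one for testing) without touching the crawl loop.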

Nutch/Lucene is almost certainly a more robust/commercial-grade solution - but I have not looked at their code. I'm not sure what you want to accomplish, but have you also seen Microsoft Search Server Express?

Disclaimer: I am the author of Searcharoo; just offering it here as an option.

CraigD
+1  A: 

arachnode.net is an open source Web crawler for downloading, indexing and storing Internet content including e-mail addresses, files, hyperlinks, images, and Web pages. Arachnode.net is written in C# using SQL Server 2008.
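As a rough illustration of sorting a downloaded page into some of those categories (e-mail addresses, hyperlinks, images), here is a small Python sketch. It is not arachnode.net's code or API; `ContentExtractor`, `extract`, and the deliberately simple e-mail regex are assumptions made for the example.

```python
import re
from html.parser import HTMLParser

# Simplified pattern; real-world e-mail syntax is broader than this.
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")


class ContentExtractor(HTMLParser):
    """Separates a page into e-mail addresses, hyperlinks, and image URLs."""

    def __init__(self):
        super().__init__()
        self.emails = set()
        self.hyperlinks = []
        self.images = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        href = attrs.get("href")
        if tag == "a" and href:
            if href.lower().startswith("mailto:"):
                # Strip the scheme and any "?subject=..." query part.
                self.emails.add(href[len("mailto:"):].split("?")[0])
            else:
                self.hyperlinks.append(href)
        elif tag == "img" and attrs.get("src"):
            self.images.append(attrs["src"])

    def handle_data(self, data):
        # Pick up addresses that appear in visible text.
        self.emails.update(EMAIL_RE.findall(data))


def extract(html):
    """Returns a dict with the three kinds of content found in html."""
    extractor = ContentExtractor()
    extractor.feed(html)
    extractor.close()
    return {
        "emails": sorted(extractor.emails),
        "hyperlinks": extractor.hyperlinks,
        "images": extractor.images,
    }
```

Each list could then be written to its own table or file, which is roughly the kind of storage split the description above implies.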

http://arachnode.net

arachnode dot net
A: 

I use Mozenda's web scraping software. You could easily have it crawl all of the links and grab all of the information you need, and it's great software for the money.

Amber