Hi All,

I have been thinking about trying to write a simple crawler that would crawl our NPO's websites and content and produce a list of its findings.

Does anybody have any thoughts on how to do this? Where do you point the crawler to get started? How does it send back its findings and still keep crawling? How does it know what it finds, etc., etc.?

Thanks! -Jason

A: 

Use wget to do a recursive web suck, which will dump all the files onto your hard drive; then write another script to go through all the downloaded files and analyze them.

Edit: or maybe curl instead of wget, but I'm not familiar with curl, so I don't know whether it does recursive downloads the way wget does.
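
If you go the wget -r route, the analysis half could be as simple as walking whatever directory wget created for the site and pulling links and text out of each HTML file. A rough Python sketch, assuming the mirror lives in a directory called "mirror" and using a placeholder print in place of real analysis:

import os
from html.parser import HTMLParser

class TextAndLinks(HTMLParser):
    # Collects visible text chunks and href links from one HTML document.
    def __init__(self):
        super().__init__()
        self.text_parts = []
        self.links = []

    def handle_data(self, data):
        if data.strip():
            self.text_parts.append(data.strip())

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)

# "mirror" is whatever directory the recursive wget dumped the site into.
for root, _dirs, files in os.walk("mirror"):
    for name in files:
        if not name.endswith((".html", ".htm")):
            continue
        path = os.path.join(root, name)
        parser = TextAndLinks()
        with open(path, encoding="utf-8", errors="replace") as f:
            parser.feed(f.read())
        # Replace this print with whatever analysis you actually need.
        print(path, len(parser.links), "links,", len(parser.text_parts), "text chunks")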

whatsisname
A: 

You could make a list of words and start a thread for each word, searching for it at Google.
Then each thread creates a new thread for each link it finds in the page.
Each thread should write what it finds to a database. When a thread finishes reading its page, it terminates.
In the end you have a very large database of links.
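
The "write what it finds to a database" part is straightforward with something like SQLite; a minimal sketch (the table name and columns are just an assumption):

import sqlite3

conn = sqlite3.connect("links.db")
conn.execute("""CREATE TABLE IF NOT EXISTS links (
                    url TEXT PRIMARY KEY,
                    found_on TEXT,
                    discovered_at TEXT DEFAULT CURRENT_TIMESTAMP)""")

def record_link(url, found_on):
    # INSERT OR IGNORE keeps duplicate URLs from piling up.
    with conn:
        conn.execute("INSERT OR IGNORE INTO links (url, found_on) VALUES (?, ?)",
                     (url, found_on))

One caveat: a thread per link will explode very quickly; a fixed pool of worker threads pulling from a shared queue (as the next answer describes) scales much better, and a SQLite connection shouldn't be shared across threads without care.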

Gero
+1  A: 

Crawlers are simple in concept.

You get a root page via an HTTP GET, parse it to find URLs, and put them on a queue unless they've been parsed already (so you need a global record of the pages you have already parsed).

You can use the Content-Type header to find out what type of content it is, and limit your crawler to parsing only the HTML types.

You can strip out the HTML tags to get the plain text, which you can run text analysis on (to get tags, etc., the meat of the page). You could even do that on the alt/title attributes of images if you got that advanced.

And in the background you can have a pool of threads eating URLs from the queue and doing the same. You want to limit the number of threads, of course.
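
As a rough illustration of the above in Python (standard library only, single-threaded for brevity; the queue and seen-set are exactly what a pool of worker threads would share, and error handling and politeness are omitted):

import urllib.request
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin

class LinkAndTextParser(HTMLParser):
    # Pulls out absolute link URLs and the plain-text chunks of one page.
    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []
        self.text = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.links.append(urljoin(self.base_url, href))

    def handle_data(self, data):
        if data.strip():
            self.text.append(data.strip())

queue = deque(["http://example.org/"])   # root page(s); placeholder URL
seen = set(queue)                        # global record of pages already handled

while queue:
    url = queue.popleft()
    with urllib.request.urlopen(url) as resp:
        # Only bother parsing HTML responses.
        if "text/html" not in resp.headers.get("Content-Type", ""):
            continue
        parser = LinkAndTextParser(url)
        parser.feed(resp.read().decode("utf-8", errors="replace"))
    # parser.text is the plain text you can run your analysis on.
    for link in parser.links:
        if link not in seen:             # in practice, also check the link stays on your own sites
            seen.add(link)
            queue.append(link)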

JeeBee
+1  A: 

Wikipedia has a good article about web crawlers, covering many of the algorithms and considerations.

However, I wouldn't bother writing my own crawler. It's a lot of work, and since you only need a "simple crawler", I'm thinking all you really need is an off-the-shelf crawler. There are a lot of free and open-source crawlers that will likely do everything you need, with very little work on your part.

Derek Park
+3  A: 

If your NPO's sites are relatively big or complex (having dynamic pages that will effectively create a 'black hole', like a calendar with a 'next day' link), you'd be better off using a real web crawler, like Heritrix.

If the sites total only a few pages, you can get away with just using curl or wget or your own script. Just remember that if they start to get big, or you start making your script more complex, to use a real crawler, or at least look at its source to see what it does and why.

Some issues to handle (there are more); a couple of them are sketched below the list:

  • Black holes (as described)
  • Retries (what if you get a 500?)
  • Redirects
  • Flow control (else you can be a burden on the sites)
  • robots.txt implementation
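
For a couple of those items the standard library already helps. A hedged sketch of robots.txt checking plus a simple retry on 5xx responses (the site, user-agent string, retry count, and delay are arbitrary choices of mine; note that urlopen follows redirects for you by default):

import time
import urllib.error
import urllib.request
import urllib.robotparser

rp = urllib.robotparser.RobotFileParser()
rp.set_url("http://example.org/robots.txt")   # placeholder site
rp.read()

def polite_fetch(url, retries=3, delay=5):
    if not rp.can_fetch("MyNpoCrawler", url):
        return None                       # robots.txt says keep out
    for attempt in range(retries):
        try:
            with urllib.request.urlopen(url) as resp:
                return resp.read()
        except urllib.error.HTTPError as e:
            if e.code >= 500:             # transient server error: wait and retry
                time.sleep(delay)
                continue
            raise                         # 4xx etc.: give up on this URL
    return None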
Vinko Vrsalovic
Can you please provide some insight into dealing with the issues you mention? In particular, black holes?
Shabbyrobe
The usual way out of black holes is to program a configurable limit per domain or per regex-matched URL (i.e., if the URL matches this or the domain is that, move on after 1000 matching pages have been retrieved). Flow control is typically implemented as pages per second per domain (usually crawlers wait more than one second between requests so as to avoid being a burden).
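
A rough sketch of those two ideas in Python (the cap and delay values are arbitrary):

import time
from urllib.parse import urlparse

MAX_PAGES_PER_DOMAIN = 1000
MIN_DELAY_SECONDS = 2.0

pages_per_domain = {}    # domain -> pages fetched so far
last_fetch = {}          # domain -> time of the last request

def allowed_to_fetch(url):
    domain = urlparse(url).netloc
    if pages_per_domain.get(domain, 0) >= MAX_PAGES_PER_DOMAIN:
        return False                      # black-hole guard: give up on this domain
    wait = MIN_DELAY_SECONDS - (time.time() - last_fetch.get(domain, 0))
    if wait > 0:
        time.sleep(wait)                  # flow control: at most one request every few seconds
    pages_per_domain[domain] = pages_per_domain.get(domain, 0) + 1
    last_fetch[domain] = time.time()
    return True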
Vinko Vrsalovic
+23  A: 

You'll be reinventing the wheel, to be sure. But here are the basics:

  • A list of unvisited URLs - seed this with one or more starting pages
  • A list of visited URLs - so you don't go around in circles
  • A set of rules for URLs you're not interested in - so you don't index the whole Internet

Put these in persistent storage, so you can stop and start the crawler without losing state.

The algorithm is:

while(list of unvisited URLs is not empty) {
    take URL from unvisited list and move it to the visited list
    fetch content
    record whatever it is you want to about the content
    if content is HTML {
        parse out URLs from links
        foreach URL {
           if it matches your rules
              and it's not already in either the visited or unvisited list
              add it to the unvisited list
        }
    }
}
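
If it helps to see that concretely, here is a near-literal Python transcription of the pseudocode, with both lists persisted to a JSON file so the crawler can be stopped and restarted (the seed URL, state file name, and matches_rules check are placeholders of my own):

import json
import urllib.error
import urllib.request
from html.parser import HTMLParser
from urllib.parse import urljoin

STATE_FILE = "crawler_state.json"            # persistent storage for both lists

def matches_rules(url):
    # Placeholder rule: stay on one site instead of indexing the whole Internet.
    return url.startswith("http://example.org/")

class LinkParser(HTMLParser):
    def __init__(self, base):
        super().__init__()
        self.base, self.links = base, []
    def handle_starttag(self, tag, attrs):
        href = dict(attrs).get("href")
        if tag == "a" and href:
            self.links.append(urljoin(self.base, href))

try:
    with open(STATE_FILE) as f:              # resume where we left off
        state = json.load(f)
except FileNotFoundError:
    state = {"unvisited": ["http://example.org/"], "visited": []}

while state["unvisited"]:
    url = state["unvisited"].pop()           # take URL from list
    state["visited"].append(url)             # ...and remember we've been here
    try:
        with urllib.request.urlopen(url) as resp:
            content_type = resp.headers.get("Content-Type", "")
            content = resp.read()
    except (urllib.error.URLError, ValueError):
        continue                             # skip URLs that fail to fetch
    # ... record whatever you want about `content` here ...
    if "text/html" in content_type:
        parser = LinkParser(url)
        parser.feed(content.decode("utf-8", errors="replace"))
        for link in parser.links:            # parse out URLs from links
            if (matches_rules(link)
                    and link not in state["visited"]
                    and link not in state["unvisited"]):
                state["unvisited"].append(link)
    with open(STATE_FILE, "w") as f:         # survive stop/start
        json.dump(state, f)

For anything beyond a small site you'd want sets and a real database rather than lists in a JSON file, but the structure is the same.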
slim
+1 Clear, nice, general answer.
gorsky
Great answer, but when you say reinventing the wheel, where exactly are the free open-source web crawler frameworks? Possibly for Java, but I haven't found any for .NET.
Anonymous Type
http://en.wikipedia.org/wiki/Web_crawler#Open-source_crawlers
slim
Ugh, hit enter too soon. That link has a good few, none of which is .NET. However, I don't really understand why you'd choose to restrict yourself to .NET.
slim
A: 

arachnode.net is an open source Web crawler for downloading, indexing and storing Internet content including e-mail addresses, files, hyperlinks, images, and Web pages. Arachnode.net is written in C# using SQL Server 2008.

(I answer any and all questions that those wanting to write a crawler may have... :))

http://arachnode.net

arachnode dot net
A: 

Take a look at this post:

Java web crawler searcher robot that sends e-mail

It gives you a quick overview of how to crawl a website and look for information.

Leniel Macaferi
A: 

A .NET web crawler using HTMLAgilitypack: http://blog.abodit.com/2010/03/a-simple-web-crawler-in-c-using-htmlagilitypack/

Hightechrider