Hi All,

I want to crawl the web for specific things: events that are taking place, such as concerts, movies, art gallery openings, and so on. Anything that one might spend time going to.

How do I implement a crawler?

I have heard of Grub (grub.org -> Wikia) and Heritrix (http://crawler.archive.org/).

Are there others?

What opinions does everyone have?

-Jason

+5  A: 

There's a good book on the subject I can recommend called Webbots, Spiders, and Screen Scrapers: A Guide to Developing Internet Agents with PHP/CURL.

Bill the Lizard
+3  A: 

I think the webcrawler part will be the easiest part of the task. The hard part will be deciding which sites to visit and how to discover events on the sites that you want to visit. Maybe you want to see about using either the Google or Yahoo API to get the data you want. They've already done the work of crawling a lot of pages on the internet--you can focus on the, to my mind anyway, much tougher problem of sifting the data to get the events you're looking for.

Onorio Catenacci
A: 

Is there a language-specific requirement?

I spent some time playing around with the Chilkat Spider libs for .NET a while back for personal experimentation.

Last I checked, their spider libs were licensed as freeware (although not open source as far as I know :( ).

It seems they have Python libs too.

http://www.example-code.com/python/pythonspider.asp #Python
http://www.example-code.com/csharp/spider.asp #.Net

Fusspawn
+2  A: 

Whatever you do, please be a good citizen and obey the robots.txt file. You might want to check the references on the Wikipedia page on focused crawlers. I just realized that I know one of the authors of Topical Web Crawlers: Evaluating Adaptive Algorithms. Small world.
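
As a rough illustration of obeying robots.txt, here is a minimal sketch using Python's standard `urllib.robotparser`. The robots.txt content, the `event-crawler` user agent name, and the example.com URLs are all made up for the example; a real crawler would download the file from the target host instead of hard-coding it.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content. A real crawler would fetch it from
# https://<host>/robots.txt, for example with set_url() and read().
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
Crawl-delay: 5
"""

USER_AGENT = "event-crawler"  # hypothetical crawler name


def build_parser(robots_txt):
    """Parse robots.txt text into a RobotFileParser."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


parser = build_parser(ROBOTS_TXT)

# Ask before every fetch whether the URL is allowed for our user agent.
allowed = parser.can_fetch(USER_AGENT, "https://example.com/events/today")
blocked = parser.can_fetch(USER_AGENT, "https://example.com/private/admin")

# Crawl-delay is optional; crawl_delay() returns None when it is absent.
delay = parser.crawl_delay(USER_AGENT)
```

The idea is to call `can_fetch` before each request and to sleep for at least `delay` seconds between requests to the same host when the site asks for it.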

tvanfosson
+1  A: 

If you find that crawling the internet becomes too daunting a task, you may want to consider building an RSS aggregator and subscribing to RSS feeds for popular event sites like craigslist and upcoming.org.

Each of these sites provides localized, searchable events. RSS gives you a few standardized formats to handle instead of having to parse all the malformed HTML that makes up the web.

There are open source libraries like ROME (Java) that may help with consuming RSS feeds.
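
To give a feel for how little work a well-formed feed needs, here is a small sketch that reads RSS 2.0 items with Python's standard `xml.etree.ElementTree` (not ROME). The sample feed, its URLs, and the `parse_items` helper are invented for illustration; a real aggregator would download each feed and would probably want a dedicated feed library to cope with Atom and sloppy feeds.

```python
import xml.etree.ElementTree as ET

# Hypothetical RSS 2.0 feed; real feeds would be downloaded from the event sites.
SAMPLE_FEED = """<?xml version="1.0"?>
<rss version="2.0">
  <channel>
    <title>Sample events</title>
    <item>
      <title>Jazz night</title>
      <link>https://example.com/events/1</link>
      <pubDate>Fri, 05 Jun 2009 20:00:00 GMT</pubDate>
    </item>
    <item>
      <title>Gallery opening</title>
      <link>https://example.com/events/2</link>
    </item>
  </channel>
</rss>
"""


def parse_items(feed_xml):
    """Return one dict per <item> element in an RSS 2.0 document."""
    root = ET.fromstring(feed_xml)
    items = []
    for item in root.iter("item"):
        items.append({
            "title": item.findtext("title", default=""),
            "link": item.findtext("link", default=""),
            # pubDate is optional in RSS 2.0, so this may be None.
            "published": item.findtext("pubDate"),
        })
    return items


events = parse_items(SAMPLE_FEED)
for event in events:
    print(event["title"], event["link"])
```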

Kevin Williams
A: 

Following on Kevin's suggestion of RSS feeds, you might want to check out Yahoo Pipes. I haven't tried them yet, but I think they allow you to process several RSS feeds and generate web pages or more RSS feeds.

Don Kirkby
Never use Pipes for anything big. It is not very reliable and pretty slow.
mixdev
+4  A: 
Fabian Steeg
Fantastic book.
Chris
A: 

Check out Scrapy. It's an open source web crawling framework written in Python (I've heard it's similar to Django, except that instead of serving pages it downloads them). It's easily extensible, distributed/parallel, and looks very promising.

I'd use Scrapy, because that way I could save my strength for something less trivial, like how to extract the correct data from the scraped content and insert it into a database.

Hannson
A: 

Nutch Crawler

bill
+1  A: 

Actually, writing a directed crawler that works at scale is quite a challenging task. I implemented one at work and maintained it for quite a while. There are a lot of problems that you don't know exist until you write one and hit them, specifically dealing with CDNs and crawling sites in a friendly way. Adaptive algorithms are very important, or you will trip DoS filters. Actually, you will trip them anyhow without knowing it if your crawl is big enough.
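
One simple form of adaptive behaviour is a per-host delay that grows when a site shows signs of stress and shrinks again when it recovers. The sketch below is only an illustration of that idea, not the crawler described above; the `HostThrottle` class, its thresholds, and the example URLs are all hypothetical.

```python
from urllib.parse import urlsplit


class HostThrottle:
    """Keeps a per-host delay: doubled on trouble, halved on success."""

    def __init__(self, base_delay=1.0, max_delay=60.0):
        self.base_delay = base_delay
        self.max_delay = max_delay
        self._delays = {}  # host -> current delay in seconds

    def delay_for(self, url):
        """Seconds to wait before the next request to this URL's host."""
        host = urlsplit(url).netloc
        return self._delays.get(host, self.base_delay)

    def record(self, url, status):
        """Update the host's delay from the HTTP status of the last response."""
        host = urlsplit(url).netloc
        current = self._delays.get(host, self.base_delay)
        if status == 429 or status >= 500:
            # Rate limited or server error: back off, up to max_delay.
            new_delay = min(current * 2, self.max_delay)
        else:
            # Healthy response: relax, but never below base_delay.
            new_delay = max(current / 2, self.base_delay)
        self._delays[host] = new_delay


throttle = HostThrottle(base_delay=1.0, max_delay=8.0)
for status in (503, 503, 503, 503):
    throttle.record("https://example.com/events", status)
print(throttle.delay_for("https://example.com/other"))  # same host, shared delay
```

The crawl loop would sleep for `delay_for(url)` before each request and call `record` afterwards. Keying on the host rather than the URL is the important part, since filters usually look at total traffic per site.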

Things to think about:

  • What's acceptable throughput?
  • How do you deal with site outages?
  • What happens if you are blocked?
  • Do you want to engage in stealth crawling (controversial and actually quite hard to get right)?
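
For the outage and blocking questions, one possible policy is sketched below: retry transient failures with exponential backoff, and stop when the site says no. The `next_action` function, the status codes treated as "blocked", and the attempt limit are assumptions chosen for the example rather than recommendations from experience with any particular site.

```python
def next_action(status, attempt, max_attempts=4, base_wait=2.0):
    """Decide what to do after a request.

    status  -- HTTP status code, or None if the request failed outright
               (timeout, DNS failure, connection refused).
    attempt -- 1 for the first try, 2 for the first retry, and so on.
    Returns a tuple (action, wait_seconds), where action is
    "ok", "retry", or "give_up".
    """
    if status is not None and 200 <= status < 300:
        return ("ok", 0.0)
    if status in (401, 403, 404, 410):
        # Blocked or gone: retrying would only annoy the site.
        return ("give_up", 0.0)
    if attempt >= max_attempts:
        return ("give_up", 0.0)
    # Outage or rate limiting: wait 2s, 4s, 8s, ... before the next try.
    return ("retry", base_wait * 2 ** (attempt - 1))


print(next_action(None, 1))  # network failure on the first attempt
print(next_action(503, 3))   # server error on the third attempt
print(next_action(403, 1))   # forbidden, treated here as blocked
```

A URL that ends in "give_up" can go onto a list to revisit hours or days later, which keeps a site outage from stalling the rest of the crawl.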

I have actually written some material on crawler construction that I might put online if I ever get around to it, since building a proper one is much tougher than people will tell you. Most of the open source crawlers work well enough for most people, so if you can, I recommend you use one of those. Which one is a feature/platform choice.

Steve