I am building a crawler that fetches information in parallel from a number of websites in real-time in response to a request for this information from a client. I need to request specific pages from 10-20 websites, parse their contents for specific snippets of information and return this information to the client as fast as possible. I want to do it asynchronously, so the client gets the first result displayed as soon as it is ready, while the other requests are still pending.

I have a Ruby background, and would therefore prefer to build the solution in Ruby. However, parallelism and speed are exactly what Ruby is known NOT to excel at. I believe that libraries such as EventMachine and Typhoeus can remedy that, but I am also strongly considering node.js, because I know JavaScript quite well and it seems to be built for exactly this kind of thing.
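For reference, the Typhoeus route might look something like the sketch below (urls, extract_snippet and push_to_client are hypothetical placeholders); each on_complete callback fires as soon as its response arrives, so the first result can go out while the other requests are still pending:

    require 'typhoeus'

    hydra = Typhoeus::Hydra.new(max_concurrency: 20)

    urls.each do |url|
      request = Typhoeus::Request.new(url, followlocation: true)
      request.on_complete do |response|
        if response.success?
          snippet = extract_snippet(response.body)  # hypothetical parser
          push_to_client(snippet)                   # hypothetical delivery step
        end
      end
      hydra.queue(request)
    end

    hydra.run  # blocks until all requests finish; callbacks fire per response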

Whatever I choose, I also need an efficient way to communicate the results back to the client. I am considering plain AJAX (but that would require polling the server), WebSockets (but those would require a fallback for older browsers), and dedicated solutions for persistent client/server communication such as Cramp, Juggernaut, and Pusher.
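To make the plain-AJAX option concrete, the polling endpoint could be as simple as the Sinatra sketch below (the in-memory RESULTS store is a hypothetical stand-in; the crawler would append snippets under a job id as each site finishes):

    require 'sinatra'
    require 'json'
    require 'thread'

    # Hypothetical in-memory store: the crawler appends parsed snippets
    # under a job id as each site finishes.
    RESULTS = Hash.new { |hash, key| hash[key] = [] }
    LOCK    = Mutex.new

    # The client polls this endpoint and renders whatever has arrived so
    # far, so the first result displays while other requests are pending.
    get '/results/:job_id' do
      content_type :json
      LOCK.synchronize { RESULTS[params[:job_id]].dup }.to_json
    end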

Does anyone have any experience and/or recommendations they would like to share?

+1  A: 

Node is definitely capable of handling this type of task: async socket and HTTP communication is baked in and really pleasant to work with.

Most of my work is j/Ruby, and I have found the transition to server-side JavaScript pretty painless: years of web dev mean I know JS pretty well, and the server-side development concepts are largely the same regardless of language.

In terms of communication, Socket.IO is an excellent client and server framework for handling socket communication in Node: it supports Flash, AJAX, and WebSocket transports, which means it can be used with just about any modern (and some older) browser.

Toby Hede
+1 for JRuby, since it can handle true multi-threading, though I suppose 1.9 would work as well (see the thread-per-URL sketch below).
rogerdpack
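To illustrate the comment above: a minimal thread-per-URL sketch (urls is assumed to be defined elsewhere). Under JRuby the threads run in parallel on real OS threads; MRI 1.9 serializes Ruby code with its global lock, but releases it during blocking IO:

    require 'net/http'
    require 'uri'

    threads = urls.map do |url|
      Thread.new do
        body = Net::HTTP.get(URI.parse(url))
        # ... parse body for the wanted snippet here ...
      end
    end
    threads.each(&:join)  # wait for all fetches to finish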
A: 

If your crawler needs JavaScript support, I recommend HtmlUnit: http://htmlunit.sourceforge.net/.
There is a JRuby wrapper, Celerity, available at http://celerity.rubyforge.org/; a short usage sketch follows the feature list below.

Features (taken from the site) include:

  • Fast - No time-consuming GUI rendering or unessential downloads
  • Easy to use - Simple API
  • JavaScript support
  • Scalable - Java threads let you run tests in parallel
  • Portable - Cross-platform thanks to the JVM
  • Unintrusive - No browser window interrupting your workflow (runs in background)
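
A minimal usage sketch (JRuby only; the URL and element selector below are hypothetical):

    require 'celerity'

    browser = Celerity::Browser.new
    browser.goto('http://example.com/some-page')  # hypothetical URL

    # Celerity exposes a Watir-style API for locating elements:
    puts browser.div(:id, 'price').text           # hypothetical element id

    browser.close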
z5h