views: 387
answers: 4

In Episode 78 of the Joel & Jeff podcast, one of the Doctype/Litmus guys states that you would never want to build a spider in Ruby. Would anyone like to guess at his reasoning for this?

+1  A: 

You wouldn't get the desired performance out of Ruby. See the referenced link: http://blog.dhananjaynene.com/2008/07/performance-comparison-c-java-python-ruby-jython-jruby-groovy/

While performance tests like these should be taken with a grain of salt, there is a considerable difference between Ruby and the fastest languages in that comparison.

Edit: Shame on me for answering a loaded question. All in all, choosing a language is a series of trade-offs, spanning from raw performance to personal preference about what you are productive in. The beauty of programming is that all of these languages are available for you to use, so you can test what works best against the requirements of your project. My recommendation is to experiment and see what works best for you.

OG
I thought the speed may be an issue, though why does a spider need to be fast? Surely a lot of what they are doing is network bound? What would be a more suitable language?
Ben
A decent spider will multitask (multi-thread) so that while some thread is waiting for a server to wake up, others will be busy querying. Depending on your Internet connection, you should be able to max out your CPU, especially since hopefully you'll be doing something meaningful and computationally demanding with page content once you've downloaded it.
Carl Smotricz
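A minimal sketch of that multi-threaded fetch pattern in Ruby (the seed URLs and worker count below are made up for illustration; a real spider would keep a growing frontier and respect robots.txt):

    require 'net/http'
    require 'uri'

    # Hypothetical seed URLs; a real spider keeps adding newly discovered links.
    queue = Queue.new
    %w[https://example.com/a https://example.com/b https://example.com/c].each { |u| queue << u }
    queue.close   # no more work will be added in this toy example

    workers = 4.times.map do
      Thread.new do
        while (url = queue.pop)            # pop returns nil once the closed queue is empty
          body = Net::HTTP.get(URI(url))   # this thread blocks on network I/O; others keep fetching
          # ...parse the body, extract links, index content (the CPU-bound part)...
          puts "#{url}: #{body.bytesize} bytes"
        end
      end
    end

    workers.each(&:join)

Worth noting: in MRI the global interpreter lock lets threads overlap nicely while they wait on the network, but not during CPU-heavy parsing, which is part of the performance concern discussed above.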
And the specific Josephus algorithm that he was profiling doesn't, on its face, have a lot to do with the spider problem. Performance may be the reason, and Ruby may be too slow, but this particular link has nothing to do with that (as the author is at pains to say at the top of the post). Spidering is dominated by network traffic, while the Josephus problem is dominated by counting.
Rob Napier
I'm sure many contemporary spiders are written in C/C++, but Java, especially using nio (channels/select), should be fairly suitable too. Scala performs well and is more, umm, sexy. The Windows crowd may favor C#. There may be JIT interpreters that make Perl fast too, but I'm not sure. Some rarer geniuses might use Haskell, Erlang or a Lisp dialect. If this sounds like "anything but Ruby," that's kinda close to the truth.
Carl Smotricz
I don't know the specific operations involved in web spiders, but I know the entire text of each document has to be parsed, which on its own takes considerable CPU time at the scale a web spider has to deal with. I'd assume something like C, C++, C#, or Java would be faster.
OG
OG, I'm sorry to have asked a loaded question. I did not really intend to do so. I was genuinely interested in why Ruby was not a good choice. Though I knew about the Ruby speed thing, I thought it could equally have been a lack of standard library support or some other issue.
Ben
+1  A: 

What OG said. In simpler terms, Ruby is dog slow, and if you're looking to get a lot done per unit of time, it's the wrong choice of language.

Carl Smotricz
+6  A: 

Just how fast does a crawler need to be, anyhow? It depends upon whether you're crawling the whole web on a tight schedule, or gathering data from a few dozen pages on one web site.

With Ruby and the nokogiri library, I can read this page and parse it in 0.01 seconds. Using XPath to extract data from the parsed page, I can turn all of the data into domain-specific objects in 0.16 seconds. All 223 rows.
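For flavor, a rough sketch of that kind of single-page scrape with nokogiri; the URL, XPath query, and record shape below are placeholders, not the ones actually used in the answer:

    require 'net/http'
    require 'uri'
    require 'nokogiri'   # gem install nokogiri

    # Placeholder page and query -- the answer doesn't show the real ones.
    html = Net::HTTP.get(URI('https://example.com/some-table-page'))
    doc  = Nokogiri::HTML(html)

    Record = Struct.new(:cells)   # stand-in for a domain-specific object
    records = doc.xpath('//table//tr').map do |tr|
      Record.new(tr.xpath('./td').map { |td| td.text.strip })
    end

    puts records.size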

I am running into fewer and fewer problems where the traditional constraints (cpu/memory/disk) matter. This is an age of plenty. Where resources are not a constraint, don't ask "what's better for the machine." Ask "what's better for the human?"

Wayne Conrad
A crawler doesn't need to be very fast if you're only looking at a single page, but there's a reason Google still uses C. When you multiply a couple milliseconds of savings across a million machines over and over again, it quickly starts adding up.
kejadlen
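To put rough, purely illustrative numbers on that: shaving 5 ms off each of a billion page fetches saves about 5,000,000 seconds of machine time, roughly 58 machine-days, for every billion-page crawl.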
+1  A: 

In my opinion it's just a matter of scale. If you're writing a simple scraper for your own personal use, or just something that will run on a single machine a couple of times a day, then you should choose something that involves less code, effort, and maintenance pain. Whether that's Ruby is a different question (I'd pick Groovy over Ruby for this task: better threading + very convenient XML parsing). If, on the other hand, you're scraping terabytes of data per day, then the throughput of your application is probably more important than a shorter development time.

BTW, anyone who says you would never want to use some technology in some context or another is most probably wrong.

psyho