I need to scrape (using scrAPI) 400+ web pages in Ruby. My actual code is very sequential:

data = urls.map {|url| scraper.scrape url }

The actual code is a bit different (exception handling and such).

How can I make it faster? How can I parallelize the downloads?

+5  A: 
require 'thread'  # Mutex lives here on older Rubies

th = []
data = []
dlock = Mutex.new

urls.each do |url|
  # pass url as a thread argument so each thread captures its own copy
  th << Thread.new(url) do |u|
    d = scraper.scrape u
    dlock.synchronize { data << d }  # serialize writes to the shared array
  end
end

th.each { |t| t.join }  # wait for every download to finish

Tada! (Caution: written from memory, not tested, may eat your kitten, etc.)
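One caveat: the above spawns one thread per URL, so with 400+ pages you get 400+ simultaneous threads. A bounded worker pool keeps the concurrency sane and gives you a natural place for the exception handling the question mentions. A minimal sketch using the stdlib Queue, assuming scraper.scrape is thread-safe (N_WORKERS is a made-up knob, tune it for your bandwidth):

require 'thread'

N_WORKERS = 8  # hypothetical pool size
queue = Queue.new
urls.each { |url| queue << url }  # load all work up front

data  = []
dlock = Mutex.new

workers = Array.new(N_WORKERS) do
  Thread.new do
    loop do
      url = queue.pop(true) rescue break  # non-blocking pop; break once drained
      begin
        d = scraper.scrape url
        dlock.synchronize { data << d }
      rescue => e
        warn "scrape failed for #{url}: #{e.message}"  # keep going on errors
      end
    end
  end
end

workers.each { |t| t.join }

Because the queue is fully loaded before the workers start, the non-blocking pop cleanly signals each worker when the work runs out.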

Edit: I figured someone must have written a generalised version of this, and so they have: http://peach.rubyforge.org/ -- enjoy!
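For reference, peach patches Enumerable with parallel iterators, so the whole loop collapses to a one-liner; a minimal sketch, assuming the gem's pmap is a drop-in parallel map:

require 'peach'  # gem install peach

data = urls.pmap { |url| scraper.scrape url }  # runs the block across threads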

womble
A: 

This is pretty much an example used in the Pickaxe explanation of threading:

http://www.rubycentral.com/pickaxe/tut_threads.html

You should be able to adapt the Pickaxe code trivially to use your scraper.
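In case that link rots, the core of the pattern is tiny: spawn a thread per URL and harvest results with Thread#value, which joins the thread and returns (or re-raises) whatever its block produced. A minimal sketch, assuming scraper is safe to share across threads:

threads = urls.map { |url| Thread.new { scraper.scrape(url) } }
data = threads.map { |t| t.value }  # join each thread and collect its result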

runako