I'm trying to scrape data from IMDb, but naturally there are a lot of pages, and doing it in a serial fashion takes way too long, even when I use multi-threaded cURL.

Is there a faster way of doing it?

Yes, I know IMDb offers text files, but they don't offer everything in any sane fashion.

+1  A: 

I've done a lot of brute-force scraping with PHP and sequential processing seems to be fine. I'm not sure what "a long time" means to you, but I often do other stuff while it scrapes.

Typically nothing depends on my scraping in real time; it's the data that counts, and I usually scrape it and massage it at the same time.

Other times I'll use a crafty wget command to pull down a site and save it locally, then have a PHP script with some regex magic extract the data.

I use curl_* in PHP and it works very well.

You could have a parent job that forks child processes, providing them URLs to scrape, which they process and save the data from locally (db, fs, etc.). The parent is responsible for making sure the same URL isn't processed twice and that children don't hang.

Easy to do on Linux (pcntl_fork, etc.), harder on Windows boxes.
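As a rough illustration of that parent/child pattern, here's a minimal sketch assuming the pcntl extension is available (Linux/Unix only); the URL list, the child limit of 4, and the /tmp save path are made up for the example.

<?php
// Hypothetical list of pages to scrape and a cap on concurrent children
$urls = ['http://www.imdb.com/title/tt0111161/', 'http://www.imdb.com/title/tt0068646/'];
$maxChildren = 4;
$running = 0;

foreach ($urls as $url) {
    // Throttle: wait for a child to finish before forking a new one
    if ($running >= $maxChildren) {
        pcntl_wait($status);
        $running--;
    }

    $pid = pcntl_fork();
    if ($pid === -1) {
        die("Could not fork\n");
    } elseif ($pid === 0) {
        // Child: fetch one URL, save it locally, then exit
        $ch = curl_init($url);
        curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
        $html = curl_exec($ch);
        curl_close($ch);
        file_put_contents('/tmp/' . md5($url) . '.html', $html);
        exit(0);
    }
    // Parent: count the child and keep handing out URLs
    $running++;
}

// Reap any remaining children so none are left hanging
while ($running-- > 0) {
    pcntl_wait($status);
}

The parent here only deduplicates by virtue of the URL list being unique; in practice you'd track finished URLs in a database or file so a crashed run can resume without re-fetching.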

You could also add some logic to look at the last-modified time (which you previously stored) and skip scraping the page if the content hasn't changed or you already have it. There are probably a bunch of optimization tricks like that you could do.
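A sketch of that skip-if-unchanged idea using cURL's time-condition options; the URL and the previously stored timestamp are placeholders, and it only saves work if the server actually honors If-Modified-Since.

<?php
$url = 'http://www.imdb.com/title/tt0111161/';
$lastFetched = strtotime('2010-05-01 00:00:00'); // hypothetical timestamp you stored last run

$ch = curl_init($url);
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
// Ask the server to send the page only if it changed since $lastFetched
curl_setopt($ch, CURLOPT_TIMECONDITION, CURL_TIMECOND_IFMODSINCE);
curl_setopt($ch, CURLOPT_TIMEVALUE, $lastFetched);
$html = curl_exec($ch);
$code = curl_getinfo($ch, CURLINFO_HTTP_CODE);
curl_close($ch);

if ($code === 304 || $html === '') {
    // Not modified: skip re-parsing this page
} else {
    // New content: process it and store the new fetch time
}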

Mr-sk
Yep, forking sounds like the best option to me.
Tim
A: 

If you are already properly using cURL with curl_multi_add_handle and curl_multi_select, there is not much more you can do. You can test to find an optimal number of handles to process for your system: too few and you will leave your bandwidth unused, too many and you will lose too much time switching handles.
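For reference, a minimal sketch of that curl_multi batching; the batch size of 10 is just a starting point to tune per system, and the URL list and parsing step are placeholders.

<?php
$urls = []; // hypothetical list of IMDb pages to fetch
$batchSize = 10;

foreach (array_chunk($urls, $batchSize) as $batch) {
    $mh = curl_multi_init();
    $handles = [];

    foreach ($batch as $url) {
        $ch = curl_init($url);
        curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
        curl_multi_add_handle($mh, $ch);
        $handles[$url] = $ch;
    }

    // Drive all handles in the batch, waiting on curl_multi_select between steps
    do {
        $status = curl_multi_exec($mh, $active);
        if ($active) {
            curl_multi_select($mh);
        }
    } while ($active && $status === CURLM_OK);

    foreach ($handles as $url => $ch) {
        $html = curl_multi_getcontent($ch);
        // ... parse/store $html for $url here ...
        curl_multi_remove_handle($mh, $ch);
        curl_close($ch);
    }
    curl_multi_close($mh);
}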

You can try a master-worker multi-process pattern to have many script instances running in parallel, each one using cURL to fetch and later process a block of pages. Frameworks like http://gearman.org/?id=gearman_php_extension can help in creating an elegant solution, but using process control functions on Unix or calling your script in the background (either via the system shell or over non-blocking HTTP) can also work well.
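A bare-bones sketch of that master-worker setup with the Gearman PHP extension, assuming a gearmand server is running on the default localhost port; the job name 'scrape_page', the save path, and the URL list are made up for the example.

<?php
// worker.php - run several of these in parallel; each handles one URL at a time
$worker = new GearmanWorker();
$worker->addServer(); // assumes gearmand on localhost
$worker->addFunction('scrape_page', function (GearmanJob $job) {
    $url = $job->workload();
    $ch = curl_init($url);
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
    $html = curl_exec($ch);
    curl_close($ch);
    file_put_contents('/tmp/' . md5($url) . '.html', $html);
});
while ($worker->work());

<?php
// master.php - queues one background job per URL
$client = new GearmanClient();
$client->addServer();
$urls = []; // your list of IMDb pages
foreach ($urls as $url) {
    $client->doBackground('scrape_page', $url);
}

With this split, the master stays trivial and you scale by simply starting more copies of worker.php.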

Goran Rakic