I know we can grab information from any site (with PHP) and build our own site from it.

I'm talking about parsing additional content such as movie information (dates, budget, people, etc.) or video file properties from YouTube (size, duration).

I'm interested in implementing the grabbing process for big sites and large amounts of information.

There seem to be several problems:

  1. Script execution time. We could write a script that walks through the pages one after another and pushes the content into our MySQL database, but with a large number of pages the run will take longer than ordinary hosting allows (usually about 30 seconds), so the script will die at some point.
  2. Memory usage. The script will consume a lot of memory while parsing a large number of pages.
  3. Anti-DDoS protection(?) on the target site (too many requests from one IP address).

The main point of this question is how to get around all these obstacles and build a long-running script (one that can work all day long) without errors.

Are there any other pitfalls we might hit during the process?

Your thoughts?

A: 

I'm talking about parsing additional content such as movie information (dates, budget, people, etc.) or video file properties from YouTube (size, duration).

Both IMDb and YouTube have APIs for getting data from their sites, so there is no need to scrape.

iggnition
Some sites don't provide one; that's why we need scraping in some cases. Most users will just copy and paste, but I prefer to write a script and have no headache.
Happy
See my answer: any solution comes down to whether it is legal or not, and then to a simple question of volume. Your question clearly states that you plan to grab large volumes of data, so the first port of call has to be contacting the sites in question. If you can't or won't do that, then you have to expect your IP address / scraping application to be blocked by all of those sites.
Paul Hadfield
+2  A: 

I will answer this assuming that what you are doing is legal and is going to add value to data that is already readily available. If that is the case, you can contact the sites in question and speak to them to confirm that your screen scraping won't get blocked as a DoS attack. You can give them your IP addresses, etc. and everything will be fine.

There are many ways to make sure your process won't time out or use too much memory. That just comes down to the design of your system. If the content of your site won't be original, please try to make the solution your own at least :) However, if you run into specific issues during your implementation, I'm sure you could get answers to focused questions.
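To illustrate one such design (a minimal sketch under stated assumptions, not a definitive implementation): work through the pages in small batches and record progress in the database, so each run finishes well inside the hosting time limit and the next run picks up where the previous one stopped. It is written in Python, with the standard-library `sqlite3` module standing in for MySQL. The `queue` and `pages` tables, `run_batch`, and the `fetch` callback are all hypothetical names; the same pattern should carry over to a PHP script started repeatedly by cron.

```python
import sqlite3
import time


def init_db(conn):
    """Create the (hypothetical) queue and result tables if missing."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS queue ("
        "url TEXT PRIMARY KEY, done INTEGER NOT NULL DEFAULT 0)"
    )
    conn.execute(
        "CREATE TABLE IF NOT EXISTS pages (url TEXT PRIMARY KEY, body TEXT)"
    )
    conn.commit()


def enqueue(conn, urls):
    """Add URLs to the queue; URLs already present are left untouched."""
    conn.executemany(
        "INSERT OR IGNORE INTO queue (url) VALUES (?)", [(u,) for u in urls]
    )
    conn.commit()


def run_batch(conn, fetch, max_pages=10, time_budget=20.0, clock=time.monotonic):
    """Process pending URLs until the page or time budget is used up.

    Returns the number of pages stored in this run. Only one page is held
    in memory at a time, and progress is committed after every page, so a
    killed run loses at most the page it was working on.
    """
    started = clock()
    processed = 0
    while processed < max_pages and clock() - started < time_budget:
        row = conn.execute(
            "SELECT url FROM queue WHERE done = 0 ORDER BY url LIMIT 1"
        ).fetchone()
        if row is None:  # nothing left to do
            break
        url = row[0]
        body = fetch(url)  # assumed to return the page text
        conn.execute(
            "INSERT OR REPLACE INTO pages (url, body) VALUES (?, ?)", (url, body)
        )
        conn.execute("UPDATE queue SET done = 1 WHERE url = ?", (url,))
        conn.commit()
        processed += 1
    return processed
```

Each scheduled run calls `run_batch` once and exits. Because the queue lives in the database rather than in memory, neither the 30-second limit nor the total number of pages affects a single run.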

Edit for clarification

My answer to your question is

1) Check with the sites you wish to scrape. If they have no problem with it, they will not block your IP address; you can arrange with them a way to make sure this does not happen. Either use a static IP address or, if the IP address you use may change, agree on a particular user agent string.

2) Once you've done (1), start developing a solution. Execution time, etc. shouldn't be a problem, so if you encounter particular issues with your solution as you are coding it, come back to Stack Overflow with a question focused on that one issue.
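As a rough sketch of what the arrangement in (1) could look like in code, the example below sends an agreed user agent string with every request and enforces a minimum delay between requests so the traffic is not mistaken for an attack. It is Python using only the standard library. The `USER_AGENT` value, `build_request`, `Throttle`, and `fetch` are hypothetical names chosen for illustration, and the actual string and delay would be whatever you agree with the site owner.

```python
import time
import urllib.request

# Hypothetical identifier; use whatever string the site owner agreed to.
USER_AGENT = "ExampleMovieBot/1.0 (contact: admin@example.com)"


def build_request(url, user_agent=USER_AGENT):
    """Return a urllib request that carries the agreed User-Agent header."""
    return urllib.request.Request(url, headers={"User-Agent": user_agent})


class Throttle:
    """Enforce a minimum delay between consecutive requests."""

    def __init__(self, min_interval, clock=time.monotonic, sleep=time.sleep):
        self.min_interval = min_interval
        self._clock = clock
        self._sleep = sleep
        self._last = None  # time of the previous request, if any

    def wait(self):
        """Sleep just long enough to respect min_interval, then record now."""
        now = self._clock()
        if self._last is not None:
            remaining = self.min_interval - (now - self._last)
            if remaining > 0:
                self._sleep(remaining)
                now = self._clock()
        self._last = now


def fetch(url, throttle, timeout=10):
    """Fetch one page politely: wait for the throttle, then send the request."""
    throttle.wait()
    with urllib.request.urlopen(build_request(url), timeout=timeout) as resp:
        return resp.read().decode("utf-8", errors="replace")
```

A single `Throttle(2.0)` shared by all calls to `fetch` would keep the script to at most one request every two seconds.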

To be clear, if you cannot or will not contact the sites you wish to scrape, please tell us all now.

Paul Hadfield
There are many ways - any examples on given questions?
Happy
Sorry, I'm guessing that English isn't your primary language, but your comment does not make sense. I have answered your question (several times). 1) Check with the sites you wish to scrape. If they have no problem with it, they will not block your IP address; you can arrange with them a way to make sure this does not happen (maybe agree on a particular user agent string if your IP address may change). If you cannot do this, then don't waste any more time on this question. 2) Then start developing a solution yourself and come back with any issues as you encounter them.
Paul Hadfield
A: 

As @paulHadfield said, before you do anything, you need to ask the owner(s) of the website you want to scrape so you won't be mistaken for a DoS attack.

And what exactly are you trying to store in MySQL?

James Eggers