I have to run a scraping task to collect data for my App Engine (Java) app.

I'm not sure which is better: scraping the data in development mode and uploading it to production, or scraping it while the app is running in production.

Does it make a difference?

Are there any difficulties with bringing large quantities of data from one environment to the other (dev->prod or prod->dev)?

+4  A: 

The dev server itself probably isn't a great scraping tool: it's single-threaded, and (at least for Python; the Java implementation might be drastically different) its datastore performs poorly when storing large amounts of data.

However, depending on what you're scraping, the production servers might not be well suited to the task either: if a site can take longer than 10 seconds to respond to a request, the urlfetch API will time out. If you can be sure that this won't be a problem, it's probably more convenient to do the scraping in production and write directly to the datastore.
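To get a feel for whether a target site stays under that limit, here is a minimal sketch in plain Java that fetches a page with an explicit deadline. The class name `TimedFetch` and the choice to return `null` on a timeout are illustrative assumptions, not part of any App Engine API; it uses only the standard `java.net` classes, so it can be run from a standalone tool before committing to scraping in production.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.net.HttpURLConnection;
import java.net.SocketTimeoutException;
import java.net.URI;

// Hypothetical helper: fetches a URL with a caller-supplied deadline.
final class TimedFetch {

    /**
     * Returns the response body as UTF-8 text, or null if connecting or
     * reading exceeded timeoutMillis. Other I/O errors are propagated.
     */
    static String fetch(String url, int timeoutMillis) throws IOException {
        HttpURLConnection conn =
                (HttpURLConnection) URI.create(url).toURL().openConnection();
        conn.setConnectTimeout(timeoutMillis); // limit for establishing the connection
        conn.setReadTimeout(timeoutMillis);    // limit for each blocking read
        try (InputStream in = conn.getInputStream()) {
            ByteArrayOutputStream out = new ByteArrayOutputStream();
            byte[] buf = new byte[8192];
            int n;
            while ((n = in.read(buf)) != -1) {
                out.write(buf, 0, n);
            }
            return out.toString("UTF-8");
        } catch (SocketTimeoutException e) {
            // The site was too slow; report that to the caller instead of failing.
            return null;
        } finally {
            conn.disconnect();
        }
    }
}
```

Calling `TimedFetch.fetch(url, 10000)` against a sample of the pages you plan to scrape gives a rough idea of how many would hit a 10-second limit; a `null` result marks a page that was too slow.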

If not, it might make sense to do the scraping with a standalone tool and then put the data into the production datastore either with a RESTful web service or the remote API.
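As a rough sketch of the standalone-tool route, the following plain Java class splits scraped records into small batches and POSTs each batch to a web service in the production app. Everything specific here is an assumption for illustration: the class name `BatchUploader`, the newline-delimited text payload, and whatever endpoint URL you pass in. The receiving handler in your app would have to be written to match.

```java
import java.io.IOException;
import java.io.OutputStream;
import java.net.HttpURLConnection;
import java.net.URI;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

// Hypothetical uploader for pushing locally scraped records to the app.
final class BatchUploader {

    /** Splits records into consecutive chunks of at most batchSize items. */
    static List<List<String>> batches(List<String> records, int batchSize) {
        if (batchSize <= 0) {
            throw new IllegalArgumentException("batchSize must be positive");
        }
        List<List<String>> out = new ArrayList<>();
        for (int i = 0; i < records.size(); i += batchSize) {
            int end = Math.min(i + batchSize, records.size());
            out.add(new ArrayList<>(records.subList(i, end)));
        }
        return out;
    }

    /** POSTs one batch as newline-delimited text and returns the HTTP status code. */
    static int post(String endpoint, List<String> batch) throws IOException {
        byte[] body = String.join("\n", batch).getBytes(StandardCharsets.UTF_8);
        HttpURLConnection conn =
                (HttpURLConnection) URI.create(endpoint).toURL().openConnection();
        conn.setRequestMethod("POST");
        conn.setDoOutput(true);
        conn.setRequestProperty("Content-Type", "text/plain; charset=utf-8");
        try (OutputStream os = conn.getOutputStream()) {
            os.write(body);
        }
        try {
            return conn.getResponseCode();
        } finally {
            conn.disconnect();
        }
    }
}
```

Keeping each batch small means every request to the app finishes quickly, and a failed batch can be retried without resending everything.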

Wooble
@Wooble +1 the dev server is a toy.
systempuntoout
A: 

Look at this question on how to configure the remote API for Java so that you can use the Python bulk data loader. You can also write a custom loader.

Eugene Kuleshov
+1  A: 

I find that spiders running in production often time out. Your solution of using the dev server is a good one, but also consider running each fetch as its own task through the task queue.

vonkohorn