I'm scraping a website (responsibly: I'm throttling my requests and have the site's permission), and I'm going to be gathering statistics on 300,000 users.

I plan on storing this data in a SQL database, and I plan on scraping it once a week. My question is: how often should I do inserts on the database as results come in from the scraper?

Is it best practice to wait until all results are in (keeping them all in memory) and insert them all once the scraping is finished? Or is it better to do an insert for every single result as it comes in (they arrive at a decent rate)? Or something in between?

If someone could point me in the right direction on how often/when I should be doing this, I would appreciate it.

Also, would the answer change if I were storing these results in a flat file instead of a database?

Thank you for your time!

+3  A: 

You might get a performance increase by batching up several hundred rows per statement, if your database supports inserting multiple rows per query (both MySQL and PostgreSQL do). You'll also probably gain performance by batching multiple inserts into one transaction (except on transactionless storage engines, such as MySQL with MyISAM).
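For illustration, here's a minimal sketch of both techniques using Python's built-in sqlite3 module; the user_stats table and its columns are made up, but the multi-row VALUES syntax is the same in MySQL and PostgreSQL:

```python
import sqlite3

# Self-contained demo database; swap in your own driver/connection.
conn = sqlite3.connect("stats.db")
conn.execute("CREATE TABLE IF NOT EXISTS user_stats (username TEXT, score INTEGER)")

rows = [("alice", 10), ("bob", 20), ("carol", 30)]  # one scraped batch

# One INSERT statement carrying several rows, inside one transaction.
placeholders = ", ".join(["(?, ?)"] * len(rows))
flat_params = [value for row in rows for value in row]
with conn:  # commits on success, rolls back on exception
    conn.execute(
        f"INSERT INTO user_stats (username, score) VALUES {placeholders}",
        flat_params,
    )
conn.close()
```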

The benefits of batching fall off rapidly as the batch size increases; by the time you're doing 100 rows at a time, you've already cut the per-query/per-commit overhead by 99%. Go much larger and you start running into various limits (for example, the maximum allowed query length).

You also run into another large tradeoff: if your program dies, you'll lose any work you haven't yet saved to the database. Losing 100 rows isn't so bad; you can probably redo that work in a minute or two. Losing 300,000 would take quite a while to redo.

Summary: Personally, I'd start with one value per query, as it's the easiest to implement. If I found insert time was a bottleneck (I very much doubt it; the scrape will be far slower), I'd move to 100 values per query.

PS: Since the site admin has given you permission, have you asked whether you can just get a database dump of the relevant data? It would save a lot of work!

derobert
+1 or at least an XML data dump as a web service...
Bill Karwin
+1  A: 

My preference is to write bulk data to the database every 1,000 rows when I have to do it the way you're describing. That seems like a good volume: not too much re-work if I do have a crash and need to regenerate some data (re-scraping, in your case), but a healthy enough chunk to cut the overhead.

As @derobert points out, wrapping a bunch of inserts in a transaction also helps reduce overhead. But don't put everything in a single transaction: some RDBMS vendors, such as Oracle, maintain a "redo log" during a transaction, so doing too much work in one can cause congestion. Breaking the work into large, but not too large, chunks is best; 1,000 rows is about right.
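A rough sketch of that chunking pattern in Python (sqlite3 again standing in for whatever database you're using, and the hypothetical user_stats table from the earlier sketch assumed to exist):

```python
BATCH_SIZE = 1000  # the chunk size suggested above

def flush(conn, buffer):
    """Insert one buffered chunk inside a single transaction."""
    with conn:  # one commit per chunk keeps each transaction bounded
        conn.executemany(
            "INSERT INTO user_stats (username, score) VALUES (?, ?)", buffer
        )
    buffer.clear()

def store_results(conn, results):
    """results: any iterable of (username, score) tuples from the scraper."""
    buffer = []
    for row in results:
        buffer.append(row)
        if len(buffer) >= BATCH_SIZE:
            flush(conn, buffer)  # at most BATCH_SIZE rows are lost on a crash
    if buffer:
        flush(conn, buffer)      # flush the final partial chunk
```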

Some SQL implementations support multi-row INSERT (@derobert also mentions this), but some don't.

You're right that flushing raw data to a flat file and batch-loading it later is probably worthwhile. Each SQL vendor supports this kind of bulk load differently, for instance LOAD DATA INFILE in MySQL or ".import" in SQLite. You'll have to tell us what brand of SQL database you're using to get more specific instructions, but in my experience this kind of mechanism can give 10-20x the performance of INSERT, even after improvements like transactions and multi-row inserts.
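As a rough sketch of the staging step in Python (the file name and columns are hypothetical, and the LOAD DATA statement shown is MySQL-specific; it needs the FILE privilege and a path the server can read):

```python
import csv

# Stage scraped rows to a flat file first, then hand the whole
# file to the database's bulk loader in one shot.
rows = [("alice", 10), ("bob", 20), ("carol", 30)]
with open("user_stats.csv", "w", newline="") as f:
    csv.writer(f).writerows(rows)

# You would then run something like this through your MySQL client or driver:
load_sql = """
LOAD DATA INFILE 'user_stats.csv'
INTO TABLE user_stats
FIELDS TERMINATED BY ','
(username, score)
"""
```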


Re your comment: you might want to take a look at BULK INSERT in Microsoft SQL Server. I don't usually use Microsoft products, so I don't have first-hand experience with it, but I assume it's the right tool for your scenario.
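To sketch what that could look like (BULK INSERT reads a file visible to the server machine itself; the pyodbc driver, connection string, and file path here are only assumptions for illustration):

```python
import pyodbc  # assumption: any DB-API driver for SQL Server would do

# T-SQL bulk load from a staged CSV file; the path is hypothetical and
# must be readable by the SQL Server process, not just by this script.
BULK_SQL = r"""
BULK INSERT user_stats
FROM 'C:\staging\user_stats.csv'
WITH (FIELDTERMINATOR = ',', ROWTERMINATOR = '\n')
"""

conn = pyodbc.connect(
    "DRIVER={ODBC Driver 17 for SQL Server};"
    "SERVER=localhost;DATABASE=scrape;Trusted_Connection=yes;"
)
conn.cursor().execute(BULK_SQL)
conn.commit()
conn.close()
```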

Bill Karwin
I'm using MSSQL, so I'm not sure how that will affect things.
Mithrax